Using Data Mining to Determine Car Dealer Auction Behavior
Edward Egros
Candidate for Departmental Distinction in Economics
Advisor: Prof. Tom Fomby
Data Courtesy of DaimlerChrysler Financial Services
Presented May 11, 2006
Table of Contents
Section 1) Introduction
Section 2) Multiple Linear Regression
Section 3) Regression Tree
Section 4) Neural Network
Section 5) Ensemble Method
Section 6) Conclusions
Section 7) Appendices
Section 8) Works Cited
Abstract: DaimlerChrysler Financial Services hopes to better predict how dealers will bid for its used vehicles at auctions. These predictions will help company executives gain a better sense of the true value of their auctioned vehicles and how much revenue they stand to earn. Because vehicle gross proceeds can take on virtually any dollar amount, data mining offers a number of continuous dependent variable models to help forecast bidding. This paper explores four of these models: a multiple linear regression, a regression tree, an artificial neural network and an Ensemble Method that uses all three of the previous models. The data for this study include a number of explanatory characteristics for each vehicle, including the mileage on the vehicle at the time of auction, the original manufacturer's suggested retail price and the time the vehicle was sold. The models use these characteristics to predict the gross proceeds for these vehicles. This study concludes that the Ensemble Method best forecasts gross proceeds for a subset of DaimlerChrysler used vehicles.
1) Introduction
DaimlerChrysler Financial Services uses many outlets to sell its used vehicles.
One of the outlets that requires the most research is the wholesale dealer auction. Any
vehicle that is either repossessed or whose customer lease has expired goes into a pool of
vehicles. The company then assigns these vehicles to auction sites all over the country,
where auction representatives take them onto their lots. There, at the auction
representative’s discretion, vehicles are detailed, meaning any blemishes or malfunctions
are repaired to increase the vehicle’s value. On sale day, these vehicles are then placed
on the auction line and dealers attend the auction to bid on these vehicles. The dealer
with the highest bid then buys the vehicle. However, if the highest bid is not high enough
according to the auction representative, the representative can call a "no sale" and refuse
to sell the vehicle to that dealer. The vehicle can then be carried over to the
next sale day or released from the auction site in some capacity, such as to a salvage yard
or to a third party.
Auction representatives want to make sure that vehicles they sell will receive a
fair value. The most efficient way of assuring fair value is to forecast what a dealer will
pay for a vehicle, given certain characteristics. Price forecasting aids auction
representatives and the company in several ways. First, it protects both the company and
the auction from underselling a vehicle and absorbing a significant loss. Second, it
prevents over-expectations of fair market value, where an auction representative forces
too many "no sales" and never accepts the highest value the vehicle will receive over
time. Finally, it acts as a guide for auction
representatives to create a floor value that best incites dealers to bid the highest dollar
figure. Knowing the wholesale value also helps the auction representative make a
number of profitable decisions, including the extent to which vehicles are repaired, how to
run the auction to maximize value and whether to sell the vehicle at auction in the first
place. This paper specifically looks at a subset of vehicles that DaimlerChrysler
Financial Services sells on a regular basis.
To accomplish this task, this paper proposes a number of different methods to best
model dealer behavior. All of these methods employ data mining. One well-accepted
definition of data mining is “the science of extracting useful information from large data
sets or databases".1 More specifically, data mining is used to find patterns of behavior.
The objective here is to find patterns of car dealer behavior. Knowing these patterns will
create more valid forecasts when a specific vehicle with given characteristics comes to
auction. This paper proposes four methods to model these behaviors: a multiple linear
regression, a regression tree, an artificial neural network and an Ensemble Method that
employs portions of the three other methods simultaneously.
The vehicles comprising the data used in this paper, 9,573 in all, were sold at
auctions across the country held between February 25, 2004 and July 29, 2005
inclusive. Included in the data are a number of characteristics of each vehicle sold at
auction, with gross proceeds (gross_proc) as the dependent variable and the remaining
variables as explanatory variables. Here is a list of all variables used in this analysis:
Table 1: Variable Definitions
1 Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. Cambridge, MA. MIT Press, 2001.
This paper implements XLMiner software and uses the same explanatory
variables for all four methods.2,3
There is a five-step process for creating these methods. The data mining software
SAS Enterprise Miner uses the acronym SEMMA to illustrate this five-step process to
new users. SEMMA stands for:
2 Due to the 30-variable limitation of XLMiner, not all of the available variables could be used in this analysis. A two-step process determined which variables to use. First, a backward-selection stepwise regression of the logged form of gross_proc on all explanatory variables was run in Intercooled STATA 7.0 at a 90% significance level; the regression iteratively dropped the variables with the largest p-values and reported anything below 0.1. Second, any variable with a t-statistic greater than 5.3 in absolute value (p-value < .001) was included in this data mining analysis.
3 The variable "cpiusedseason" is seasonally-adjusted monthly CPI data for new and used motor vehicles. U.S. Department of Labor, Bureau of Labor Statistics, Consumer Price Indices. 2006. Division of Consumer Prices and Price Indices. 17 May 2006. <http://www.bls.gov/cpi/home.htm#data>.
Figure 1: Acronym for Data Mining Protocol
Sample → Explore → Modify → Model → Assess4
Firstly, sampling the data means taking portions of the data out of the pool and
randomly partitioning them into three parts: a training set, a validation set and a test set.
This is a diagram of how the data are partitioned:
Figure 2: Data Partitioning (Training Set 50%, Validation Set 30%, Test Set 20%)
The training and validation sets will be used for the first three methods and the
test set will primarily be used for evaluating the Ensemble Method vis-à-vis the three
methods that make up the ensemble. The original 9,573 observations are randomly
placed into these three partitions and all variables available will be included in the
partitioning. 50% of the data are in the training set, 30% of the data are in the validation
set and 20% of the data are in the test set.
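As a rough illustration of this partitioning step (the paper itself uses XLMiner's partition utility; the DataFrame and seed here are hypothetical stand-ins), a minimal Python sketch:

```python
import pandas as pd

def partition(df: pd.DataFrame, seed: int = 0):
    """Randomly shuffle the rows, then split 50% / 30% / 20%."""
    shuffled = df.sample(frac=1.0, random_state=seed)  # random shuffle of all rows
    n = len(shuffled)
    n_train, n_valid = int(0.5 * n), int(0.3 * n)
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    test = shuffled.iloc[n_train + n_valid:]
    return train, valid, test

# Usage on the 9,573 observations: train, valid, test = partition(auction_df)
```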
The second step is to explore the data to look for obvious trends, outliers, missing
information, etc. After running outlier analysis and applying some heuristics, there are a
few things worth noting; these factors help determine whether a model can be used as
4 SAS Technologies/Analytics. 2006. SAS Institute Inc. 17 May 2006. <http://www.sas.com/technologies/analytics/datamining/miner/semma.html>.
an accurate predictor: for example, mileage has a negative effect on a vehicle's value,
and no vehicle can realistically have more than 250,000 miles. There
were no notable outliers, making the exploration step a quick one. Thirdly, the data must
be modified to reduce outliers. This data did not include any obvious outliers or
observations with missing information. Modification also includes transforming
variables. For example, variables such as quality category (inventory_) must be
transformed to include binary variables, or “dummies”, for the four possible categories.5
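A minimal sketch of this dummy-variable transformation, assuming a pandas DataFrame with the inventory_ quality column coded 0-5 as in the data; categories zero and one are dropped and category three serves as the omitted base, following the footnote below:

```python
import pandas as pd

def add_quality_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """Drop categories 0-1, then build dummies with category 3 as the base level."""
    df = df[df["inventory_"].isin([2, 3, 4, 5])].copy()
    dummies = pd.get_dummies(df["inventory_"], prefix="inventory")
    dummies = dummies.drop(columns=["inventory_3"])  # avoid the dummy variable trap
    return pd.concat([df.drop(columns=["inventory_"]), dummies], axis=1)
```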
After these tasks are completed, it is time to model the data using the four
methods discussed. Here is a figure that diagrams these four models:
Figure 3: Methods Used
- Multiple Linear Regression: $y_t = \beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + e_t$
- Regression Tree
- Artificial Neural Network
- Ensemble Method: $y = \beta_0 + \beta_1(\mathrm{MLR}) + \beta_2(\mathrm{RT}) + \beta_3(\mathrm{NN})$
5 All of these vehicles fall under six categories numbered 0-5. A five is the best quality of vehicle, a zero is the worst quality and a three is average. For the sake of this analysis, observations with categories zero and one are dropped. Also, the variable representing category three vehicles is not included in any models to avoid the dummy variable trap and to verify the accuracy of the model (the coefficient for a two vehicle should be negative and the coefficients for fours and fives should be positive).
The training set will be used to construct these initial models. The constructed
models will then be scored by means of the validation set and the test set.
Finally, it is time to assess how well these models performed with new data. For
this paper, the main determinant of success is RMSE, or root mean squared error. This
statistic comes from taking the average of the squared errors of each observation
prediction and then taking the square root of that mean; the lower the RMSE, the better
the model. For the first three models, the model that best predicts the validation set will
be considered the best model and will be recommended for future auctions and for future
vehicle price forecasting. For the Ensemble Method, the test set will be used to
determine success. The purposes for using three different datasets will be explained later
in the study.
2) Multiple Linear Regression
The first model we explore is the multiple linear regression. This is the simplest
model econometrically and also one of the most popular models used in data mining
exercises. The idea behind this model starts with the algebraic equation for a straight
line. One way to write this equation for a straight line is $y = mx + b$, where $y$ is the
dependent variable that is being modeled, $m$ is a coefficient that describes the
slope, $x$ is some independent variable and $b$ is a $y$-intercept that explains what $y$ equals
when either $m$ or $x$ equals zero. The term $mx$ does not have to be restricted to one
independent variable, instead it can be a number of factors and each factor can have its
own coefficient. This describes the construction of a “multiple” linear regression model.
Generally, the form of this model is:
Equation 1: Formula for the Multiple Linear Regression Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$;
where $\beta_0$ is the $y$-intercept constant, $\beta_1, \ldots, \beta_k$ represent the coefficients of the
independent factors and $\varepsilon$ is an error term that captures the unexplained parts of the
regression. Ordinary least squares (OLS), a set of matrix algebra equations that
minimizes the squared differences between the true values of $y$ and the $\hat{y}$ values produced by the
regression, determines what these coefficients should be. For this regression to yield
legitimate results, there are some properties the regression must fulfill. First, the
regression must have a zero conditional mean, or the expected value of the error given all
of the explanatory variables equals zero. Second, there cannot be perfect collinearity; in
other words, the independent variables cannot have perfect linear relationships with each
other. Finally, the regression must be homoskedastic, or have a constant error variance.6
If these properties hold, then the regression can accurately reflect the relationships between the
different explanatory factors and the dependent variable.
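For concreteness, a sketch of estimating such a model by OLS with statsmodels (the paper's own estimates come from XLMiner; the feature list below is a hypothetical subset of the Appendix 1 variables):

```python
import statsmodels.api as sm

def fit_mlr(train, features, target="gross_proc"):
    """Fit gross_proc = b0 + b1*x1 + ... + bk*xk + e by ordinary least squares."""
    X = sm.add_constant(train[features])  # prepend the intercept column b0
    return sm.OLS(train[target], X).fit()

# Example usage on the training partition:
# model = fit_mlr(train, ["msrpamount", "sold_mileage", "cpiusedseason"])
# preds = model.predict(sm.add_constant(valid[["msrpamount", "sold_mileage", "cpiusedseason"]]))
```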
In this model, we will use the vehicle’s characteristics previously mentioned as
these explanatory variables and the gross proceeds as the dependent variable. The most
significant characteristics will be included in this regression. Using the training set,
XLMiner estimated the model via OLS; the result appears in Appendix 1. Using these
estimates, here is a subset of the scores of the validation
dataset:
Table 2: Subset of Scores for Multiple Linear Regression over the Validation Dataset
Row Id. | Predicted Value | Actual Value | Residual (actual − predicted)
3 | 18582.34 | 18800 | 217.66
11 | 17951.96 | 18800 | 848.04
14 | 19889.99 | 18800 | -1089.99
15 | 18774.51 | 18800 | 25.49
16 | 17310.62 | 18800 | 1489.38
19 | 18705.62 | 18800 | 94.38
20 | 20245.21 | 18800 | -1445.21
23 | 17896.63 | 18800 | 903.37
27 | 18158.53 | 18800 | 641.47
31 | 19419.04 | 18800 | -619.04
32 | 17604.87 | 18800 | 1195.13
34 | 18953.36 | 18800 | -153.36
42 | 19440.74 | 18800 | -640.74
47 | 19260.89 | 18800 | -460.89
50 | 19222.29 | 18800 | -422.29
53 | 19013.71 | 18800 | -213.71
57 | 19593.70 | 18750 | -843.70
6 Wooldridge, Jeffrey. Introductory Econometrics, A Modern Approach. Australia, Thomson South-Western, 2003. pp. 85-95.
The RMSE for the training set is $1,499.39 with an average error close to zero;
for the validation set the RMSE is $1,552.76 with an average error of $84.74. Because
the two RMSEs are within $53.37 of each other, the regression does seem to
forecast accurately relative to the original data. The R-squared statistic, which measures
how well a regression fits the data, equals 0.725 for this model, which is
acceptable given the conditions of the model. Some average error should be expected
here: OLS forces the in-sample average error to zero, but this property is next to
impossible to preserve when introducing new data, so a low average error is acceptable. Now that the
multiple linear regression seems sound enough for implementation, it is time to proceed
in creating the other three models and comparing their respective results with the multiple
linear regression to determine which model best predicts dealer behavior.
3) Regression Tree
The next model to explore is the regression tree. This approach can also be
appealing, especially to engineers, because visually it is the easiest to understand despite
the algorithmic difficulties in construction. This is the simplest model that potentially
analyzes non-linear relationships between explanatory variables and the dependent
variable. This approach is unlike the multiple linear regression, where only the $y$-intercept
and linear coefficients model the relationship. This process does, however,
require more time to construct, a factor that should be considered when determining
whether it is the best and most efficient system.
An example of a regression tree looks like this:7
Figure 4: Example of a Regression Tree
Diagramming a tree begins with some explanatory variable paired with a split value;
this pairing is referred to as a decision node. All records, hereafter referred to as
observations, start at this node and eventually fall down the tree according to the values
of their respective explanatory variables. From this node there are two extending branches,
each representing a range of values. If an observation's value for that explanatory
variable is less than or equal to the split value, then the observation goes down to the
left; if it is greater than the value listed, then the
observation falls to the right. Everything underneath this variable hinges on whether the
7Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA. Galit Shmueli, Nitin R. Patel, Peter C. Bruce. Resampling Stats, Inc. 2005. p. 124.
value from the observation is greater than, less than or equal to the value listed.8 When
reading a regression tree, there will be numbers next to the branches. Those numbers
represent the number of observations that followed that particular branch. After
following the branch, the observation can either come to another decision node or a
terminal node. If it is another decision node then the process starts over again and the
observation falls under one of two branches. If it is a terminal node then the tree ends
and a predicted value for the dependent variable is produced. This one value is the same
predicted value for all observations that fit the same characteristics; it comes
from taking the average of the dependent-variable values of the subset of training
observations making up the terminal node. In XLMiner, decision nodes are designated with circles and
terminal nodes are designated with rectangles and a blue font for the predicted value.9
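To make the traversal rule concrete, here is a minimal sketch of how a fitted tree scores one observation (a simplified structure, not XLMiner's internals; the two-level example collapses the top of the Appendix 2 tree into leaves):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    split_var: Optional[str] = None      # decision-node fields; None at a terminal
    split_value: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    pred_val: Optional[float] = None     # terminal node: mean gross proceeds of its cases

def predict(node: Node, obs: dict) -> float:
    """Drop an observation down the tree until it lands in a terminal node."""
    while node.pred_val is None:
        # <= split value goes down the left branch, > goes down the right branch
        node = node.left if obs[node.split_var] <= node.split_value else node.right
    return node.pred_val

# Two-level sketch using the root split from Appendix 2:
root = Node("c_model_key27", 0.5,
            left=Node(pred_val=19488.91), right=Node(pred_val=15361.76))
print(predict(root, {"c_model_key27": 0}))  # -> 19488.91
```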
When diagramming a regression tree there are many different options for defining
the branches. As with any data mining model, testing a variety of
different options, or “trial and error”, is the most reliable method to determine which
model performs best. There are three main types of trees: a full tree, a pruned tree and a
minimum error tree. In a full tree, it is the objective to fit virtually all of the observations
onto the tree, meaning the tree accurately captures all of the observations in the training
set. However, this method is subject to the malady of overfitting. Overfitting involves
devising a model that is so accurate with one dataset that it becomes worthless when it
8 One of the reasons engineers like this model is how it controls for outliers. An observation that is an outlier on some explanatory variable cannot change the entire model as much, because each explanatory variable only sends the observation down one of two branches, unlike a multiple linear regression where an outlier has a lot more clout.
9 Regression tree diagrams may also include rectangles with red text reading "Sub Tree beneath". XLMiner's diagramming rules allow a spreadsheet to contain only so many levels of the tree; if the tree has more levels than can be displayed, the "Sub Tree" rectangle appears. When this happens, check the tabular results to determine how an observation fits within that particular tree. Some trees in this analysis include such "Sub Trees".
tries to forecast a completely different set of data. The solution to overfitting is to create
a pruned tree. A pruned tree starts with a full tree and then removes, or “prunes”,
branches from the tree that do not significantly reduce the error rate. To determine this
significance, a chi-square test for independence is used: if the node split with the
strongest association to the dependent variable has a significant p-value according to the
chi-square statistic, then the split is included; otherwise it is pruned. The minimum error
tree differs in the data used: while pruned trees are pruned against the training set, a
minimum error tree uses the validation set to prune the full tree.
To apply regression trees to this auction data, all terminal nodes will include
dollar value predictions of a specific vehicle with the characteristics given in the decision
nodes. For this problem, four trees with different specifications were developed and the
tree with the lowest RMSE in the validation set will be considered the best tree for
prediction. The chart below provides these different specifications and their respective
results:
Table 3: Testing for Best Predictive Regression Tree
Tree | Specifications | RMSE of Validation Set (Average Error of Validation)
1 | 479 cases minimum in terminal node; 100 maximum splits for explanatory variables | 2124.81 (62.41)
2 | 100 cases minimum in terminal node; 100 maximum splits for explanatory variables | 1824.93 (60.19)
3 | 100 cases minimum in terminal node; 50 maximum splits for explanatory variables | 1815.84 (68.70)
4 | 479 cases minimum in terminal node; 200 maximum splits for explanatory variables | 2070.84 (64.62)
From these four trees, it is clear that there is a tradeoff between the RMSE and the
average error: the tree with the average error closest to zero does not have the lowest
RMSE, and the tree with the lowest RMSE does not have the average error closest to
zero.
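A rough scikit-learn equivalent of this trial-and-error search (its stopping controls only approximate XLMiner's "cases minimum" and "maximum splits" settings, so this is a sketch rather than a reproduction):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Approximations of the four specifications in Table 3.
SPECS = [
    {"min_samples_leaf": 479, "max_leaf_nodes": 100},
    {"min_samples_leaf": 100, "max_leaf_nodes": 100},
    {"min_samples_leaf": 100, "max_leaf_nodes": 50},
    {"min_samples_leaf": 479, "max_leaf_nodes": 200},
]

def best_tree(X_train, y_train, X_valid, y_valid):
    """Fit each candidate and keep the tree with the lowest validation RMSE."""
    scored = []
    for spec in SPECS:
        tree = DecisionTreeRegressor(random_state=0, **spec).fit(X_train, y_train)
        err = y_valid - tree.predict(X_valid)
        scored.append((float(np.sqrt(np.mean(err ** 2))), tree))
    return min(scored, key=lambda pair: pair[0])
```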
From previous analysis, auction representatives as well as those representing
DaimlerChrysler Financial Services believe that a positive bias, in the form of a slightly
positive average error, behooves the company more than a negative or even a zero
error. The basis for this logic comes from the power auctions have over prices in the
market. There are potentially drastic consequences to under-forecasting vehicles. An
under-forecasted vehicle could influence how dealers value similar types of vehicles.
This devaluation costs the company and the auctions a great deal of money. To be
risk averse, the company prefers to slightly over-forecast vehicles to avoid devaluation.
Over-forecasting is feasible because minor upward tweaks to prices may go unnoticed
by dealers, so the company avoids devaluation while retaining firm
control over the market price.
Returning to the model, what this behavior means is that average error is not the
most important statistic when it comes to determining the best regression tree; therefore
the tree with the lowest RMSE, as long as the average error is not obscenely larger than
the others, will be the tree recommended as the best. According to the results, the third
tree, which allows the fewest splits for the explanatory variables, best predicts the validation data;
pruning the tree down to the fewest splits of all the candidate trees is the best method.
Appendix 2 includes all split values and terminal node levels
of this tree. Here is a table of how the tree scored some of the validation dataset:
Table 4: Subset of Scores for Regression Tree over the Validation Dataset
Row Id. | Predicted Value | Actual Value | Residual (actual − predicted)
3 | 19167.86 | 18800 | -367.86
11 | 18518.58 | 18800 | 281.42
14 | 18218.93 | 18800 | 581.07
15 | 19167.86 | 18800 | -367.86
16 | 17578.42 | 18800 | 1221.58
19 | 19167.86 | 18800 | -367.86
20 | 19869.50 | 18800 | -1069.50
23 | 17578.42 | 18800 | 1221.58
27 | 18847.18 | 18800 | -47.18
31 | 19571.48 | 18800 | -771.48
32 | 17578.42 | 18800 | 1221.58
34 | 19771.07 | 18800 | -971.07
42 | 19771.07 | 18800 | -971.07
47 | 19113.11 | 18800 | -313.11
50 | 17981.51 | 18800 | 818.49
53 | 19167.86 | 18800 | -367.86
57 | 20156.35 | 18750 | -1406.35
Also, it is important to note that these trees do not use all of the explanatory
variables given. After completing analysis with regression trees, it is time to look into
how neural networks perform with the data.
4) Neural Network
Of all the data mining methods discussed in this paper, the artificial neural
network is the most difficult to construct and the most difficult to understand
conceptually. Engineering one of these models requires numerous calculations and, of
the four models, takes any statistical software program the longest amount of time to
complete. The major advantage, though, of learning about the neural network is that it best
captures complicated non-linear relationships, both between the explanatory variables and
the output and among the explanatory variables themselves, that a regression tree is inferior at describing.
The structure of a typical neural network consists of three kinds of layers: an input layer, a
group of hidden layers10 and an output layer. The input layer is made up of explanatory
variables and is represented by one node for each variable. The output layer consists of
one node that explains the behavior of the explanatory variables. If necessary, any
further calculations between these two layers are referred to as hidden layers. Here is an
example diagram of a neural network:11
10 In XLMiner, the user can specify anywhere from 1 to 4 of these hidden layers.
11 Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA: Resampling Stats, Inc., 2005. p. 160.
Figure 5: Example of Artificial Neural Network
This diagram has one input layer with four input nodes on the left of the diagram,
two hidden layers in-between the input and output layers and one output layer with one
node on the right side. Not listed are numbers written next to the arrows (weights) and
numbers listed before (bias) and just after a node (output). Weights help refine a
prediction since an output consists of a weighted sum of explanatory variables. Biases
are added to, or subtracted from, the output of that node, whether it is the final output or
the output for another hidden layer. Both of these terms usually take on absolute values
between 0.00 and 0.05. In a full calculation, the weights and explanatory variable values
enter some function and are evaluated; the bias is then added or subtracted, and the
result is the output of either a hidden layer or the final output node. Weights and biases
are initially guessed and then refined through "back propagation," where the error of a
prediction is calculated and the original estimates are corrected. This system updates the
weights after each observation.
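To make the weighted-sum-plus-bias arithmetic concrete, here is a minimal numpy sketch of a single forward pass through a network of this general shape (the activation function, random starting weights and layer sizes are illustrative assumptions, not XLMiner's exact implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """One forward pass: each hidden node outputs sigmoid(weights . inputs + bias)."""
    a = x
    for W, b in layers[:-1]:
        a = sigmoid(W @ a + b)            # hidden layer: weighted sum, add bias, squash
    W_out, b_out = layers[-1]
    return (W_out @ a + b_out).item()     # linear output node for a continuous target

# Shape sketch: 27 inputs -> two hidden layers of 25 nodes -> 1 output node.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(25, 27)), rng.normal(size=25)),
          (rng.normal(size=(25, 25)), rng.normal(size=25)),
          (rng.normal(size=(1, 25)), rng.normal(size=1))]
print(forward(rng.normal(size=27), layers))
```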
The input layer for this particular problem will consist of the same vehicle
characteristics used for the previous two models. The output is the estimated price for
that vehicle. There are four combinations of hidden layers and hidden layer nodes that
will be implemented. The combination with the lowest RMSE or a reasonably low
RMSE and the lowest average error in the validation dataset will be the network
recommended. The settings for all four neural networks will include 300 epochs12, a
weight change momentum13 of 0.4 and an error tolerance of 0.01. The table listed below
includes the results of this study:
Table 5: Testing for Best Neural Network
Neural Network | Specifications | RMSE of Validation Set (Average Error of Validation)
1 | 1 hidden layer; 10 nodes within each layer | 3896.28 (-3407.20)
2 | 1 hidden layer; 25 nodes within each layer | 4779.92 (-4295.93)
3 | 2 hidden layers; 25 nodes within each layer | 2887.19 (-2266.37)
4 | 4 hidden layers; 25 nodes within each layer14 | 3079.57 (-2027.18)
12 An epoch is a run through of the network where the weights are updated using new observations.
13 Weight change momentum retains some portion of the prediction of the earlier weights so that an outlier introduced to the network will not seriously affect the results.
14 These are the maximum specifications in XLMiner for a neural network.
From these four neural networks, the one with the lowest RMSE is the third
network, and the model with the average error closest to zero is the fourth network. Once
again, just as with the regression trees, the RMSE and the average error disagree over
the best performing model. But unlike with the regression trees, where the average error
was discounted in favor of a slightly positive bias, all of the average errors in the
neural networks are negative, so it is more beneficial to the company to have a bias as
close to zero as possible given only negative alternatives. Any significantly negative bias
will only hurt the company's status in the used vehicle market. The gap between the two
RMSEs and the gap between the two average errors of the two finalist models are
approximately the same size; therefore the fourth model will be considered the best
neural network. It is
important to note that building the fourth model with the maximum number of hidden
layers takes much more time than the third neural network.15 If efficiency means more
than minimizing the average error magnitude then the third neural network should be
considered the best predictive model. It is up to the developer of the data mining models
to determine what the opportunity costs are for building a more sophisticated model.
However, for the sake of this paper, the time used to develop different models is not a
factor for predictive quality and so the fourth model will be deemed the best artificial
neural network. Appendix 3 provides the estimates for these weights and nodes. How
this neural network scores the validation dataset can be seen in the following table:
15 In XLMiner, the third neural network took 222.0 seconds to build. The fourth model took 583.0 seconds to build.
Table 6: Some Scores for Neural Network Models over the Validation Dataset
Row Id. | Predicted Value | Actual Value | Residual (actual − predicted)
3 | 21726.93 | 18800 | -2926.93
11 | 20296.68 | 18800 | -1496.68
14 | 22544.31 | 18800 | -3744.31
15 | 20910.90 | 18800 | -2110.90
16 | 19725.39 | 18800 | -925.39
19 | 21690.67 | 18800 | -2890.67
20 | 23146.60 | 18800 | -4346.60
23 | 20691.13 | 18800 | -1891.13
27 | 21441.53 | 18800 | -2641.53
31 | 23226.48 | 18800 | -4426.48
32 | 20349.19 | 18800 | -1549.19
34 | 20265.57 | 18800 | -1465.57
42 | 20780.54 | 18800 | -1980.54
47 | 21273.00 | 18800 | -2473.00
50 | 21074.61 | 18800 | -2274.61
53 | 22036.26 | 18800 | -3236.26
57 | 21307.46 | 18750 | -2557.46
5) Ensemble Method
This paper has compiled the most acceptable linear regression, regression tree and
artificial neural network. Although one model may perform better than the other two,
this does not mean that the other two models should be discounted. In fact, the other two
models may still hold significance and thus predictive value. This is possible
theoretically thanks to the work published by C. Granger and R. Ramanathan, who
devised the combining method behind the Ensemble Method, devoted to implementing and
synthesizing a number of predictive models, including all three methods mentioned
earlier.16 The process works like that of a multiple linear regression.17 In this problem,
there are three explanatory variables: the first is the set of price predictions of the
16 Because XLMiner does not include an Ensemble Method in its software, this method is generated manually and will be detailed later in the paper.
17 Granger, C. and Ramanathan, R. "Improved Methods of Combining Forecasts," Journal of Forecasting. 1984. 3, pp. 197-204.
validation data generated by the multiple linear regression; the second is the set of price
predictions from the same data generated by the regression tree and the third comes from
the neural network. The true vehicle gross proceeds will then be placed in a column next
to these three model estimates. After compiling these data, the next step is to retrieve the
same estimates of the three models that were generated from the test data. This will serve
the purpose of the validation data, since the actual validation data is being used as a
training dataset for the Ensemble Method. In XLMiner, the only way to set up the
Ensemble Method is to copy and paste the respective columns into a blank worksheet and
label each column with the type of variable it generates. After building the regression,
the next step is to generate new variables that represent the errors of both the validation
and test datasets. An average error and a root mean squared error will be created from
both datasets and will determine whether the Ensemble Method or one of the original
three models is the best predictor of these vehicles.
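A sketch of this Granger-Ramanathan style combination, regressing the actual gross proceeds on the three models' validation-set predictions (statsmodels stands in here for the manual XLMiner regression described above):

```python
import numpy as np
import statsmodels.api as sm

def fit_ensemble(pred_mlr, pred_rt, pred_nn, actual):
    """Regress actual gross proceeds on the three sets of predictions, with intercept."""
    X = sm.add_constant(np.column_stack([pred_mlr, pred_rt, pred_nn]))
    return sm.OLS(actual, X).fit()        # coefficients play the role of Equation 2

def ensemble_predict(model, pred_mlr, pred_rt, pred_nn):
    """Score new data (here, the test set) with the fitted combination."""
    X = sm.add_constant(np.column_stack([pred_mlr, pred_rt, pred_nn]))
    return model.predict(X)
```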
It is now time to generate one multiple linear regression with the estimates from
the previous models. The format of the regression generated looks like this:
Equation 2: Estimates of Ensemble Method Based on Validation Dataset Predictions
$\hat{y} = 1054.43 + 0.39(\mathrm{MLR}) + 0.09(\mathrm{RT}) + 0.42(\mathrm{NN})$;
(standard errors: 255.61, 0.04, 0.03, 0.02)
where MLR, RT and NN are the validation dataset predictions for the multiple linear
regression, regression tree and artificial neural network, respectively. All four p-values
are less than 0.01, meaning that these estimates are significant at the 99% confidence
level. Applying these estimates to the test data, the average error comes to $95.69.
As for the RMSE, that equals $1,081.47. The R-squared reported for this regression
using the validation data is 0.76, which makes the regression acceptable as far as fitting
the data of the three models. The scores using the test dataset are in the following table:
Table 7: Scores of the Three Original Methods and the Ensemble Method over the
Test Dataset
Row Id. | Predicted Value for MLR | Residual from MLR | Predicted Value for RT | Residual from RT | Predicted Value for NN | Residual from NN | Predictions Using Ensemble Method | Actual Value | Residual from Ensemble Method
(residuals in this table are predicted value − actual value)
2 | 20169.72 | 1369.72 | 19484.22 | 684.22 | 22774.95 | 3974.95 | 20182.44 | 18800 | 1382.44
7 | 17737.09 | -1062.91 | 18392.86 | -407.14 | 20541.14 | 1741.14 | 18198.55 | 18800 | -601.45
8 | 17887.13 | -912.87 | 19113.11 | 313.11 | 20139.97 | 1339.97 | 18151.16 | 18800 | -648.84
13 | 18419.57 | -380.43 | 18518.58 | -281.42 | 21931.96 | 3131.96 | 19059.83 | 18800 | 259.83
22 | 19875.69 | 1075.69 | 18218.93 | -581.07 | 21789.53 | 2989.53 | 19545.48 | 18800 | 745.48
24 | 18795.21 | -4.79 | 19167.86 | 367.86 | 22002.18 | 3202.18 | 19292.36 | 18800 | 492.36
25 | 19154.70 | 354.70 | 19113.11 | 313.11 | 22124.22 | 3324.22 | 19479.78 | 18800 | 679.78
28 | 19794.22 | 994.22 | 21049.20 | 2249.20 | 23108.63 | 4308.63 | 20309.44 | 18800 | 1509.44
33 | 18465.06 | -334.94 | 19167.86 | 367.86 | 20884.03 | 2084.03 | 18694.32 | 18800 | -105.68
40 | 19811.60 | 1011.60 | 21861.85 | 3061.85 | 23868.71 | 5068.71 | 20704.57 | 18800 | 1904.57
46 | 17066.67 | -1733.33 | 17860.28 | -939.72 | 18809.17 | 9.17 | 17164.09 | 18800 | -1635.91
59 | 17416.51 | -1333.49 | 18392.86 | -357.14 | 19189.33 | 439.33 | 17506.35 | 18750 | -1243.65
All four methods have been implemented on the test dataset to determine which
model best predicts car dealer behavior at auctions. Three of the four methods were
tuned over different specification combinations; using the best model from each
method, here are the results:
Table 8: Summary Results of Four Competing Methods over Test Dataset
Model | RMSE (Average Error)
Linear Regression | $1,511.21 ($10.17)
Regression Tree | $2,046.73 (-$34.06)
Neural Network | $2,961.63 (-$2,373.42)
Ensemble Method | $1,081.47 ($95.69)
6) Conclusions
The first glaring observation in looking at the test dataset results is that the
artificial neural network is the worst model in both the average error and RMSE. The
neural network sometimes attempts to model relationships between explanatory variables
and the dependent variable in a way that is far more complicated than reality. Nearly all
of the predictions this model generated were significantly higher than the true gross
proceeds, so the model's average error is negatively biased. One possibility for
this bias is that the initial guesses for the weights were already far from
what they should have been. Despite the unacceptable results, the estimates generated by
the neural network are still important in that they are critical in calculating estimates for
the Ensemble Method. Although there are numerous combinations of the Ensemble
Method that would not include the neural network predictions, the fact that XLMiner
came up with a significant estimate shows its importance. It is also worth noting that the
coefficient for the neural network explanatory variable is the largest of the three models,
0.42.
As for the other three models, once again there is an almost linear tradeoff
between RMSE and the average error. If average error is deemed the most important
factor, then the regression tree would be considered the best model. However, the RMSE
of the regression tree is $429.74 worse than the model with the lowest RMSE, the
Ensemble Method. When situations like this in data mining occur, it is best to consult
with the ideals and objectives of the experiment to determine the model to use. In this
case, there is a monetary advantage to having a slightly positive bias.18 There are severe
consequences to devaluing vehicles, and the company wishes to avoid these. Ceteris
paribus, there is a precise dollar amount by which the company and auctions can raise
prices on this subset of used vehicles without considerable dealer backlash and a loss of
faith in the company's price forecasting altogether. Unfortunately, determining this precise amount
is difficult. After all, because of market sensitivities, an experiment that slightly
increases the bias each forecasting period until the auction notices dealer backlash is
impractical to conduct. Not only is lost dealer faith difficult to
regain, but price forecasts also carry a time-sensitive, last-period correlation. More
studies and other creative means need to be devised to determine exactly what bias the
market can sustain. For the time being, it is assumed that the market can sustain a
positive bias of $95.69.
For the sake of this study, it is also assumed that the amount of time spent
developing a forecasting model is not a factor in determining the best method; therefore
the Ensemble Method is the best predictor of these used vehicles at auction. In XLMiner,
it took a fraction of the time to build the multiple linear regression model compared with
18 See the discussion in Section 3 about the dangers of under-forecasting vehicles and the company’s mission to be as risk averse as possible.
developing the neural network. Also, because XLMiner is the focus of this research and
does not include an Ensemble Method, the time it takes to build the final model is the
sum of the time to run all three models plus the additional time to combine them into the
Ensemble Method.
On the whole, used vehicle auction data can be quite volatile. Bids for an auction
are sometimes in $250 increments and a number of things can affect whether any dealer
makes one additional bid or not. These unforeseen and immeasurable things can include:
heightened dealer competition where two or more dealers are vying for a limited
inventory, bad weather conditions where it is more difficult for dealers to show up to an
auction and the way a particular auction is run. For instance, some auctioneers and ring
men19 may be more or less influential when it comes to assuring a dealer will make
another bid. Currently, there is no forecasting model that captures these factors perfectly,
since it is next to impossible to come up with variables that capture this
behavior.
Despite this white noise, new and innovative ways of price forecasting
implemented with data mining are better capturing relationships between a gamut of
vehicle and auction factors and the true value of used vehicles. This study explores how
three of the most popular continuous variable models performed with actual auction data.
The Ensemble Method performs the best because it takes into account both the simple
and complicated relationships between vehicle characteristics and gross proceeds. The
other three models can only model either a linear relationship or a non-linear, non-
parametric complicated one. What implementing the Ensemble Method also suggests is
19 Ring men are a group of auction assistants who look at each dealer and, in a hurried fashion, attempt to get more bids out of dealers.
that the most sophisticated models for predicting auction bids usually also take the most
time to develop. It is up to executives to determine how much time can be devoted to a
project and whether it is more beneficial to spend more time developing a better model
or to accept an inferior model that saves on opportunity costs. Thankfully, data mining
offers econometricians and engineers alike these options. Showcasing data mining in
new and different ways will also help these modeling techniques become more widely
adopted across numerous academic fields.
Appendix 1: Multiple Linear Regression Coefficients and Estimates
Input variables | Coefficient | Std. Error | p-value | SS
Constant term | 82981.14844 | 3421.535645 | 0 | 1.72931E+12
msrpamount | 0.28138652 | 0.00771736 | 0 | 10726190000
sold_mileage | -0.09676149 | 0.00166331 | 0 | 12501070000
inventory_1 | -738.397461 | 92.78156281 | 0 | 366366200
inventory_3 | 610.1702881 | 50.7352829 | 0 | 78098100
inventory_4 | 1034.455811 | 80.36746216 | 0 | 490844400
auction_loc1 | -212.905685 | 67.29554749 | 0.00165312 | 1227785
auction_loc3 | -651.486572 | 100.7588577 | 0 | 21980950
auction_loc5 | 362.5553589 | 73.17302704 | 0.000001 | 151648000
auction_loc8 | -604.85968 | 95.44612122 | 0 | 163980000
auction_loc9 | -496.780975 | 88.9276123 | 0.00000004 | 66948200
auction_loc10 | -381.332794 | 89.72070313 | 0.00002552 | 134050700
c_model_key19 | -2050.70532 | 129.8644867 | 0 | 894399100
c_model_key25 | -1170.71643 | 185.2324677 | 0 | 5389190
c_model_key27 | -2200.5874 | 95.7226181 | 0 | 1070617000
c_model_key28 | -25.6804009 | 675.5402832 | 0.96969134 | 157382.5469
carmile3 | -0.01260703 | 0.00142604 | 0 | 63611690
cpiusedseason | -737.568237 | 35.53587341 | 0 | 1050569000
week_sold1 | -570.329224 | 117.7169495 | 0.0000017 | 30346850
week_sold3 | -627.526917 | 113.0041199 | 0.00000005 | 51884980
week_sold10 | 179.948288 | 98.53777313 | 0.06842326 | 21606660
week_sold11 | 518.2744141 | 94.07507324 | 0.00000006 | 113547500
week_sold14 | -337.202576 | 98.73433685 | 0.00068951 | 8397428
week_sold21 | -918.109131 | 145.2453919 | 0 | 70136130
week_sold22 | -643.478455 | 117.8486633 | 0.00000008 | 52729520
week_sold23 | -910.053955 | 127.2749252 | 0 | 106006300
week_sold24 | -1028.31201 | 132.9921722 | 0 | 135160200
Appendix 2: Regression Tree Split Values and Terminal Node Levels
Prune Tree Rules
Level | NodeID | ParentID | SplitVar | SplitValue | Cases | LeftChild | RightChild | PredVal | NodeType | Operator
0 | 0 | N/A | c_model_key27 | 0.5 | 2872 | 1 | 2 | 19008.59 | Decision | <=
1 | 1 | 0 | sold_mileage | 40848.5 | 2541 | 3 | 4 | 19488.91 | Decision | <=
1 | 2 | 0 | sold_mileage | 34870 | 331 | 21 | 22 | 15361.76 | Decision | <=
2 | 3 | 1 | msrpamount | 35421.5 | 1480 | 5 | 6 | 20636.49 | Decision | <=
2 | 4 | 1 | sold_mileage | 54370 | 1061 | 7 | 8 | 17931.3 | Decision | <=
3 | 5 | 3 | cpiusedseason | 94.8 | 837 | 11 | 12 | 19923.85 | Decision | <=
3 | 6 | 3 | msrpamount | 42255.18 | 643 | 9 | 10 | 21593.98 | Decision | <=
3 | 7 | 4 | msrpamount | 38804.4 | 708 | 13 | 14 | 18604.52 | Decision | <=
3 | 8 | 4 | sold_mileage | 68966.5 | 353 | 27 | 28 | 16631.08 | Decision | <=
4 | 9 | 6 | cpiusedseason | 94.5 | 570 | 15 | 16 | 21280.4 | Decision | <=
4 | 10 | 6 | N/A | N/A | 73 | N/A | N/A | 24538.5 | Terminal | N/A
4 | 11 | 5 | carmile3 | 30268.5 | 253 | 23 | 24 | 21015.63 | Decision | <=
4 | 12 | 5 | sold_mileage | 29244.5 | 584 | 19 | 20 | 19455.63 | Decision | <=
4 | 13 | 7 | cpiusedseason | 94.5 | 430 | 25 | 26 | 18013.64 | Decision | <=
4 | 14 | 7 | cpiusedseason | 94.5 | 278 | 29 | 30 | 19462.63 | Decision | <=
5 | 15 | 9 | sold_mileage | 25823.5 | 177 | 47 | 48 | 22420.7 | Decision | <=
5 | 16 | 9 | sold_mileage | 25218.5 | 393 | 17 | 18 | 20739.8 | Decision | <=
6 | 17 | 16 | N/A | N/A | 104 | N/A | N/A | 22063.81 | Terminal | N/A
6 | 18 | 16 | msrpamount | 40846 | 289 | 31 | 32 | 20214.25 | Decision | <=
5 | 19 | 12 | sold_mileage | 19440.5 | 252 | 45 | 46 | 20182.03 | Decision | <=
5 | 20 | 12 | carmile3 | 30574.5 | 332 | 35 | 36 | 18932.57 | Decision | <=
2 | 21 | 2 | sold_mileage | 24627 | 166 | 55 | 56 | 16128.79 | Decision | <=
2 | 22 | 2 | sold_mileage | 46820.5 | 165 | 41 | 42 | 14704.67 | Decision | <=
5 | 23 | 11 | sold_mileage | 26769 | 184 | 53 | 54 | 21520.97 | Decision | <=
5 | 24 | 11 | N/A | N/A | 69 | N/A | N/A | 19771.07 | Terminal | N/A
5 | 25 | 13 | N/A | N/A | 91 | N/A | N/A | 19167.86 | Terminal | N/A
5 | 26 | 13 | carmile3 | 41141.5 | 339 | 33 | 34 | 17725.09 | Decision | <=
4 | 27 | 8 | cpiusedseason | 94.8 | 279 | 37 | 38 | 16950.6 | Decision | <=
4 | 28 | 8 | N/A | N/A | 74 | N/A | N/A | 15502.11 | Terminal | N/A
5 | 29 | 14 | N/A | N/A | 63 | N/A | N/A | 20477.38 | Terminal | N/A
5 | 30 | 14 | sold_mileage | 49138 | 215 | 39 | 40 | 19103.48 | Decision | <=
7 | 31 | 18 | sold_mileage | 34212.5 | 178 | 49 | 50 | 20721.75 | Decision | <=
7 | 32 | 18 | N/A | N/A | 111 | N/A | N/A | 19484.22 | Terminal | N/A
6 | 33 | 26 | sold_mileage | 46833.5 | 215 | 51 | 52 | 18105.22 | Decision | <=
6 | 34 | 26 | N/A | N/A | 124 | N/A | N/A | 17019.13 | Terminal | N/A
6 | 35 | 20 | sold_mileage | 37906 | 265 | 61 | 62 | 19184.63 | Decision | <=
6 | 36 | 20 | N/A | N/A | 67 | N/A | N/A | 17981.51 | Terminal | N/A
5 | 37 | 27 | N/A | N/A | 61 | N/A | N/A | 17860.28 | Terminal | N/A
5 | 38 | 27 | carmile3 | 27261.5 | 218 | 43 | 44 | 16687.53 | Decision | <=
6 | 39 | 30 | inventory_3 | 0.5 | 147 | 57 | 58 | 19463.6 | Decision | <=
6 | 40 | 30 | N/A | N/A | 68 | N/A | N/A | 18218.93 | Terminal | N/A
3 | 41 | 22 | N/A | N/A | 99 | N/A | N/A | 15089.7 | Terminal | N/A
3 | 42 | 22 | N/A | N/A | 66 | N/A | N/A | 13946.04 | Terminal | N/A
6 | 43 | 38 | sold_mileage | 59917 | 150 | 59 | 60 | 17015.75 | Decision | <=
6 | 44 | 38 | N/A | N/A | 68 | N/A | N/A | 16036.37 | Terminal | N/A
6 | 45 | 19 | N/A | N/A | 62 | N/A | N/A | 20838.1 | Terminal | N/A
6 | 46 | 19 | sold_mileage | 25389 | 190 | 65 | 66 | 19889.93 | Decision | <=
6 | 47 | 15 | N/A | N/A | 66 | N/A | N/A | 23103.74 | Terminal | N/A
6 | 48 | 15 | N/A | N/A | 111 | N/A | N/A | 22045.9 | Terminal | N/A
8 | 49 | 31 | N/A | N/A | 93 | N/A | N/A | 21219.93 | Terminal | N/A
8 | 50 | 31 | N/A | N/A | 85 | N/A | N/A | 20156.35 | Terminal | N/A
7 | 51 | 33 | N/A | N/A | 110 | N/A | N/A | 18458.03 | Terminal | N/A
7 | 52 | 33 | N/A | N/A | 105 | N/A | N/A | 17578.42 | Terminal | N/A
6 | 53 | 23 | N/A | N/A | 102 | N/A | N/A | 21861.85 | Terminal | N/A
6 | 54 | 23 | N/A | N/A | 82 | N/A | N/A | 21049.2 | Terminal | N/A
3 | 55 | 21 | N/A | N/A | 60 | N/A | N/A | 16574.29 | Terminal | N/A
3 | 56 | 21 | N/A | N/A | 106 | N/A | N/A | 15821.05 | Terminal | N/A
7 | 57 | 39 | N/A | N/A | 83 | N/A | N/A | 19198.3 | Terminal | N/A
7 | 58 | 39 | N/A | N/A | 64 | N/A | N/A | 19869.5 | Terminal | N/A
7 | 59 | 43 | N/A | N/A | 80 | N/A | N/A | 17264.08 | Terminal | N/A
7 | 60 | 43 | N/A | N/A | 70 | N/A | N/A | 16676.68 | Terminal | N/A
7 | 61 | 35 | inventory_3 | 0.5 | 191 | 63 | 64 | 19313.38 | Decision | <=
7 | 62 | 35 | N/A | N/A | 74 | N/A | N/A | 18847.18 | Terminal | N/A
8 | 63 | 61 | N/A | N/A | 100 | N/A | N/A | 19113.11 | Terminal | N/A
8 | 64 | 61 | N/A | N/A | 91 | N/A | N/A | 19571.48 | Terminal | N/A
7 | 65 | 46 | N/A | N/A | 94 | N/A | N/A | 20094.23 | Terminal | N/A
7 | 66 | 46 | N/A | N/A | 96 | N/A | N/A | 19638.98 | Terminal | N/A
Appendix 3: Example of Artificial Neural Network Weights and Node Values
Hidden Layer #1: a weight for each of Nodes #1-25 on every input variable (msrpamount, sold_mileage and the inventory_, auction_loc, c_model_key, carmile3, cpiusedseason and week_sold variables), plus a bias node for each hidden node.
Hidden Layer #2: a weight for each of Nodes #1-25 on every Hidden Layer #1 node, plus a bias node for each.
Output Layer: a weight on each Hidden Layer #2 node, plus a bias node.
Works Cited
Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA: Resampling Stats, Inc., 2005.
Granger, C. and Ramanathan, R. "Improved Methods of Combining Forecasts." Journal of Forecasting 3 (1984): 197-204.
Hand, D., Mannila, H., Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
SAS Technologies/Analytics. SAS Institute Inc., 2006. 17 May 2006. <http://www.sas.com/technologies/analytics/datamining/miner/semma.html>.
U.S. Department of Labor, Bureau of Labor Statistics, Consumer Price Indices. Division of Consumer Prices and Price Indices, 2006. 17 May 2006. <http://www.bls.gov/cpi/home.htm#data>.
Wooldridge, Jeffrey. Introductory Econometrics: A Modern Approach. Australia: Thomson South-Western, 2003.