BAUDM Assignment
Assignment 2
Submitted by: Pranav Aggarwal (14PGP030)




QUESTION: Competitive Auctions on eBay.com.

The file eBayAuctions.xls contains information on 1972 auctions transacted on eBay.com during May-June 2004. The goal is to use these data to build a model that will classify competitive auctions from noncompetitive ones. A competitive auction is defined as an auction with at least two bids placed on the item auctioned. The data include variables that describe the item (auction category), the seller (his/her eBay rating), and the auction terms that the seller selected (auction duration, opening price, currency, day-of-week of auction close). In addition, we have the price at which the auction closed. The goal is to predict whether or not the auction will be competitive.

ANSWER:

SET SEED=14091992.
USE ALL.
COMPUTE filter_$=(uniform(1)<=.70).
VARIABLE LABELS filter_$ 'Approximately 70% of the cases (SAMPLE)'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FILTER OFF.
USE ALL.
EXECUTE.
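The SPSS syntax above tags an approximately 70% random training sample using a fixed seed. A minimal Python sketch of the same partition step (NumPy assumed; the row count of 1972 comes from the question text, and the seed matches SET SEED above):

```python
import numpy as np

rng = np.random.default_rng(14091992)       # same seed as SET SEED in the SPSS syntax
n_rows = 1972                               # number of auctions in eBayAuctions.xls
filter_ = rng.uniform(size=n_rows) <= 0.70  # True -> selected (training) case

# Roughly 70% of the 1972 rows end up flagged as training cases
print(filter_.sum(), "selected of", n_rows)
```

As in SPSS, the split is only approximately 70/30 because each row is flagged independently.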

1. Discriminant Analysis

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .924            111.572      4    .000

Standardized Canonical Discriminant Function Coefficients

                Function 1
sellerRating    -.142
ClosePrice      1.128
OpenPrice       -.926
Duration        -.039

Classification Results (a,b)

                                     Predicted Group Membership
                      Competitive?   0       1       Total
Cases Selected, Original
  Count               0              628     26      654
                      1              452     301     753
  %                   0              96.0    4.0     100.0
                      1              60.0    40.0    100.0
Cases Not Selected, Original
  Count               0              245     7       252
                      1              169     144     313
  %                   0              97.2    2.8     100.0
                      1              54.0    46.0    100.0

a. 66.0% of selected original grouped cases correctly classified.
b. 68.8% of unselected original grouped cases correctly classified.

According to the analysis, Wilks' lambda (.924) is close to 1, which indicates that the discriminant function separates competitive from noncompetitive auctions poorly, so this method is unlikely to produce an accurate model.

The confusion matrix confirms this: the accuracy on the selected cases is only 66.0%, which is very low, and only 40% of competitive auctions are correctly identified.
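For readers outside SPSS, the discriminant step can be sketched with scikit-learn. This is an illustrative example on synthetic stand-in data (not the eBay file); the four columns merely play the roles of sellerRating, ClosePrice, OpenPrice, and Duration. It fits a linear discriminant and hand-computes Wilks' lambda as det(W)/det(T); values near 1 mean weak group separation, as in the output above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
# Synthetic stand-in for the four predictors; class 1 ("competitive")
# is shifted so the groups overlap but are partly separable.
X0 = rng.normal(loc=0.0, scale=1.0, size=(300, 4))   # noncompetitive
X1 = rng.normal(loc=0.8, scale=1.0, size=(300, 4))   # competitive
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)
pred = lda.predict(X)
acc = accuracy_score(y, pred)
cm = confusion_matrix(y, pred)            # rows: actual, cols: predicted

# Wilks' lambda = det(within-group scatter) / det(total scatter)
W = sum((Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0)) for Xg in (X0, X1))
T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
wilks = np.linalg.det(W) / np.linalg.det(T)

print("accuracy:", round(acc, 2), "Wilks' lambda:", round(wilks, 3))
```

With stronger group separation, Wilks' lambda falls toward 0; the .924 value in the output above is why the discriminant model performs close to chance on the competitive class.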

2. Logistic Regression

Block 0: Beginning Block  

Classification Table (a,b)

                              Selected Cases (c)            Unselected Cases (d)
                              Predicted                     Predicted
                              Competitive?   Percentage     Competitive?   Percentage
Observed                      0      1       Correct        0      1       Correct
Step 0  Competitive?   0      0      654     .0             0      252     .0
                       1      0      753     100.0          0      313     100.0
        Overall Percentage                   53.5                          55.4

a. Constant is included in the model.
b. The cut value is .500.
c. Selected cases: Approximately 70% of the cases (SAMPLE) EQ 1.
d. Unselected cases: Approximately 70% of the cases (SAMPLE) NE 1.

Block 1: Method = Enter  

Classification Table (a)

                              Selected Cases (b)            Unselected Cases (c)
                              Predicted                     Predicted
                              Competitive?   Percentage     Competitive?   Percentage
Observed                      0      1       Correct        0      1       Correct
Step 1  Competitive?   0      526    128     80.4           203    49      80.6
                       1      193    560     74.4           70     243     77.6
        Overall Percentage                   77.2                          78.9

a. The cut value is .500.
b. Selected cases: Approximately 70% of the cases (SAMPLE) EQ 1.
c. Unselected cases: Approximately 70% of the cases (SAMPLE) NE 1.

Variables in the Equation

Step 1(a)        B        S.E.    Wald       df   Sig.   Exp(B)
Category                          53.801     17   .000
Category(1)      .141     .276    .262       1    .609   1.151
Category(2)      -1.029   .332    9.593      1    .002   .357
Category(3)      -.095    .400    .056       1    .813   .910
Category(4)      .168     .617    .074       1    .786   1.182
Category(5)      -1.321   .382    11.968     1    .001   .267
Category(6)      -1.149   .576    3.977      1    .046   .317
Category(7)      .000     .249    .000       1    .999   1.000
Category(8)      -.077    .574    .018       1    .893   .926
Category(9)      .974     .537    3.289      1    .070   2.649
Category(10)     -1.609   .715    5.064      1    .024   .200
Category(11)     -1.588   .459    11.957     1    .001   .204
Category(12)     -.139    .338    .170       1    .680   .870
Category(13)     -.446    .355    1.576      1    .209   .640
Category(14)     -.074    .228    .106       1    .745   .928
Category(15)     .823     1.269   .421       1    .517   2.277
Category(16)     .064     .673    .009       1    .925   1.066
Category(17)     -.528    .380    1.929      1    .165   .590
currency                          10.680     2    .005
currency(1)      -.536    .239    5.042      1    .025   .585
currency(2)      .953     .532    3.205      1    .073   2.593
sellerRating     .000     .000    7.872      1    .005   1.000
Duration                          10.442     4    .034
Duration(1)      -1.627   .846    3.694      1    .055   .197
Duration(2)      -.327    .344    .904       1    .342   .721
Duration(3)      -.059    .302    .038       1    .845   .943
Duration(4)      -.531    .267    3.965      1    .046   .588
endDay                            16.707     6    .010
endDay(1)        .408     .408    1.000      1    .317   1.504
endDay(2)        .977     .400    5.966      1    .015   2.657
endDay(3)        .185     .405    .208       1    .648   1.203
endDay(4)        .234     .397    .347       1    .556   1.264
endDay(5)        .074     .550    .018       1    .893   1.077
endDay(6)        .408     .411    .982       1    .322   1.503
ClosePrice       .091     .009    97.425     1    .000   1.095
OpenPrice        -.105    .010    102.684    1    .000   .901
Constant         .013     .443    .001       1    .977   1.013

a. Variable(s) entered on step 1: Category, currency, sellerRating, Duration, endDay, ClosePrice, OpenPrice.

According to the above analysis, accuracy improves from the 53.5% baseline (Block 0, constant-only model) to 77.2% on the selected cases (and from 55.4% to 78.9% on the unselected cases).

Page 6: 14PGP030

Also, not all variables are significant. Those significant at the 5% level (B shown) are:

Category(2)    -1.029
Category(5)    -1.321
Category(6)    -1.149
Category(10)   -1.609
Category(11)   -1.588
currency(1)    -0.536
sellerRating    0.000
Duration(4)    -0.531
endDay(2)       0.977
ClosePrice      0.091
OpenPrice      -0.105
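The logistic model above reports each coefficient B together with Exp(B), the odds ratio, and its gain over the constant-only baseline. A hedged scikit-learn sketch of the same idea on synthetic stand-in data (the two predictors only play the roles of ClosePrice and OpenPrice, with opposite-signed true effects as in the SPSS output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 1000
# Synthetic stand-ins: positive effect (like ClosePrice), negative (like OpenPrice)
close_price = rng.normal(10, 3, n)
open_price = rng.normal(10, 3, n)
logit = 0.9 * close_price - 1.0 * open_price
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([close_price, open_price])

model = LogisticRegression().fit(X, y)
acc = accuracy_score(y, model.predict(X))
odds_ratios = np.exp(model.coef_[0])    # corresponds to the Exp(B) column
baseline = max(y.mean(), 1 - y.mean())  # always-predict-the-majority accuracy

print("accuracy:", round(acc, 2), "baseline:", round(baseline, 2),
      "odds ratios:", np.round(odds_ratios, 2))
```

An odds ratio above 1 (here for the ClosePrice-like predictor) raises the odds of a competitive auction; below 1 (the OpenPrice-like predictor) lowers them, mirroring Exp(B) = 1.095 and .901 in the table above.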

Page 7: 14PGP030

3. Tree


Classification


Sample     Observed              Predicted
                                 0       1       Percent Correct
Training   0                     566     88      86.5%
           1                     135     618     82.1%
           Overall Percentage    49.8%   50.2%   84.2%
Test       0                     212     40      84.1%
           1                     45      268     85.6%
           Overall Percentage    45.5%   54.5%   85.0%

Growing Method: CHAID
Dependent Variable: Competitive?

So according to the analysis, the accuracy of this model is 84.2% on the training sample (85.0% on the test sample), which is higher than the training accuracy of the other models considered.
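The SPSS tree above is grown with CHAID (chi-square based splits). CHAID is not available in scikit-learn, so the sketch below uses CART (DecisionTreeClassifier) on synthetic stand-in data merely to illustrate the same train/test classification-table workflow; it is not the author's model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 1500
X = rng.normal(size=(n, 4))
# Synthetic rule: "competitive" when both conditions hold
y = ((X[:, 0] > 0) & (X[:, 1] + X[:, 2] > 0)).astype(int)

# ~70/30 partition, echoing the training/test split in the output above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))

print("train:", round(train_acc, 2), "test:", round(test_acc, 2))
```

Limiting max_depth plays the role CHAID's chi-square stopping rule plays in SPSS: it keeps the tree from overfitting the training sample.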

4. Neural Networks

Classification

Sample     Observed              Predicted
                                 0       1       Percent Correct
Training   0                     551     103     84.3%
           1                     124     629     83.5%
           Overall Percent       48.0%   52.0%   83.9%
Testing    0                     215     37      85.3%
           1                     50      263     84.0%
           Overall Percent       46.9%   53.1%   84.6%

Dependent Variable: Competitive?


Area Under the Curve

Competitive?   Area
0              .907
1              .907

The ROC curve is a plot of sensitivity against 1 − specificity; the closer the curve lies to the top-left corner, the better the classifier. Here the area under the curve is 0.907. From the classification table, the training accuracy is 83.9% (84.6% on the testing sample).
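The neural network step can be sketched with scikit-learn's MLPClassifier. Again this is an illustrative example on synthetic stand-in data (not the eBay predictors), showing how the ROC AUC reported above is computed from the network's predicted probabilities:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 4))
# Synthetic target: a noisy linear signal in the first two columns
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.8, size=n) > 0).astype(int)

mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)
scores = mlp.predict_proba(X)[:, 1]   # probability of class 1 ("competitive")
auc = roc_auc_score(y, scores)        # area under the ROC curve
acc = (mlp.predict(X) == y).mean()

print("AUC:", round(auc, 3), "accuracy:", round(acc, 3))
```

An AUC near 0.9, as in the output above, means the network ranks a randomly chosen competitive auction above a randomly chosen noncompetitive one about 90% of the time, even though its 0.5-cutoff accuracy is lower.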

Comparing all four models, the tree gives the highest accuracy: 84.2% on the training sample and 85.0% on the test sample.