Estimation As A Data Mining Task

Theoretical Understanding

Vinh Ngo & Mike Ellis

Introduction

• Estimation
  – Predicting continuous values rather than assigning records to predetermined categories

• Three main techniques
  – Regression
  – Decision trees
  – Neural networks

Regression

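• Linear regression
  – $Y = \alpha + \beta X$

• Method of Least Squares
  – $\beta = \dfrac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$
  – $\alpha = \bar{y} - \beta \bar{x}$

• Easy technique to use
  – Excel “Data Analysis”
  – Excel chart example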
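The method of least squares for the line Y = α + βX can be sketched in a few lines of Python. This is a minimal illustration, not the Excel procedure the slides mention: the function name `least_squares` and the sample points are made up for the example, and the slope and intercept are computed as the usual ratio of summed deviations and the mean-based intercept.

```python
def least_squares(xs, ys):
    """Return (alpha, beta) for the fitted line Y = alpha + beta * X."""
    s = len(xs)
    x_bar = sum(xs) / s
    y_bar = sum(ys) / s
    # beta = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    beta = numerator / denominator
    # alpha = y_bar - beta * x_bar
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical sample points that lie exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
alpha, beta = least_squares(xs, ys)
print(alpha, beta)
```

Because the sample points lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2.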

Excel Linear Regression Example

Other Regression Forms

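• Multiple regression
  – $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$

• Polynomial equation
  – $Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$
  – Define new variables $X_1 = X$, $X_2 = X^2$, and $X_3 = X^3$
  – Equation becomes $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$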
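The trick of turning a polynomial into a multiple regression by defining new variables (X, X squared, X cubed) can be tried out directly. The sketch below is one possible way to do it, assuming the coefficients are found by solving the normal equations with plain Gaussian elimination; the helper names `solve` and `fit_multiple` and the sample cubic are invented for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Least-squares fit of Y = alpha + beta_1 X_1 + ... + beta_k X_k.

    Returns [alpha, beta_1, ..., beta_k].
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated from Y = 1 + 2X - X^2 + 0.5X^3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]

# Define the new variables X1 = X, X2 = X^2, X3 = X^3.
rows = [(x, x ** 2, x ** 3) for x in xs]
coeffs = fit_multiple(rows, ys)
print([round(c, 6) for c in coeffs])
```

Once the new variables are in place, the polynomial fit is an ordinary multiple regression, so the same routine serves both cases.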

Decision Trees

• Regression tree
  – Leaves → average values

• Model tree
  – Leaves → linear regression models

• Discretizing the data
  – Convert continuous data into discrete partitions
  – Threshold values
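To make "leaves → average values" concrete, here is a sketch of the smallest possible regression tree: one threshold and two leaves. It assumes the threshold is chosen by minimizing the summed squared error within the two partitions, which is a common criterion for numeric targets but is an assumption of this example rather than something stated on the slides. The names `fit_stump` and `predict` and the data are hypothetical.

```python
def mean(vals):
    return sum(vals) / len(vals)


def sse(vals):
    """Sum of squared deviations from the mean."""
    m = mean(vals)
    return sum((v - m) ** 2 for v in vals)


def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    points = sorted(set(xs))
    best = None
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return best[1:]


def predict(stump, x):
    """Each leaf predicts the average target value of its training points."""
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


# Hypothetical data with an obvious break between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_stump(xs, ys)
print(stump, predict(stump, 2), predict(stump, 11))
```

A model tree would differ only at the leaves, where a fitted linear regression model would replace each average.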

Threshold Values

• Entropy
  – Measure of purity
  – $Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$

• Information gain
  – Expected reduction in entropy due to partitioning
  – $Gain(S, A) = Entropy(S) - \sum_{v \in Value(A)} \dfrac{|S_v|}{|S|} Entropy(S_v)$
  – Maximize for best threshold
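Choosing a threshold by maximizing information gain can be sketched as follows. The example assumes a discrete target with two labels and takes candidate thresholds at the midpoints between adjacent attribute values; both choices, along with the function names and the data, are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_for_threshold(values, labels, threshold):
    """Information gain of splitting at `threshold` (<= goes left, > goes right)."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the partitions: sum of |S_v| / |S| * Entropy(S_v).
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Return the midpoint threshold with the largest information gain."""
    points = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    return max(candidates, key=lambda t: gain_for_threshold(values, labels, t))


# Hypothetical attribute values and class labels.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]
t = best_threshold(values, labels)
gain = gain_for_threshold(values, labels, t)
print(t, gain)
```

Here the split at 6.5 separates the two labels perfectly, so the gain equals the full entropy of the unsplit set (1 bit).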

CART Algorithm

• Classification and Regression Tree
  – Grow a tree that overfits data
  – Prune the tree
  – Select best subtree

Decision Trees

• Strengths
  – Understandable
  – Which fields are most important

• Weaknesses
  – Intended for discrete data
  – Time to grow and prune tree

Comparison Example

CPU Performance Data

       cycle       main memory (Kb)    cache    channels        performance
       time (ns)   min      max        (Kb)     min     max
       MYCT        MMIN     MMAX       CACH     CHMIN   CHMAX   PRP
1      125         256      6000       256      16      128     198
2      29          8000     32000      32       8       32      269
3      29          8000     32000      32       8       32      220
4      29          8000     32000      32       8       32      172
5      29          8000     16000      32       8       16      132
…
207    125         2000     8000       0        2       14      52
208    480         512      8000       32       0       0       67
209    480         1000     4000       0        0       0       45

Linear Regression Result

PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
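As a quick check of how the fitted equation is used, the sketch below plugs record 1 of the CPU performance data into it. The coefficients and the record come from the slides; the names `COEFFS`, `INTERCEPT`, and `predict_prp` are made up for the example.

```python
# Coefficients of the linear regression result for the CPU performance data.
INTERCEPT = -56.1
COEFFS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(record):
    """Estimate PRP as the intercept plus the weighted sum of the attributes."""
    return INTERCEPT + sum(COEFFS[name] * record[name] for name in COEFFS)


# Record 1 from the CPU performance table (recorded PRP: 198).
row1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
predicted = predict_prp(row1)
print(round(predicted, 1))
```

For this record the equation gives roughly 333.7 against a recorded PRP of 198, a reminder that a single global line can miss individual records by a wide margin.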

Regression Tree

[Figure: regression tree for the CPU performance data. The root node splits on CHMIN (≤ 7.5 / > 7.5); lower nodes split on CACH, MMAX, CHMAX, MMIN, and MYCT. The 12 leaves hold average PRP values: 64.6 (24 points), 75.7 (10 points), 133 (16 points), 783 (5 points), 29.8 (37 points), 19.3 (28 points), 157 (21 points), 281 (11 points), 492 (7 points), 59.3 (24 points), 37.3 (19 points), 18.3 (7 points).]

Model Tree

[Figure: model tree for the CPU performance data. The root node splits on CHMIN (≤ 7.5 / > 7.5); lower nodes split on CACH and MMAX. The six leaves hold linear models: LM1 (65 points), LM2 (26 points), LM3 (24 points), LM4 (50 points), LM5 (21 points), LM6 (23 points).]

LM1  PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2  PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3  PRP = 38.1 + 0.012 MMIN
LM4  PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5  PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6  PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

Side-By-Side

[Figure: the three models for the CPU performance data shown side by side. Panels: Linear regression (the fitted PRP equation), Regression tree (12 leaves with average PRP values), Model tree (six leaves with linear models LM1 to LM6).]

Simple Neural Network

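[Figure: a simple feed-forward network with an input layer (Income, Gender, Age), a hidden layer, and an output layer that produces the prediction. Numbers shown in the figure: 3.6; 0.8, 0.0, 0.4; 2.0, 3.5; 1.9.]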

Building the Neural Net

• Recursive process
  – Assign initial weights
  – Run training values through network
  – Compare results to actual value

• Backpropagation
  – Pass errors back through net
  – Incorrect node gets less influence

• Military metaphor
• Recurrent networks
• Genetic algorithms
• Simulated annealing
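The training loop (assign initial weights, run a training value through the network, compare with the actual value, pass the error back) can be sketched for a tiny network. This is a minimal illustration under stated assumptions, not the network from the slides: three inputs, two hidden nodes, one output, sigmoid activations, squared error, no bias terms, and arbitrary starting weights, learning rate, and target.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def forward(x, w_hidden, w_out):
    """Run one input vector through the network; return hidden activations and output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One backpropagation update for squared error with sigmoid units."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node.
    delta_out = (output - target) * output * (1 - output)
    # Pass the error back: each hidden node is blamed in proportion to its outgoing weight.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - rate * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - rate * dh * xi for w, xi in zip(ws, x)]
        for ws, dh in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_out


# Hypothetical training example and initial weights.
x = [0.4, 0.0, 0.8]
target = 0.9
w_hidden = [[0.1, -0.2, 0.3], [-0.3, 0.2, 0.1]]
w_out = [0.2, -0.1]

errors = []
for _ in range(200):
    _, out = forward(x, w_hidden, w_out)
    errors.append(abs(out - target))  # compare result to the actual value
    w_hidden, w_out = train_step(x, target, w_hidden, w_out)

print(errors[0], errors[-1])
```

With repeated passes the absolute error on the training example shrinks, which is the behaviour the recursive process on the slide describes.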

Neural Networks

• Strengths
  – Accurate
  – Fast to use
  – Handle missing or corrupt data well

• Weaknesses
  – Not intuitive
  – Don’t handle large numbers of predictors well
  – Data preprocessing

Neural Net Example

• Four years of 30-minute trading data, 1985-1988
  – 1986 & 1987 for training
  – 1985 for testing

• USD/CHF

• Single layer model
  – Input nodes: 7
  – Hidden nodes: 7

• Two layer model
  – Input nodes: 7
  – Hidden nodes: 5/2

• Output
  – Value between -1.0 and 1.0
  – Rise or fall?

Model              % correct    Return %
Random             50.0         ----
Linear             52.5         -6.8
1 hidden layer     53.4         9.9
2 hidden layers    54.0         9.8
Average network    53.7         11.5

Accuracy of Models

• Overfitting
  – Applies to all three
  – Independent test data

• Statistical measures

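Statistical Measures

mean-squared error: $\dfrac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}$

root mean squared error: $\sqrt{\dfrac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}}$

mean absolute error: $\dfrac{\sum_{i=1}^{n} |p_i - a_i|}{n}$

relative squared error: $\dfrac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}$, where $\bar{a} = \dfrac{1}{n} \sum_{i} a_i$

root relative squared error: $\sqrt{\dfrac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$

relative absolute error: $\dfrac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$

correlation coefficient: $\dfrac{S_{PA}}{\sqrt{S_P S_A}}$, where $S_{PA} = \dfrac{\sum_{i} (p_i - \bar{p})(a_i - \bar{a})}{n - 1}$, $S_P = \dfrac{\sum_{i} (p_i - \bar{p})^2}{n - 1}$, and $S_A = \dfrac{\sum_{i} (a_i - \bar{a})^2}{n - 1}$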
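The statistical measures can be computed together in one pass over predicted values p and actual values a. The sketch below is one straightforward reading of the formulas on this slide; the function name `error_measures`, the dictionary keys, and the sample values are assumptions made for the example.

```python
from math import sqrt


def error_measures(p, a):
    """Compute the slide's accuracy measures for predictions p and actual values a."""
    n = len(a)
    a_bar = sum(a) / n
    p_bar = sum(p) / n
    sq_err = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    abs_err = sum(abs(pi - ai) for pi, ai in zip(p, a))
    sq_dev = sum((ai - a_bar) ** 2 for ai in a)   # squared error of always predicting the mean
    abs_dev = sum(abs(ai - a_bar) for ai in a)    # absolute error of always predicting the mean
    s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - p_bar) ** 2 for pi in p) / (n - 1)
    s_a = sq_dev / (n - 1)
    return {
        "mse": sq_err / n,
        "rmse": sqrt(sq_err / n),
        "mae": abs_err / n,
        "rse": sq_err / sq_dev,
        "rrse": sqrt(sq_err / sq_dev),
        "rae": abs_err / abs_dev,
        "correlation": s_pa / sqrt(s_p * s_a),
    }


# Hypothetical values: every prediction is exactly 1 too high.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
measures = error_measures(predicted, actual)
for name, value in measures.items():
    print(name, round(value, 4))
```

In this example the correlation coefficient is 1 even though every prediction is off by 1, which shows that correlation ignores a constant offset while the error measures do not.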

Any questions?

Special Bonus Slide
