
Post on 15-Jan-2016


Page 1: Estimation As A Data Mining Task

Estimation As A Data Mining Task

Theoretical Understanding

Vinh Ngo & Mike Ellis

Page 2: Estimation As A Data Mining Task

Introduction

• Estimation
  – Predicting values not in predetermined categories
• Three main techniques
  – Regression
  – Decision trees
  – Neural networks

Page 3: Estimation As A Data Mining Task

Regression

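• Linear regression
  – $Y = \alpha + \beta X$
• Method of Least Squares
  – $\beta = \dfrac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$
  – $\alpha = \bar{y} - \beta \bar{x}$
• Easy technique to use
  – Excel “Data Analysis”
  – Excel chart example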
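The method of least squares on this slide can be sketched in a few lines of Python: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the two means. The function name `least_squares` and the sample points are illustrative assumptions, not part of the original deck.

```python
def least_squares(xs, ys):
    """Fit Y = alpha + beta * X by the method of least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Made-up sample that lies exactly on y = 1 + 2x
alpha, beta = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)
```

Because the sample points sit exactly on a line, the fit recovers an intercept of 1 and a slope of 2. With noisy data the same code returns the line that minimizes the squared vertical distances.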

Page 4: Estimation As A Data Mining Task

Excel Linear Regression Example

Page 5: Estimation As A Data Mining Task

Other Regression Forms

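• Multiple regression
  – $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$
• Polynomial equation
  – $Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$
  – Define new variables: $X_1 = X$, $X_2 = X^2$, and $X_3 = X^3$
  – Equation becomes $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$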
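The substitution trick for polynomial equations can be illustrated as follows: each power of X becomes its own column, and the problem is then fitted as an ordinary multiple regression. This is a minimal sketch that solves the normal equations with plain Gaussian elimination. The helper names (`solve`, `multiple_regression`, `cubic_fit`) and the sample data are assumptions made for illustration, and a numerical library would normally be preferred for real data.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def multiple_regression(rows, ys):
    """Least-squares fit of Y = alpha + beta_1 X_1 + ... + beta_k X_k."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries alpha
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def cubic_fit(xs, ys):
    # Define new variables X1 = X, X2 = X^2, X3 = X^3,
    # then treat the cubic as a three-variable linear model.
    rows = [(x, x ** 2, x ** 3) for x in xs]
    return multiple_regression(rows, ys)


# Made-up data generated from Y = 1 + 2X - X^2 + 0.5X^3
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]
coeffs = cubic_fit(xs, ys)  # [alpha, beta_1, beta_2, beta_3]
print(coeffs)
```

The fitting routine never sees that the columns are powers of one variable. That is the point of the substitution: the model stays linear in its coefficients.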

Page 6: Estimation As A Data Mining Task

Decision Trees

• Regression tree
  – Leaves → average values
• Model tree
  – Leaves → linear regression models
• Discretizing the data
  – Convert continuous data into discrete partitions
  – Threshold values

Page 7: Estimation As A Data Mining Task

Threshold Values

• Entropy
  – Measure of purity
  – $Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
• Information gain
  – Expected reduction in entropy due to partitioning
  – $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \dfrac{|S_v|}{|S|} Entropy(S_v)$
  – Maximize for best threshold
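A small Python sketch may make the threshold search concrete: compute the entropy of the labels, try each midpoint between adjacent distinct values as a candidate threshold, and keep the one with the largest information gain. The function names and the toy "low"/"high" data are assumptions made for illustration. Trying midpoints is one common way to generate candidate thresholds and is not necessarily the one the authors used.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def gain_for_threshold(values, labels, threshold):
    """Information gain from splitting into value <= threshold and value > threshold."""
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (below, above) if part
    )
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Return the candidate midpoint that maximizes information gain."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need two or more distinct values to place a threshold")
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_for_threshold(values, labels, t))


# Toy data: small values are labeled "low", large values "high"
values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
threshold = best_threshold(values, labels)
gain = gain_for_threshold(values, labels, threshold)
print(threshold, gain)
```

In this toy set the split at 6.5 separates the two classes perfectly, so the gain equals the full entropy of the unsplit labels (1 bit).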

Page 8: Estimation As A Data Mining Task

CART Algorithm

• Classification and Regression Tree
  – Grow a tree that overfits data
  – Prune the tree
  – Select best subtree

Page 9: Estimation As A Data Mining Task

Decision Trees

• Strengths
  – Understandable
  – Which fields are most important
• Weaknesses
  – Intended for discrete data
  – Time to grow and prune tree

Page 10: Estimation As A Data Mining Task

Comparison Example

CPU Performance Data

       cycle time   main memory (Kb)    cache   channels        performance
       (ns)         min       max       (Kb)    min     max
       MYCT         MMIN      MMAX      CACH    CHMIN   CHMAX   PRP
1      125          256       6000      256     16      128     198
2      29           8000      32000     32      8       32      269
3      29           8000      32000     32      8       32      220
4      29           8000      32000     32      8       32      172
5      29           8000      16000     32      8       16      132
…
207    125          2000      8000      0       2       14      52
208    480          512       8000      32      0       0       67
209    480          1000      4000      0       0       0       45

Page 11: Estimation As A Data Mining Task

Linear Regression Result

PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
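To see what the fitted equation predicts, it can be applied to a row of the CPU performance table. The sketch below uses the coefficients from this slide and row 1 of the table. The dictionary layout and the function name `predict_prp` are illustrative assumptions.

```python
# Coefficients copied from the linear regression result on this slide
INTERCEPT = -56.1
COEFFS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(row):
    """Apply the linear model to one machine described as a dict of attributes."""
    return INTERCEPT + sum(COEFFS[name] * row[name] for name in COEFFS)


# Row 1 of the CPU performance table (recorded PRP = 198)
row1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
predicted = predict_prp(row1)
print(round(predicted, 1))
```

For this row the model predicts a PRP of about 333.7 against a recorded value of 198. A single global linear equation can miss individual machines by a wide margin.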

Page 12: Estimation As A Data Mining Task

Regression Tree

[Figure: regression tree for the CPU performance data. The root splits on CHMIN (≤ 7.5 / > 7.5); lower nodes split on CACH, MMAX, CHMAX, MMIN, and MYCT. Each leaf gives an average PRP value and the number of points it covers: 64.6 (24), 75.7 (10), 133 (16), 783 (5), 29.8 (37), 19.3 (28), 157 (21), 281 (11), 492 (7), 59.3 (24), 37.3 (19), 18.3 (7).]
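The idea that each regression-tree leaf predicts an average can be tried on the root split alone. The sketch below partitions the eight rows visible in the CPU performance table on CHMIN at 7.5 and averages PRP on each side. Because only eight of the 209 rows are used, the averages are illustrative and do not match the leaf values in the figure. The function name `leaf_averages` is an assumption.

```python
# (CHMIN, PRP) pairs for the eight rows shown in the CPU performance table
rows = [
    (16, 198), (8, 269), (8, 220), (8, 172), (8, 132),  # rows 1-5
    (2, 52), (0, 67), (0, 45),                          # rows 207-209
]


def leaf_averages(pairs, threshold):
    """Split on attribute <= threshold and return the mean target on each side."""
    low = [target for attr, target in pairs if attr <= threshold]
    high = [target for attr, target in pairs if attr > threshold]
    if not low or not high:
        raise ValueError("threshold leaves one side empty")
    return sum(low) / len(low), sum(high) / len(high)


low_avg, high_avg = leaf_averages(rows, 7.5)
print(round(low_avg, 1), round(high_avg, 1))
```

On these rows the CHMIN ≤ 7.5 side averages about 54.7 and the CHMIN > 7.5 side averages 198.2. A full regression tree repeats the same step inside each partition.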

Page 13: Estimation As A Data Mining Task

Model Tree

[Figure: model tree for the CPU performance data. The root splits on CHMIN (≤ 7.5 / > 7.5); lower nodes split on CACH and MMAX. Each leaf holds a linear model and the number of points it covers: LM1 (65), LM2 (26), LM3 (24), LM4 (50), LM5 (21), LM6 (23).]

LM1 PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2 PRP = 20.3 + 0.004 MMIN – 3.99 CHMIN + 0.946 CHMAX
LM3 PRP = 38.1 + 0.012 MMIN
LM4 PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5 PRP = 285 – 1.46 MYCT + 1.02 CACH – 9.39 CHMIN
LM6 PRP = -65.8 + 0.03 MMIN – 2.94 CHMIN + 4.98 CHMAX

Page 14: Estimation As A Data Mining Task

Side-By-Side

[Figure: the linear regression equation, the regression tree, and the model tree from the three preceding slides, repeated side by side under the labels “Linear regression”, “Regression tree”, and “Model tree”.]

Page 15: Estimation As A Data Mining Task

Simple Neural Network

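[Figure: a simple neural network with an input layer (Income, Gender, Age; values shown 0.8, 0.0, 0.4), a hidden layer, and an output layer producing the prediction. Other values shown in the diagram: 2.0, 3.5, 1.9, 3.6.]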

Page 16: Estimation As A Data Mining Task

Building the Neural Net

• Recursive process
  – Assign initial weights
  – Run training values through network
  – Compare results to actual value
• Backpropagation
  – Pass errors back through net
  – Incorrect node gets less influence
• Military metaphor
• Recurrent networks
• Genetic algorithms
• Simulated annealing
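The training loop described on this slide can be sketched with a very small network: one input, two sigmoid hidden nodes, and a linear output node. Everything specific here (the network size, the fixed starting weights, the learning rate, and the toy target y = x²) is an assumption chosen to keep the example short and repeatable. It is not the network from the deck.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def initial_weights():
    # Assign initial weights (fixed small values so every run is identical)
    return [0.5, -0.5], [0.1, -0.1], [0.3, -0.3], 0.0


def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    output = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return hidden, output


def mean_squared_error(data, params):
    return sum((forward(x, *params)[1] - y) ** 2 for x, y in data) / len(data)


def train(data, epochs=2000, rate=0.1):
    w_hidden, b_hidden, w_out, b_out = initial_weights()
    for _ in range(epochs):
        for x, actual in data:
            # Run the training value through the network
            hidden, predicted = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Compare the result to the actual value
            error = predicted - actual
            # Pass the error back: each weight is adjusted in proportion
            # to its contribution to the error
            for j, h in enumerate(hidden):
                hidden_error = error * w_out[j] * h * (1.0 - h)
                w_out[j] -= rate * error * h
                w_hidden[j] -= rate * hidden_error * x
                b_hidden[j] -= rate * hidden_error
            b_out -= rate * error
    return w_hidden, b_hidden, w_out, b_out


# Toy training set: learn y = x^2 on five points
data = [(x, x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
before = mean_squared_error(data, initial_weights())
after = mean_squared_error(data, train(data))
print(before, after)
```

Comparing the error before and after training shows the effect of the weight updates. Measuring accuracy on data the network has not seen would need a separate test set.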

Page 17: Estimation As A Data Mining Task

Neural Networks

• Strengths
  – Accurate
  – Fast to use
  – Handle missing or corrupt data well
• Weaknesses
  – Not intuitive
  – Don’t handle large numbers of predictors well
  – Data preprocessing

Page 18: Estimation As A Data Mining Task

Neural Net Example

• Four years of 30-minute trading data, 1985-1988
  – 1986 & 1987 for training
  – 1985 for testing
• USD/CHF
• Single layer model
  – Input nodes: 7
  – Hidden nodes: 7
• Two layer model
  – Input nodes: 7
  – Hidden nodes: 5/2
• Output
  – Value between -1.0 and 1.0
  – Rise or fall?

Model              % correct   Return %
Random             50.0        ----
Linear             52.5        -6.8
1 hidden layer     53.4        9.9
2 hidden layers    54.0        9.8
Average network    53.7        11.5

Page 19: Estimation As A Data Mining Task

Accuracy of Models

• Overfitting
  – Applies to all three
  – Independent test data
• Statistical measures

Page 20: Estimation As A Data Mining Task

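Statistical Measures

mean-squared error
$$\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2$$

root mean squared error
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}$$

mean absolute error
$$\frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|$$

relative squared error
$$\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}, \quad \text{where } \bar{a} = \frac{1}{n}\sum_{i} a_i$$

root relative squared error
$$\sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$$

relative absolute error
$$\frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$$

correlation coefficient
$$\frac{S_{PA}}{\sqrt{S_P S_A}}, \quad \text{where } S_{PA} = \frac{\sum_{i} (p_i - \bar{p})(a_i - \bar{a})}{n - 1}, \quad S_P = \frac{\sum_{i} (p_i - \bar{p})^2}{n - 1}, \quad \text{and } S_A = \frac{\sum_{i} (a_i - \bar{a})^2}{n - 1}$$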
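The seven measures on this slide can be computed together in one short Python function. The sketch reads p as the predicted values and a as the actual values, which is an interpretation of the slide's symbols. The function name `error_measures`, the dictionary keys, and the sample numbers are assumptions. The relative measures and the correlation coefficient are undefined when all actual (or all predicted) values are identical, and this sketch does not guard against that case.

```python
from math import sqrt


def error_measures(predicted, actual):
    """Compute the slide's accuracy measures for paired predictions and actual values."""
    n = len(actual)
    a_bar = sum(actual) / n
    p_bar = sum(predicted) / n
    pairs = list(zip(predicted, actual))

    sq_err = sum((p - a) ** 2 for p, a in pairs)
    abs_err = sum(abs(p - a) for p, a in pairs)
    sq_dev = sum((a - a_bar) ** 2 for a in actual)    # spread of the actual values
    abs_dev = sum(abs(a - a_bar) for a in actual)

    s_pa = sum((p - p_bar) * (a - a_bar) for p, a in pairs) / (n - 1)
    s_p = sum((p - p_bar) ** 2 for p in predicted) / (n - 1)
    s_a = sq_dev / (n - 1)

    return {
        "mean_squared_error": sq_err / n,
        "root_mean_squared_error": sqrt(sq_err / n),
        "mean_absolute_error": abs_err / n,
        "relative_squared_error": sq_err / sq_dev,
        "root_relative_squared_error": sqrt(sq_err / sq_dev),
        "relative_absolute_error": abs_err / abs_dev,
        "correlation_coefficient": s_pa / sqrt(s_p * s_a),
    }


# Made-up predictions against made-up actual values
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
measures = error_measures(predicted, actual)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

The relative measures divide by the error of always predicting the mean of the actual values, so a value below 1 means the model beats that baseline.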

Page 21: Estimation As A Data Mining Task

Any questions?

Page 22: Estimation As A Data Mining Task

Special Bonus Slide