Estimation As A Data Mining Task
Theoretical Understanding
Vinh Ngo & Mike Ellis
Introduction
• Estimation
  – Predicting values that are not in predetermined categories
• Three main techniques
  – Regression
  – Decision trees
  – Neural networks
Regression
• Linear regression
  – Y = α + βX
• Method of Least Squares
  – β = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
  – α = ȳ − βx̄
• Easy technique to use
  – Excel "Data Analysis"
  – Excel chart example
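The least-squares formulas above can be transcribed directly into a few lines of code. A minimal sketch in Python, using made-up data points:

```python
# Least-squares fit of Y = alpha + beta * X. The data points here are
# invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
# alpha = y_bar - beta * x_bar
alpha = y_bar - beta * x_bar

print(alpha, beta)
```
Excel's "Data Analysis" regression tool computes the same α and β for data like this.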
Excel Linear Regression Example
Other Regression Forms
• Multiple regression
  – Y = α + β₁X₁ + β₂X₂
• Polynomial equation
  – Y = α + β₁X + β₂X² + β₃X³
  – Define new variables X₁ = X, X₂ = X², and X₃ = X³
  – Equation becomes Y = α + β₁X₁ + β₂X₂ + β₃X₃
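The variable-substitution trick reduces polynomial fitting to ordinary multiple regression. A sketch in Python that builds X₁ = X, X₂ = X², X₃ = X³ and solves the normal equations; the sample points are synthetic, generated from a known cubic so the fit should recover its coefficients:

```python
# Polynomial regression by variable substitution: fit the *linear* model
# Y = alpha + b1*X1 + b2*X2 + b3*X3 where X1 = X, X2 = X^2, X3 = X^3.
# Synthetic data from y = 1 + 2x - x^2 + 0.5x^3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]

# Design matrix with columns [1, X1, X2, X3].
X = [[1.0, x, x ** 2, x ** 3] for x in xs]

# Normal equations: (X^T X) b = X^T y.
m = 4
XtX = [[sum(X[k][i] * X[k][j] for k in range(len(xs))) for j in range(m)]
       for i in range(m)]
Xty = [sum(X[k][i] * ys[k] for k in range(len(xs))) for i in range(m)]

# Solve with Gaussian elimination (partial pivoting).
A = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
for col in range(m):
    pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for r in range(col + 1, m):
        f = A[r][col] / A[col][col]
        for c in range(col, m + 1):
            A[r][c] -= f * A[col][c]
b = [0.0] * m
for i in range(m - 1, -1, -1):
    b[i] = (A[i][m] - sum(A[i][j] * b[j] for j in range(i + 1, m))) / A[i][i]

print(b)  # recovers [1, 2, -1, 0.5] up to rounding
```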
Decision Trees
• Regression tree
  – Leaves → average values
• Model tree
  – Leaves → linear regression models
• Discretizing the data
  – Convert continuous data into discrete partitions
  – Threshold values
Threshold Values
• Entropy
  – Measure of purity
  – Entropy(S) = −Σᵢ₌₁ᶜ pᵢ log₂ pᵢ
• Information gain
  – Expected reduction in entropy due to partitioning
  – Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)
  – Maximize for best threshold
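Both formulas are easy to transcribe. A sketch in Python, using an invented 9-positive/5-negative class split for illustration:

```python
from math import log2

def entropy(counts):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(part) / total * entropy(part) for part in partitions)

# 14 examples, 9 positive / 5 negative, split into two partitions by
# some candidate threshold (toy numbers):
gain = info_gain([9, 5], [[6, 1], [3, 4]])
print(round(gain, 3))
```
To find the best threshold for a numeric attribute, this gain would be computed for each candidate split point and the maximum taken.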
CART Algorithm
• Classification And Regression Tree
  – Grow a tree that overfits the data
  – Prune the tree
  – Select the best subtree
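The "grow" phase repeatedly picks the split that most improves the leaves. A minimal sketch of that split search for one numeric attribute, using variance reduction as the purity measure; pruning and subtree selection are omitted, and the data are invented:

```python
# CART-style split search for a regression tree: pick the threshold
# that most reduces the variance (squared error) of the target.

def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_threshold(xs, ys):
    """Try midpoints between sorted x values; return (threshold, reduction)."""
    pairs = sorted(zip(xs, ys))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        reduction = variance(ys) - (len(left) * variance(left)
                    + len(right) * variance(right)) / len(ys)
        if reduction > best[1]:
            best = (t, reduction)
    return best

# Toy data: the target jumps when x crosses between 3 and 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5]
print(best_threshold(xs, ys))  # threshold 3.5 separates the two groups
```
A full CART implementation applies this search over all attributes, recurses on each partition, and then prunes the overgrown tree back.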
Decision Trees
• Strengths
  – Understandable
  – Show which fields are most important
• Weaknesses
  – Intended for discrete data
  – Time needed to grow and prune the tree
Comparison Example
CPU Performance Data
| # | MYCT cycle time (ns) | MMIN main memory min (Kb) | MMAX main memory max (Kb) | CACH cache (Kb) | CHMIN channels min | CHMAX channels max | PRP performance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 |
| 2 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 |
| 3 | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 |
| 4 | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 |
| 5 | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 |
| … | | | | | | | |
| 207 | 125 | 2000 | 8000 | 0 | 2 | 14 | 52 |
| 208 | 480 | 512 | 8000 | 32 | 0 | 0 | 67 |
| 209 | 480 | 1000 | 4000 | 0 | 0 | 0 | 45 |
Linear Regression Result
PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
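Plugging machine 1 from the data slide (MYCT=125, MMIN=256, MMAX=6000, CACH=256, CHMIN=16, CHMAX=128) into this equation:

```python
# Applying the fitted linear regression to machine 1 from the CPU table.
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    return (-56.1 + 0.049 * myct + 0.015 * mmin + 0.006 * mmax
            + 0.630 * cach - 0.270 * chmin + 1.46 * chmax)

print(round(predict_prp(125, 256, 6000, 256, 16, 128), 1))  # → 333.7
```
The prediction, 333.7, is far from machine 1's actual PRP of 198, which shows how a single global line can fit individual machines poorly.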
Regression Tree

[Figure: regression tree for the CPU data. The root splits on CHMIN ≤ 7.5 vs > 7.5; deeper splits test CACH, MMAX, CHMAX, MMIN, and MYCT against thresholds such as CACH ≤ 8.5, MMAX ≤ 28000, and MMAX ≤ 12000. Each leaf predicts the average PRP of its instances, ranging from 18.3 (7 points) and 19.3 (28 points) up to 492 (7 points) and 783 (5 points).]
Model Tree

[Figure: model tree for the CPU data. The root splits on CHMIN ≤ 7.5 vs > 7.5; deeper splits test CACH and MMAX (e.g. CACH ≤ 8.5, MMAX ≤ 4250, MMAX ≤ 28000). Each leaf holds a linear model:]

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN (65 points)
LM2: PRP = 20.3 + 0.004 MMIN − 3.99 CHMIN + 0.946 CHMAX (26 points)
LM3: PRP = 38.1 + 0.012 MMIN (24 points)
LM4: PRP = 19.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX (50 points)
LM5: PRP = 285 − 1.46 MYCT + 1.02 CACH − 9.39 CHMIN (21 points)
LM6: PRP = −65.8 + 0.03 MMIN − 2.94 CHMIN + 4.98 CHMAX (23 points)
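A model tree predicts by routing an instance to a leaf and evaluating that leaf's linear model. For example, machine 207 from the data slide (MMAX=8000, CHMIN=2, actual PRP=52) falls on the CHMIN ≤ 7.5 side of the root; assuming it reaches leaf LM1:

```python
# Evaluating one leaf model of the model tree for machine 207
# (MMAX=8000, CHMIN=2). Which leaf the instance actually reaches
# depends on the full tree; LM1 is assumed here for illustration.
def lm1(mmax, chmin):
    return 8.29 + 0.004 * mmax + 2.77 * chmin

print(round(lm1(8000, 2), 2))  # → 45.83, close to the actual PRP of 52
```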
Side-By-Side

(The linear regression equation, regression tree, and model tree from the three previous slides, repeated side by side for comparison.)
Simple Neural Network
[Figure: a three-layer network. The input layer takes Income, Gender, and Age; a hidden layer combines them through weighted connections; the output layer produces the Prediction. Numbers on the slide (0.8, 0.0, 0.4; 2.0, 3.5, 1.9; 3.6) label node values and connection weights.]
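A forward pass through a network of this shape can be sketched as follows; the weights below are arbitrary illustrative values, not the ones on the slide:

```python
# Forward pass: three inputs (income, gender, age), one sigmoid hidden
# layer, one linear output node. All numbers are invented.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward(inputs, hidden_w, output_w):
    # Each hidden node takes a weighted sum of the inputs...
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_w]
    # ...and the output node takes a weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(output_w, hidden))

inputs = [0.8, 0.0, 0.4]                          # income, gender, age
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # two hidden nodes
output_w = [2.0, 1.5]
print(forward(inputs, hidden_w, output_w))
```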
Building the Neural Net
• Iterative process
  – Assign initial weights
  – Run training values through the network
  – Compare the results to the actual values
• Backpropagation
  – Pass errors back through the net
  – Nodes that contributed to the error get less influence
• Military metaphor
• Recurrent networks
• Genetic algorithms
• Simulated annealing
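One training step of this process can be sketched as follows (sigmoid hidden layer, linear output node, squared error; all numbers are invented):

```python
# A single backpropagation step for a tiny network: each weight moves
# in proportion to its share of the output error.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_step(x, target, hidden_w, output_w, lr=0.1):
    # Forward pass.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_w]
    pred = sum(w * h for w, h in zip(output_w, hidden))
    err = pred - target
    # Backward pass: output weights first, then hidden weights
    # (scaled by the sigmoid derivative h * (1 - h)).
    new_output_w = [w - lr * err * h for w, h in zip(output_w, hidden)]
    new_hidden_w = [[w - lr * err * ow * h * (1 - h) * xi
                     for w, xi in zip(ws, x)]
                    for ws, ow, h in zip(hidden_w, output_w, hidden)]
    return new_hidden_w, new_output_w, err

x, target = [0.8, 0.0, 0.4], 1.0
hw = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
ow = [2.0, 1.5]
for _ in range(100):
    hw, ow, err = train_step(x, target, hw, ow)
print(abs(err))  # the error shrinks toward zero over the steps
```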
Neural Networks
• Strengths
  – Accurate
  – Fast to use
  – Handle missing or corrupt data well
• Weaknesses
  – Not intuitive
  – Don't handle large numbers of predictors well
  – Require data preprocessing
Neural Net Example
• Four years of 30-minute trading data, 1985–1988
  – 1986 & 1987 for training
  – 1985 for testing
• USD/CHF exchange rate
• Single-layer model
  – Input nodes: 7
  – Hidden nodes: 7
• Two-layer model
  – Input nodes: 7
  – Hidden nodes: 5/2
• Output
  – Value between −1.0 and 1.0
  – Rise or fall?
| Model | % correct | Return % |
| --- | --- | --- |
| Random | 50.0 | — |
| Linear | 52.5 | −6.8 |
| 1 hidden layer | 53.4 | 9.9 |
| 2 hidden layers | 54.0 | 9.8 |
| Average network | 53.7 | 11.5 |
Accuracy of Models
• Overfitting
  – Applies to all three techniques
  – Test on independent data
• Statistical measures
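The independent-test-data idea in code: fit on one subset, then measure error on a held-out subset. The data here are synthetic:

```python
# Holdout evaluation: fit a least-squares line on training points,
# then compute RMSE on separate test points (all numbers invented).
def fit(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    beta = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
         / sum((x - xb) ** 2 for x in xs)
    return yb - beta * xb, beta

def rmse(model, xs, ys):
    a, b = model
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

train_x, train_y = [1, 2, 3, 4], [2.1, 4.0, 5.9, 8.1]
test_x, test_y = [5, 6], [10.2, 11.8]

model = fit(train_x, train_y)
print(rmse(model, test_x, test_y))  # error on data the model never saw
```
A model that scores much worse on the held-out points than on the training points is likely overfitting.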
Statistical Measures

mean-squared error: (1/n) Σᵢ₌₁ⁿ (pᵢ − aᵢ)²

root mean-squared error: √[ (1/n) Σᵢ₌₁ⁿ (pᵢ − aᵢ)² ]

mean absolute error: (1/n) Σᵢ₌₁ⁿ |pᵢ − aᵢ|

relative squared error: Σᵢ₌₁ⁿ (pᵢ − aᵢ)² / Σᵢ₌₁ⁿ (aᵢ − ā)², where ā = (1/n) Σᵢ aᵢ

root relative squared error: √[ Σᵢ (pᵢ − aᵢ)² / Σᵢ (aᵢ − ā)² ]

relative absolute error: Σᵢ₌₁ⁿ |pᵢ − aᵢ| / Σᵢ₌₁ⁿ |aᵢ − ā|

correlation coefficient: S_PA / √(S_P · S_A), where S_PA = Σᵢ (pᵢ − p̄)(aᵢ − ā) / (n − 1), S_P = Σᵢ (pᵢ − p̄)² / (n − 1), and S_A = Σᵢ (aᵢ − ā)² / (n − 1)
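A direct transcription of these measures, evaluated on a toy set of predictions:

```python
# The error measures above, computed for predicted values p_i against
# actual values a_i (numbers invented for illustration).
p = [2.5, 0.0, 2.0, 8.0]
a = [3.0, -0.5, 2.0, 7.0]
n = len(a)
a_bar = sum(a) / n
p_bar = sum(p) / n

mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
rmse = mse ** 0.5
mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
rse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) \
    / sum((ai - a_bar) ** 2 for ai in a)
rae = sum(abs(pi - ai) for pi, ai in zip(p, a)) \
    / sum(abs(ai - a_bar) for ai in a)

# Correlation coefficient: S_PA / sqrt(S_P * S_A).
s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
s_p = sum((pi - p_bar) ** 2 for pi in p) / (n - 1)
s_a = sum((ai - a_bar) ** 2 for ai in a) / (n - 1)
corr = s_pa / (s_p * s_a) ** 0.5

print(mse, rmse, mae, rse, rae, corr)
```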
Any questions?
Special Bonus Slide