Statistical Techniques
Chapter 10
10.1 Linear Regression Analysis

Simple Linear Regression

The simple linear regression model relates a single input attribute $x$ to the output $y$:

$$y = ax + b$$

with least-squares estimates of the slope $a$ and intercept $b$:

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}}, \qquad b = \frac{\sum y - a\sum x}{n}$$

The general multiple linear form extends this to $n$ input attributes:

$$f(x_1, x_2, x_3, \ldots, x_n) = a_1x_1 + a_2x_2 + a_3x_3 + \cdots + a_nx_n + c$$
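A minimal sketch of these least-squares estimates in Python, using only the standard library (the function name and the sample data are illustrative):

```python
# Least-squares estimates for the simple linear model y = a*x + b.

def simple_linear_regression(xs, ys):
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: a = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
    # which is algebraically the same as the summation formula above.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x      # the fitted line passes through the means
    return a, b

a, b = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # exact fit: y = 2x + 1
```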
Table 10.1 • District Office Building Data

| Space | Offices | Entrances | Age | Value |
|------:|--------:|----------:|----:|------:|
| 2310 | 2 | 2 | 20 | $142,000 |
| 2333 | 2 | 2 | 12 | $144,000 |
| 2356 | 3 | 1.5 | 33 | $151,000 |
| 2379 | 3 | 2 | 43 | $150,000 |
| 2402 | 2 | 3 | 53 | $139,000 |
| 2425 | 4 | 2 | 23 | $169,000 |
| 2448 | 2 | 1.5 | 99 | $126,000 |
| 2471 | 2 | 2 | 34 | $142,900 |
| 2494 | 3 | 3 | 23 | $163,000 |
| 2517 | 4 | 4 | 55 | $169,000 |
| 2540 | 2 | 3 | 22 | $149,000 |
Multiple Linear Regression with Excel
A Regression Equation for the District Office Building Data

$$\text{Value} = 52317.83 + 27.64\,\text{Space} + 12529.77\,\text{Offices} + 2553.21\,\text{Entrances} - 234.24\,\text{Age}$$

Table 10.2 • Regression Statistics for the Office Building Data (Excel LINEST output: row 1 holds the coefficients from Age back to the intercept, row 2 their standard errors, row 3 the R² value and standard error of the estimate, row 4 the F statistic and degrees of freedom, row 5 the regression and residual sums of squares)

| Age | Entrances | Offices | Space | Intercept |
|---|---|---|---|---|
| –234.2371645 | 2553.211 | 12529.77 | 27.64139 | 52317.83 |
| 13.26801148 | 530.6692 | 400.0668 | 5.429374 | 12237.36 |
| 0.996747993 | 970.5785 | #N/A | #N/A | #N/A |
| 459.7536742 | 6 | #N/A | #N/A | #N/A |
| 1732393319 | 5652135 | #N/A | #N/A | #N/A |
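The same fit can be reproduced outside Excel. A sketch using NumPy's least-squares solver on the Table 10.1 data (the column order and variable names here are assumptions; results should agree with Table 10.2 up to rounding):

```python
# Multiple linear regression for the office building data via least squares.
import numpy as np

# Columns: Space, Offices, Entrances, Age, Value (transcribed from Table 10.1)
data = np.array([
    [2310, 2, 2.0, 20, 142000],
    [2333, 2, 2.0, 12, 144000],
    [2356, 3, 1.5, 33, 151000],
    [2379, 3, 2.0, 43, 150000],
    [2402, 2, 3.0, 53, 139000],
    [2425, 4, 2.0, 23, 169000],
    [2448, 2, 1.5, 99, 126000],
    [2471, 2, 2.0, 34, 142900],
    [2494, 3, 3.0, 23, 163000],
    [2517, 4, 4.0, 55, 169000],
    [2540, 2, 3.0, 22, 149000],
])
X = np.column_stack([np.ones(len(data)), data[:, :4]])  # prepend intercept column
y = data[:, 4]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_space, b_offices, b_entrances, b_age = coef
print(coef)
```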
[Figure: scatter plot of Assessed Value (y-axis, $0 to $180,000) versus Floor Space (x-axis, 2200 to 2600) for the office building data]
Regression Trees

[Figure: a generic regression tree. Internal nodes Test 1 through Test 4 split on < and >= comparisons, and linear regression models LRM1 through LRM5 sit at the leaves]
[Figure: a regression tree for trip-cost data. Internal nodes split on TotCost (<= 246, <= 171, <= 309, <= 390), Amt (<= 136, <= 178, <= 39), and Trips (<= 7.5), with linear regression models LRM1 through LRM9 at the leaves]
10.2 Logistic Regression

Transforming the Linear Regression Model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
The Logistic Regression Model

$$p(y = 1 \mid x) = \frac{e^{ax + c}}{1 + e^{ax + c}}$$

where $e$ is the base of natural logarithms, often denoted as exp.

[Figure: the logistic curve P(y = 1 | x) plotted for x from −6 to 6, rising in an S shape from near 0 toward 1]
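A minimal sketch of the logistic response in Python (the parameter values for a and c are illustrative, not taken from the example that follows):

```python
# The logistic response p(y=1|x) = e^(ax+c) / (1 + e^(ax+c)).
import math

def logistic(x, a=1.0, c=0.0):
    """Conditional probability that y = 1 given x."""
    z = a * x + c
    return math.exp(z) / (1.0 + math.exp(z))

print(logistic(0))                  # 0.5 at the midpoint
print(logistic(6), logistic(-6))    # approaches 1 and 0 at the extremes
```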
Logistic Regression: An Example

$$ax + c = 0.0001\,\text{Income} + 19.827\,\text{CreditCardIns} - 8.314\,\text{Sex} - 0.415\,\text{Age} + 17.691$$
Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion

| Instance | Income | Credit Card Insurance | Sex | Age | Life Insurance Promotion | Computed Probability |
|---:|---|---:|---:|---:|---:|---:|
| 1 | 40K | 0 | 1 | 45 | 0 | 0.007 |
| 2 | 30K | 0 | 0 | 40 | 1 | 0.987 |
| 3 | 40K | 0 | 1 | 42 | 0 | 0.024 |
| 4 | 30K | 1 | 1 | 43 | 1 | 1.000 |
| 5 | 50K | 0 | 0 | 38 | 1 | 0.999 |
| 6 | 20K | 0 | 0 | 55 | 0 | 0.049 |
| 7 | 30K | 1 | 1 | 35 | 1 | 1.000 |
| 8 | 20K | 0 | 1 | 27 | 0 | 0.584 |
| 9 | 30K | 0 | 1 | 43 | 0 | 0.005 |
| 10 | 30K | 0 | 0 | 41 | 1 | 0.981 |
| 11 | 40K | 0 | 0 | 43 | 1 | 0.985 |
| 12 | 20K | 0 | 1 | 29 | 1 | 0.380 |
| 13 | 50K | 0 | 0 | 39 | 1 | 0.999 |
| 14 | 40K | 0 | 1 | 55 | 0 | 0.000 |
| 15 | 20K | 1 | 0 | 19 | 1 | 1.000 |
10.3 Bayes Classifier

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where $H$ is the hypothesis to be tested and $E$ is the evidence associated with $H$.
Bayes Classifier: An Example

Table 10.4 • Data for Bayes Classifier

| Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex |
|---|---|---|---|---|
| Yes | No | No | No | Male |
| Yes | Yes | Yes | Yes | Female |
| No | No | No | No | Male |
| Yes | Yes | Yes | Yes | Male |
| Yes | No | Yes | No | Female |
| No | No | No | No | Female |
| Yes | Yes | Yes | Yes | Male |
| No | No | No | No | Male |
| Yes | No | No | No | Male |
| Yes | Yes | Yes | No | Female |
The Instance to be Classified
Magazine Promotion = Yes
Watch Promotion = Yes
Life Insurance Promotion = No
Credit Card Insurance = No
Sex = ?
Table 10.5 • Counts and Probabilities for Attribute Sex

| Sex | Magazine Male | Magazine Female | Watch Male | Watch Female | Life Ins. Male | Life Ins. Female | Credit Card Male | Credit Card Female |
|---|---|---|---|---|---|---|---|---|
| Yes | 4 | 3 | 2 | 2 | 2 | 3 | 2 | 1 |
| No | 2 | 1 | 4 | 2 | 4 | 1 | 4 | 3 |
| Ratio: yes/total | 4/6 | 3/4 | 2/6 | 2/4 | 2/6 | 3/4 | 2/6 | 1/4 |
| Ratio: no/total | 2/6 | 1/4 | 4/6 | 2/4 | 4/6 | 1/4 | 4/6 | 3/4 |
Computing the Probability for Sex = Male

$$P(\text{sex} = \text{male} \mid E) = \frac{P(E \mid \text{sex} = \text{male})\,P(\text{sex} = \text{male})}{P(E)}$$
Conditional Probabilities for Sex = Male
P(magazine promotion = yes | sex = male) = 4/6
P(watch promotion = yes | sex = male) = 2/6
P(life insurance promotion = no | sex = male) = 4/6
P(credit card insurance = no | sex = male) = 4/6
P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81
The Probability for Sex = Male Given Evidence E

P(sex = male | E) = (8/81)(6/10) / P(E) ≈ 0.0593 / P(E)

The Probability for Sex = Female Given Evidence E

P(sex = female | E) = (9/128)(4/10) / P(E) ≈ 0.0281 / P(E)
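The two scores above can be checked with exact fractions. A sketch using Python's fractions module (the priors 6/10 and 4/10 come from the ten instances of Table 10.4):

```python
# Naive Bayes scores for evidence E, from the Table 10.5 conditionals.
from fractions import Fraction as F

# P(evidence item | sex = male), multiplied under the independence assumption
p_e_given_male = F(4, 6) * F(2, 6) * F(4, 6) * F(4, 6)    # = 8/81
p_e_given_female = F(3, 4) * F(2, 4) * F(1, 4) * F(3, 4)  # = 9/128

# Priors: 6 of the 10 instances are male, 4 are female
score_male = p_e_given_male * F(6, 10)
score_female = p_e_given_female * F(4, 10)

print(p_e_given_male)       # 8/81
print(float(score_male))    # ≈ 0.0593, the numerator of P(sex = male | E)
print(float(score_female))  # ≈ 0.0281
```

Since 0.0593 > 0.0281, the classifier chooses sex = male for this instance.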
Zero-Valued Attribute Counts

When an attribute count is zero, the probability is replaced with

$$\frac{n + kp}{d + k}$$

where

k is a value between 0 and 1 (usually 1)
p is an equal fractional part of the total number of possible values for the attribute
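A sketch of this adjustment in Python. The meanings of n and d are an assumption here (the count for the attribute value within the class, and the number of class instances, respectively); the slide defines only k and p:

```python
# Zero-count adjustment (n + kp) / (d + k) for Bayes classifier probabilities.

def adjusted_probability(n, d, num_values, k=1.0):
    """n: attribute-value count in the class (assumed meaning);
    d: number of class instances (assumed meaning);
    num_values: number of possible values for the attribute."""
    p = 1.0 / num_values          # equal fractional part of the possible values
    return (n + k * p) / (d + k)

# A zero count no longer produces a zero probability:
print(adjusted_probability(0, 6, 2))   # (0 + 0.5) / 7 ≈ 0.0714
```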
Missing Data

With the Bayes classifier, missing data items are ignored.
Numeric Data

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-(x-\mu)^{2}/(2\sigma^{2})}$$

where

e = the exponential function
μ = the class mean for the given numerical attribute
σ = the class standard deviation for the attribute
x = the attribute value
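A direct transcription of the density function (the test value is illustrative):

```python
# The normal probability density used for numeric attributes.
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi)*sigma) * e^(-(x-mu)^2 / (2*sigma^2))"""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0, 0, 1))   # ≈ 0.3989, the peak of the standard normal
```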
Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset

| Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Age | Sex |
|---|---|---|---|---:|---|
| Yes | No | No | No | 45 | Male |
| Yes | Yes | Yes | Yes | 40 | Female |
| No | No | No | No | 42 | Male |
| Yes | Yes | Yes | Yes | 30 | Male |
| Yes | No | Yes | No | 38 | Female |
| No | No | No | No | 55 | Female |
| Yes | Yes | Yes | Yes | 35 | Male |
| No | No | No | No | 27 | Male |
| Yes | No | No | No | 43 | Male |
| Yes | Yes | Yes | No | 41 | Female |
10.4 Clustering Algorithms

Agglomerative Clustering

1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
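The agglomerative steps above can be sketched as follows. The simple-matching similarity and average-linkage merging are illustrative choices (consistent with the worked example that follows), and the tiny dataset is made up:

```python
# Agglomerative clustering: repeatedly merge the two most similar clusters.

def similarity(a, b):
    """Simple matching: fraction of positions where two instances agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_sim(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    scores = [similarity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)

def agglomerate(instances):
    clusters = [[inst] for inst in instances]          # step 1
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                           # step 2
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]             # step 2b
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history                                     # step 3: pick one of these

hist = agglomerate([(0, 0), (0, 1), (1, 1)])
print(len(hist))   # 3 clusterings recorded, of sizes 3, 2, and 1
```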
Agglomerative Clustering: An Example

Table 10.7 • Five Instances from the Credit Card Promotion Database

| Instance | Income Range | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Sex |
|---|---|---|---|---|---|
| I1 | 40–50K | Yes | No | No | Male |
| I2 | 25–35K | Yes | Yes | Yes | Female |
| I3 | 40–50K | No | No | No | Male |
| I4 | 25–35K | Yes | Yes | Yes | Male |
| I5 | 50–60K | Yes | No | Yes | Female |
Table 10.8 • Agglomerative Clustering: First Iteration

|  | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|
| I1 | 1.00 | | | | |
| I2 | 0.20 | 1.00 | | | |
| I3 | 0.80 | 0.00 | 1.00 | | |
| I4 | 0.40 | 0.80 | 0.20 | 1.00 | |
| I5 | 0.40 | 0.60 | 0.20 | 0.40 | 1.00 |

Table 10.9 • Agglomerative Clustering: Second Iteration

|  | I1 I3 | I2 | I4 | I5 |
|---|---|---|---|---|
| I1 I3 | 0.80 | | | |
| I2 | 0.33 | 1.00 | | |
| I4 | 0.47 | 0.80 | 1.00 | |
| I5 | 0.47 | 0.60 | 0.40 | 1.00 |
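The first-iteration scores of Table 10.8 can be reproduced with a simple-matching measure, the fraction of the five attribute values two instances share (the measure itself is inferred from the table, not stated on the slide):

```python
# Reproducing the Table 10.8 similarity scores from the Table 10.7 instances.

instances = {  # Income Range, Magazine, Watch, Life Insurance, Sex
    "I1": ("40-50K", "Yes", "No", "No", "Male"),
    "I2": ("25-35K", "Yes", "Yes", "Yes", "Female"),
    "I3": ("40-50K", "No", "No", "No", "Male"),
    "I4": ("25-35K", "Yes", "Yes", "Yes", "Male"),
    "I5": ("50-60K", "Yes", "No", "Yes", "Female"),
}

def match_score(a, b):
    """Fraction of the five attribute values the two instances share."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(match_score(instances["I1"], instances["I3"]))   # 0.8, the first merge
print(match_score(instances["I1"], instances["I2"]))   # 0.2
```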
Choosing a Final Clustering

• Compare the average within-cluster similarity to the overall similarity
• Compare the similarity within each cluster to the similarity between clusters
• Examine the rule sets generated by each saved clustering
Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level.
a. Place the new instance into an existing cluster.
b. Create a new concept cluster having the new instance as its only member.
Table 10.10 • Data for Conceptual Clustering

| Instance | Tails | Color | Nuclei |
|---|---|---|---|
| I1 | One | Light | One |
| I2 | Two | Light | Two |
| I3 | Two | Dark | Two |
| I4 | One | Dark | Three |
| I5 | One | Light | Two |
| I6 | One | Light | Two |
| I7 | One | Light | Three |
[Figure: a conceptual clustering hierarchy for the Table 10.10 data. The root concept N (P(N) = 7/7) has descendant concepts N1 (P(N1) = 3/7), N2 (P(N2) = 2/7), N3 (P(N3) = 1/3), N4 (P(N4) = 2/7), and N5 (P(N5) = 2/3). Each node stores predictability P(V|C) and predictiveness P(C|V) values for the attribute values of Tails (One, Two), Color (Light, Dark), and Nuclei (One, Two, Three), and the instances I1 through I7 are distributed among the leaf concepts]
COBWEB (Fisher, 1987)

Heuristic measure of partition quality: category utility
Expectation Maximization

1. Similar to the K-Means procedure
2. Makes use of the finite Gaussian mixtures model
3. The mixture model assigns each data instance a probability of membership in each cluster
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
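A sketch of these steps on the Table 3.6 points, with the first two instances used as the (normally random) initial centers so the run is repeatable:

```python
# K-means: assign points to the nearest center, recompute centers, repeat.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, centers):
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # step 5: centers stable, stop
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, centers=[points[0], points[1]])
print(centers)   # one center near the low points, one near the high points
```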
Table 3.6 • K-Means Input Values

| Instance | X | Y |
|---:|---:|---:|
| 1 | 1.0 | 1.5 |
| 2 | 1.0 | 4.5 |
| 3 | 2.0 | 1.5 |
| 4 | 2.0 | 3.5 |
| 5 | 3.0 | 2.5 |
| 6 | 5.0 | 6.0 |
[Figure: scatter plot of the six K-means input instances, with x from 0 to 6 and f(x) from 0 to 7]
Expectation Maximization
1. Guess initial values for the parameters.
2. Until a termination criterion is achieved:
a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
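A compact sketch of this loop for a two-component, one-dimensional Gaussian mixture; the data and all starting values are illustrative:

```python
# EM for a two-component 1-D Gaussian mixture: alternate expectation (2a)
# and maximization (2b) steps.
import math

def pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em(data, mus, sigmas, weights, steps=50):
    for _ in range(steps):                       # step 2: until termination
        # Step 2(a): cluster probability (responsibility) for each instance.
        resp = []
        for x in data:
            scores = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Step 2(b): re-estimate the parameters from the responsibilities.
        for k in range(len(mus)):
            rk = [r[k] for r in resp]
            nk = sum(rk)
            mus[k] = sum(r * x for r, x in zip(rk, data)) / nk
            var = sum(r * (x - mus[k]) ** 2 for r, x in zip(rk, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)   # guard against collapse
            weights[k] = nk / len(data)
    return mus, sigmas, weights

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mus, sigmas, weights = em(data, mus=[0.0, 6.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print(mus)   # the means move toward the two groups near 1 and 5
```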
Table 10.11 • An EM Clustering of Gamma-Ray Burst Data

| Attribute | Statistic | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---:|---:|---:|
| | # Instances | 518 | 340 | 321 |
| Log Fluence | Mean | –5.6670 | –4.8131 | –6.3657 |
| Log Fluence | SD | 0.4088 | 0.5301 | 0.5812 |
| Log HR321 | Mean | 0.0538 | 0.2949 | 0.5478 |
| Log HR321 | SD | 0.3018 | 0.1939 | 0.2766 |
| Log T90 | Mean | 1.2709 | 1.7159 | –0.3794 |
| Log T90 | SD | 0.4906 | 0.3793 | 0.4825 |
10.5 Heuristics or Statistics?

Inductive problem-solving methods:

• Query and visualization techniques
• Machine learning techniques
• Statistical techniques
Query and Visualization Techniques

• Query tools and OLAP tools
  – Unable to find hidden patterns
• Visualization tools
  – Decision trees, bar and pie charts, histograms, maps, surface plot diagrams
  – Applied after a data mining process to help us understand what has been discovered
Machine Learning and Statistical Techniques
1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.
2. Machine learning techniques tend to have a human flavor.
3. Machine learning techniques are better able to deal with missing and noisy data.
4. Most machine learning techniques are able to explain their behavior.
5. Statistical techniques tend to perform poorly on very large datasets.