Statistical Techniques
Chapter 10
10.1 Linear Regression Analysis

Simple Linear Regression

The simple linear regression model relates a single input attribute $x$ to the output $y$:

$$y = ax + b$$

with least-squares estimates of the slope $a$ and intercept $b$:

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}}, \qquad b = \frac{\sum y - a\sum x}{n}$$

The general multiple linear form extends this to $n$ input attributes:

$$f(x_1, x_2, x_3, \ldots, x_n) = a_1x_1 + a_2x_2 + a_3x_3 + \cdots + a_nx_n + c$$
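A minimal sketch of these least-squares estimates in Python, using only the standard library (the function name and the sample data are illustrative):

```python
# Least-squares estimates for the simple linear model y = a*x + b.

def simple_linear_regression(xs, ys):
    """Return (a, b) minimizing the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: a = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
    # which is algebraically the same as the summation formula above.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x      # the fitted line passes through the means
    return a, b

a, b = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # exact fit: y = 2x + 1
```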
Table 10.1 • District Office Building Data

| Space | Offices | Entrances | Age | Value |
|------:|--------:|----------:|----:|------:|
| 2310 | 2 | 2 | 20 | $142,000 |
| 2333 | 2 | 2 | 12 | $144,000 |
| 2356 | 3 | 1.5 | 33 | $151,000 |
| 2379 | 3 | 2 | 43 | $150,000 |
| 2402 | 2 | 3 | 53 | $139,000 |
| 2425 | 4 | 2 | 23 | $169,000 |
| 2448 | 2 | 1.5 | 99 | $126,000 |
| 2471 | 2 | 2 | 34 | $142,900 |
| 2494 | 3 | 3 | 23 | $163,000 |
| 2517 | 4 | 4 | 55 | $169,000 |
| 2540 | 2 | 3 | 22 | $149,000 |
Multiple Linear Regression with Excel
A Regression Equation for the District Office Building Data

$$\text{Value} = 52317.83 + 27.64\,\text{Space} + 12529.77\,\text{Offices} + 2553.21\,\text{Entrances} - 234.24\,\text{Age}$$

Table 10.2 • Regression Statistics for the Office Building Data (Excel LINEST output: row 1 holds the coefficients from Age back to the intercept, row 2 their standard errors, row 3 the R² value and standard error of the estimate, row 4 the F statistic and degrees of freedom, row 5 the regression and residual sums of squares)

| Age | Entrances | Offices | Space | Intercept |
|---|---|---|---|---|
| –234.2371645 | 2553.211 | 12529.77 | 27.64139 | 52317.83 |
| 13.26801148 | 530.6692 | 400.0668 | 5.429374 | 12237.36 |
| 0.996747993 | 970.5785 | #N/A | #N/A | #N/A |
| 459.7536742 | 6 | #N/A | #N/A | #N/A |
| 1732393319 | 5652135 | #N/A | #N/A | #N/A |
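The same fit can be reproduced outside Excel. A sketch using NumPy's least-squares solver on the Table 10.1 data (the column order and variable names here are assumptions; results should agree with Table 10.2 up to rounding):

```python
# Multiple linear regression for the office building data via least squares.
import numpy as np

# Columns: Space, Offices, Entrances, Age, Value (transcribed from Table 10.1)
data = np.array([
    [2310, 2, 2.0, 20, 142000],
    [2333, 2, 2.0, 12, 144000],
    [2356, 3, 1.5, 33, 151000],
    [2379, 3, 2.0, 43, 150000],
    [2402, 2, 3.0, 53, 139000],
    [2425, 4, 2.0, 23, 169000],
    [2448, 2, 1.5, 99, 126000],
    [2471, 2, 2.0, 34, 142900],
    [2494, 3, 3.0, 23, 163000],
    [2517, 4, 4.0, 55, 169000],
    [2540, 2, 3.0, 22, 149000],
])
X = np.column_stack([np.ones(len(data)), data[:, :4]])  # prepend intercept column
y = data[:, 4]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_space, b_offices, b_entrances, b_age = coef
print(coef)
```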
[Figure: scatter plot of Assessed Value (y-axis, $0 to $180,000) versus Floor Space (x-axis, 2200 to 2600) for the office building data]
Regression Trees

[Figure: a generic regression tree. Internal nodes Test 1 through Test 4 split on < and >= comparisons, and linear regression models LRM1 through LRM5 sit at the leaves]
[Figure: a regression tree for trip-cost data. Internal nodes split on TotCost (<= 246, <= 171, <= 309, <= 390), Amt (<= 136, <= 178, <= 39), and Trips (<= 7.5), with linear regression models LRM1 through LRM9 at the leaves]
10.2 Logistic Regression

Transforming the Linear Regression Model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
The Logistic Regression Model

$$p(y = 1 \mid x) = \frac{e^{ax + c}}{1 + e^{ax + c}}$$

where $e$ is the base of natural logarithms, often denoted as exp.

[Figure: the logistic curve P(y = 1 | x) plotted for x from −6 to 6, rising in an S shape from near 0 toward 1]
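A minimal sketch of the logistic response in Python (the parameter values for a and c are illustrative, not taken from the example that follows):

```python
# The logistic response p(y=1|x) = e^(ax+c) / (1 + e^(ax+c)).
import math

def logistic(x, a=1.0, c=0.0):
    """Conditional probability that y = 1 given x."""
    z = a * x + c
    return math.exp(z) / (1.0 + math.exp(z))

print(logistic(0))                  # 0.5 at the midpoint
print(logistic(6), logistic(-6))    # approaches 1 and 0 at the extremes
```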
Logistic Regression: An Example

$$ax + c = 0.0001\,\text{Income} + 19.827\,\text{CreditCardIns} - 8.314\,\text{Sex} - 0.415\,\text{Age} + 17.691$$
Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion

| Instance | Income | Credit Card Insurance | Sex | Age | Life Insurance Promotion | Computed Probability |
|---:|---|---:|---:|---:|---:|---:|
| 1 | 40K | 0 | 1 | 45 | 0 | 0.007 |
| 2 | 30K | 0 | 0 | 40 | 1 | 0.987 |
| 3 | 40K | 0 | 1 | 42 | 0 | 0.024 |
| 4 | 30K | 1 | 1 | 43 | 1 | 1.000 |
| 5 | 50K | 0 | 0 | 38 | 1 | 0.999 |
| 6 | 20K | 0 | 0 | 55 | 0 | 0.049 |
| 7 | 30K | 1 | 1 | 35 | 1 | 1.000 |
| 8 | 20K | 0 | 1 | 27 | 0 | 0.584 |
| 9 | 30K | 0 | 1 | 43 | 0 | 0.005 |
| 10 | 30K | 0 | 0 | 41 | 1 | 0.981 |
| 11 | 40K | 0 | 0 | 43 | 1 | 0.985 |
| 12 | 20K | 0 | 1 | 29 | 1 | 0.380 |
| 13 | 50K | 0 | 0 | 39 | 1 | 0.999 |
| 14 | 40K | 0 | 1 | 55 | 0 | 0.000 |
| 15 | 20K | 1 | 0 | 19 | 1 | 1.000 |
10.3 Bayes Classifier

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where $H$ is the hypothesis to be tested and $E$ is the evidence associated with $H$.
Bayes Classifier: An Example

Table 10.4 • Data for Bayes Classifier

| Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex |
|---|---|---|---|---|
| Yes | No | No | No | Male |
| Yes | Yes | Yes | Yes | Female |
| No | No | No | No | Male |
| Yes | Yes | Yes | Yes | Male |
| Yes | No | Yes | No | Female |
| No | No | No | No | Female |
| Yes | Yes | Yes | Yes | Male |
| No | No | No | No | Male |
| Yes | No | No | No | Male |
| Yes | Yes | Yes | No | Female |
The Instance to be Classified
Magazine Promotion = Yes
Watch Promotion = Yes
Life Insurance Promotion = No
Credit Card Insurance = No
Sex = ?
Table 10.5 • Counts and Probabilities for Attribute Sex

| Sex | Magazine Male | Magazine Female | Watch Male | Watch Female | Life Ins. Male | Life Ins. Female | Credit Card Male | Credit Card Female |
|---|---|---|---|---|---|---|---|---|
| Yes | 4 | 3 | 2 | 2 | 2 | 3 | 2 | 1 |
| No | 2 | 1 | 4 | 2 | 4 | 1 | 4 | 3 |
| Ratio: yes/total | 4/6 | 3/4 | 2/6 | 2/4 | 2/6 | 3/4 | 2/6 | 1/4 |
| Ratio: no/total | 2/6 | 1/4 | 4/6 | 2/4 | 4/6 | 1/4 | 4/6 | 3/4 |
Computing the Probability for Sex = Male

$$P(\text{sex} = \text{male} \mid E) = \frac{P(E \mid \text{sex} = \text{male})\,P(\text{sex} = \text{male})}{P(E)}$$
Conditional Probabilities for Sex = Male
P(magazine promotion = yes | sex = male) = 4/6
P(watch promotion = yes | sex = male) = 2/6
P(life insurance promotion = no | sex = male) = 4/6
P(credit card insurance = no | sex = male) = 4/6
P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81
The Probability for Sex = Male Given Evidence E

P(sex = male | E) = (8/81)(6/10) / P(E) ≈ 0.0593 / P(E)

The Probability for Sex = Female Given Evidence E

P(sex = female | E) = (9/128)(4/10) / P(E) ≈ 0.0281 / P(E)
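The two scores above can be checked with exact fractions. A sketch using Python's fractions module (the priors 6/10 and 4/10 come from the ten instances of Table 10.4):

```python
# Naive Bayes scores for evidence E, from the Table 10.5 conditionals.
from fractions import Fraction as F

# P(evidence item | sex = male), multiplied under the independence assumption
p_e_given_male = F(4, 6) * F(2, 6) * F(4, 6) * F(4, 6)    # = 8/81
p_e_given_female = F(3, 4) * F(2, 4) * F(1, 4) * F(3, 4)  # = 9/128

# Priors: 6 of the 10 instances are male, 4 are female
score_male = p_e_given_male * F(6, 10)
score_female = p_e_given_female * F(4, 10)

print(p_e_given_male)       # 8/81
print(float(score_male))    # ≈ 0.0593, the numerator of P(sex = male | E)
print(float(score_female))  # ≈ 0.0281
```

Since 0.0593 > 0.0281, the classifier chooses sex = male for this instance.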
Zero-Valued Attribute Counts

When an attribute count is zero, the probability is replaced with

$$\frac{n + kp}{d + k}$$

where

k is a value between 0 and 1 (usually 1)
p is an equal fractional part of the total number of possible values for the attribute
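A sketch of this adjustment in Python. The meanings of n and d are an assumption here (the count for the attribute value within the class, and the number of class instances, respectively); the slide defines only k and p:

```python
# Zero-count adjustment (n + kp) / (d + k) for Bayes classifier probabilities.

def adjusted_probability(n, d, num_values, k=1.0):
    """n: attribute-value count in the class (assumed meaning);
    d: number of class instances (assumed meaning);
    num_values: number of possible values for the attribute."""
    p = 1.0 / num_values          # equal fractional part of the possible values
    return (n + k * p) / (d + k)

# A zero count no longer produces a zero probability:
print(adjusted_probability(0, 6, 2))   # (0 + 0.5) / 7 ≈ 0.0714
```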
Missing Data

With the Bayes classifier, missing data items are ignored.
Numeric Data

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\;e^{-(x-\mu)^{2}/(2\sigma^{2})}$$

where

e = the exponential function
μ = the class mean for the given numerical attribute
σ = the class standard deviation for the attribute
x = the attribute value
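A direct transcription of the density function (the test value is illustrative):

```python
# The normal probability density used for numeric attributes.
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi)*sigma) * e^(-(x-mu)^2 / (2*sigma^2))"""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0, 0, 1))   # ≈ 0.3989, the peak of the standard normal
```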
Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset

| Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Age | Sex |
|---|---|---|---|---:|---|
| Yes | No | No | No | 45 | Male |
| Yes | Yes | Yes | Yes | 40 | Female |
| No | No | No | No | 42 | Male |
| Yes | Yes | Yes | Yes | 30 | Male |
| Yes | No | Yes | No | 38 | Female |
| No | No | No | No | 55 | Female |
| Yes | Yes | Yes | Yes | 35 | Male |
| No | No | No | No | 27 | Male |
| Yes | No | No | No | 43 | Male |
| Yes | Yes | Yes | No | 41 | Female |
10.4 Clustering Algorithms

Agglomerative Clustering

1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
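The agglomerative steps above can be sketched as follows. The simple-matching similarity and average-linkage merging are illustrative choices (consistent with the worked example that follows), and the tiny dataset is made up:

```python
# Agglomerative clustering: repeatedly merge the two most similar clusters.

def similarity(a, b):
    """Simple matching: fraction of positions where two instances agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_sim(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    scores = [similarity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)

def agglomerate(instances):
    clusters = [[inst] for inst in instances]          # step 1
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                           # step 2
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]             # step 2b
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history                                     # step 3: pick one of these

hist = agglomerate([(0, 0), (0, 1), (1, 1)])
print(len(hist))   # 3 clusterings recorded, of sizes 3, 2, and 1
```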
Agglomerative Clustering: An Example

Table 10.7 • Five Instances from the Credit Card Promotion Database

| Instance | Income Range | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Sex |
|---|---|---|---|---|---|
| I1 | 40–50K | Yes | No | No | Male |
| I2 | 25–35K | Yes | Yes | Yes | Female |
| I3 | 40–50K | No | No | No | Male |
| I4 | 25–35K | Yes | Yes | Yes | Male |
| I5 | 50–60K | Yes | No | Yes | Female |
Table 10.8 • Agglomerative Clustering: First Iteration

|  | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|
| I1 | 1.00 | | | | |
| I2 | 0.20 | 1.00 | | | |
| I3 | 0.80 | 0.00 | 1.00 | | |
| I4 | 0.40 | 0.80 | 0.20 | 1.00 | |
| I5 | 0.40 | 0.60 | 0.20 | 0.40 | 1.00 |

Table 10.9 • Agglomerative Clustering: Second Iteration

|  | I1 I3 | I2 | I4 | I5 |
|---|---|---|---|---|
| I1 I3 | 0.80 | | | |
| I2 | 0.33 | 1.00 | | |
| I4 | 0.47 | 0.80 | 1.00 | |
| I5 | 0.47 | 0.60 | 0.40 | 1.00 |
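The first-iteration scores of Table 10.8 can be reproduced with a simple-matching measure, the fraction of the five attribute values two instances share (the measure itself is inferred from the table, not stated on the slide):

```python
# Reproducing the Table 10.8 similarity scores from the Table 10.7 instances.

instances = {  # Income Range, Magazine, Watch, Life Insurance, Sex
    "I1": ("40-50K", "Yes", "No", "No", "Male"),
    "I2": ("25-35K", "Yes", "Yes", "Yes", "Female"),
    "I3": ("40-50K", "No", "No", "No", "Male"),
    "I4": ("25-35K", "Yes", "Yes", "Yes", "Male"),
    "I5": ("50-60K", "Yes", "No", "Yes", "Female"),
}

def match_score(a, b):
    """Fraction of the five attribute values the two instances share."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(match_score(instances["I1"], instances["I3"]))   # 0.8, the first merge
print(match_score(instances["I1"], instances["I2"]))   # 0.2
```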
Choosing a Final Clustering

• Compare the average within-cluster similarity to the overall similarity
• Compare the similarity within each cluster to the similarity between clusters
• Examine the rule sets generated by each saved clustering
Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level.
a. Place the new instance into an existing cluster.
b. Create a new concept cluster having the new instance as its only member.
Table 10.10 • Data for Conceptual Clustering

| Instance | Tails | Color | Nuclei |
|---|---|---|---|
| I1 | One | Light | One |
| I2 | Two | Light | Two |
| I3 | Two | Dark | Two |
| I4 | One | Dark | Three |
| I5 | One | Light | Two |
| I6 | One | Light | Two |
| I7 | One | Light | Three |
[Figure: a conceptual clustering hierarchy for the Table 10.10 data. The root concept N (P(N) = 7/7) has descendant concepts N1 (P(N1) = 3/7), N2 (P(N2) = 2/7), N3 (P(N3) = 1/3), N4 (P(N4) = 2/7), and N5 (P(N5) = 2/3). Each node stores predictability P(V|C) and predictiveness P(C|V) values for the attribute values of Tails (One, Two), Color (Light, Dark), and Nuclei (One, Two, Three), and the instances I1 through I7 are distributed among the leaf concepts]
COBWEB (Fisher, 1987)

Heuristic measure of partition quality: category utility
Expectation Maximization

1. Similar to the K-Means procedure
2. Makes use of the finite Gaussian mixtures model
3. The mixture model assigns each data instance a probability of membership in each cluster
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
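A sketch of these steps on the Table 3.6 points, with the first two instances used as the (normally random) initial centers so the run is repeatable:

```python
# K-means: assign points to the nearest center, recompute centers, repeat.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, centers):
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # step 5: centers stable, stop
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, centers=[points[0], points[1]])
print(centers)   # one center near the low points, one near the high points
```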
Table 3.6 • K-Means Input Values

| Instance | X | Y |
|---:|---:|---:|
| 1 | 1.0 | 1.5 |
| 2 | 1.0 | 4.5 |
| 3 | 2.0 | 1.5 |
| 4 | 2.0 | 3.5 |
| 5 | 3.0 | 2.5 |
| 6 | 5.0 | 6.0 |
[Figure: scatter plot of the six K-means input instances, with x from 0 to 6 and f(x) from 0 to 7]
Expectation Maximization
1. Guess initial values for the parameters.
2. Until a termination criterion is achieved:
a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
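A compact sketch of this loop for a two-component, one-dimensional Gaussian mixture; the data and all starting values are illustrative:

```python
# EM for a two-component 1-D Gaussian mixture: alternate expectation (2a)
# and maximization (2b) steps.
import math

def pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em(data, mus, sigmas, weights, steps=50):
    for _ in range(steps):                       # step 2: until termination
        # Step 2(a): cluster probability (responsibility) for each instance.
        resp = []
        for x in data:
            scores = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Step 2(b): re-estimate the parameters from the responsibilities.
        for k in range(len(mus)):
            rk = [r[k] for r in resp]
            nk = sum(rk)
            mus[k] = sum(r * x for r, x in zip(rk, data)) / nk
            var = sum(r * (x - mus[k]) ** 2 for r, x in zip(rk, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)   # guard against collapse
            weights[k] = nk / len(data)
    return mus, sigmas, weights

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mus, sigmas, weights = em(data, mus=[0.0, 6.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print(mus)   # the means move toward the two groups near 1 and 5
```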
Table 10.11 • An EM Clustering of Gamma-Ray Burst Data

| Attribute | Statistic | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---:|---:|---:|
| | # Instances | 518 | 340 | 321 |
| Log Fluence | Mean | –5.6670 | –4.8131 | –6.3657 |
| Log Fluence | SD | 0.4088 | 0.5301 | 0.5812 |
| Log HR321 | Mean | 0.0538 | 0.2949 | 0.5478 |
| Log HR321 | SD | 0.3018 | 0.1939 | 0.2766 |
| Log T90 | Mean | 1.2709 | 1.7159 | –0.3794 |
| Log T90 | SD | 0.4906 | 0.3793 | 0.4825 |
10.5 Heuristics or Statistics?

Inductive problem-solving methods:

• Query and visualization techniques
• Machine learning techniques
• Statistical techniques
Query and Visualization Techniques

• Query tools and OLAP tools
  – Unable to find hidden patterns
• Visualization tools
  – Decision trees, bar and pie charts, histograms, maps, surface plot diagrams
  – Applied after a data mining process to help us understand what has been discovered
Machine Learning and Statistical Techniques
1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.
2. Machine learning techniques tend to have a human flavor.
3. Machine learning techniques are better able to deal with missing and noisy data.
4. Most machine learning techniques are able to explain their behavior.
5. Statistical techniques tend to perform poorly on very large datasets.