Regression Models
• Fit data
  – Time-series data: Forecast
  – Other data: Predict
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
Use in Data Mining
• One of the major analytic models
  – Linear regression
    • The standard – ordinary least squares (OLS) regression
    • Can use for discriminant analysis
    • Can apply stepwise regression
  – Nonlinear regression
    • More complex (but less reliable) data fitting
  – Logistic regression
    • When data are categorical (usually binary)
OLS (Ordinary Least Squares) Model

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

where
  Y is the dependent variable
  β0 is the intercept term
  β1 ... βn are the coefficients for the independent variables
  ε is the error term
OLS Regression
• Uses intercept (β0) and slope coefficients (βi) to minimize squared error terms over all i observations
• Fits the data with a linear model
• Time-series data:
  – Observations over past periods
  – Best-fit line (in terms of minimizing the sum of squared errors)
Regression Output (page 101)

            Coefficient   t       P
Intercept   0.642         0.286   0.776
Week        5.086         53.27   0

R2 = 0.987

Requests = 0.642 + 5.086 * Week
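The fitted line above can be recovered with a minimal OLS sketch. The weekly data here are simulated around the slide's fitted line; the book's raw data on page 101 are not reproduced, so the recovered coefficients will only be close, not exact.

```python
import numpy as np

# Simulated weekly request counts around Requests = 0.642 + 5.086*Week
# (invented data for illustration, not the book's).
rng = np.random.default_rng(0)
weeks = np.arange(1.0, 53.0)                      # one year of weekly observations
requests = 0.642 + 5.086 * weeks + rng.normal(0, 3, weeks.size)

# Ordinary least squares: solve for the intercept and slope that
# minimize the sum of squared errors.
X = np.column_stack([np.ones_like(weeks), weeks])
(intercept, slope), *_ = np.linalg.lstsq(X, requests, rcond=None)
print(f"Requests = {intercept:.3f} + {slope:.3f}*Week")
```

With this much data and little noise, the estimated slope lands very close to the true 5.086; the intercept is estimated less precisely, as its large P-value on the slide suggests.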
[Figure: Time-Series Forecast – actual Requests and the Regression Model line plotted against Week (0 to 60), with Requests ranging from 0 to 300]
Regression Tests
• FIT:
  – SSE – sum of squared errors
    • Synonym: SSR – sum of squared residuals
  – R2 – proportion of variance explained by the model
  – Adjusted R2 – adjusts the calculation to penalize for the number of independent variables
• Significance:
  – F-test – test of overall model significance
  – t-test – test of significant difference between a model coefficient and zero
  – P – probability that the coefficient is zero
    • (or at least on the other side of zero from the coefficient)
Regression Model Tests
• SSE (sum of squared errors)
  – For each observation, subtract the model value from the observed value, square the difference, and total over all observations
  – By itself means nothing
  – Can compare across models (lower is better)
  – Can use to evaluate the proportion of variance in the data explained by the model
• R2
  – Ratio of the explained (regression) sum of squares to the total sum of squares (SST)
    • SST = explained sum of squares + SSE
  – 0 ≤ R2 ≤ 1
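The SSE/SST decomposition above can be computed directly; the observed values and model predictions below are toy numbers invented for illustration, not the book's.

```python
import numpy as np

# Toy observed values and matching model predictions (invented).
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])      # observed
y_hat = np.array([2.1, 4.0, 6.0, 8.0, 10.1])  # model predictions

sse = np.sum((y - y_hat) ** 2)     # sum of squared errors (residuals)
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - sse / sst          # proportion of variance explained, in [0, 1]
```

Because SSE depends on the scale of the data, it only supports comparisons between models of the same data; R2 normalizes it against SST so the result is interpretable on its own.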
Multiple Regression
• Can include more than one independent variable
  – Trade-off:
    • Too many variables – many spurious variables, overlapping information
    • Too few variables – miss important content
  – Adding variables will always increase R2
  – Adjusted R2 penalizes for additional independent variables
Example: Hiring Data
• Dependent variable – Sales
• Independent variables:
  – Years of Education
  – College GPA
  – Age
  – Gender
  – College Degree
Regression Model

Sales = 269025
        - 17148 * YrsEd    P = 0.175
        -  7172 * GPA      P = 0.812
        +  4331 * Age      P = 0.116
        - 23581 * Male     P = 0.266
        + 31001 * Degree   P = 0.450

R2 = 0.252   Adj R2 = -0.015
• Weak model – no independent variable significant at 0.10
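The drop from R2 = 0.252 to a negative adjusted R2 follows from the usual penalty formula. The slide does not state the sample size, but its R2 / adjusted-R2 pairs are consistent with about 20 observations, which is assumed below.

```python
def adjusted_r2(r2, n, k):
    """Penalize R^2 for the k independent variables, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Assuming n = 20 observations (an inference, not stated on the slide):
# with k = 5 variables, R^2 = 0.252 drops below zero once penalized.
print(adjusted_r2(0.252, 20, 5))   # ≈ -0.015
```

The same assumed n = 20 reproduces the improved model's figures on the next slide (R2 = 0.218 with k = 3 gives adjusted R2 ≈ 0.070), which is why the three-variable model is preferred despite its lower raw R2.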
Improved Regression Model

Sales = 173284
        -  9991 * YrsEd   P = 0.098*
        +  3537 * Age     P = 0.141
        - 18730 * Male    P = 0.328

R2 = 0.218   Adj R2 = 0.070
Logistic Regression
• Data often ordinal or nominal
• Regression based on continuous numbers is not appropriate
  – Need dummy variables
• Binary – either are or are not
  – LOGISTIC REGRESSION (probability of either 1 or 0)
• Two or more categories
  – DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
Logistic Regression
• For dependent variables that are nominal or ordinal
• Probability of acceptance of case i to class j
• Sigmoidal function
  – (in English, an S curve from 0 to 1)

Pj = 1 / (1 + e^-(β0 + Σ βi xi))
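The S curve above is the standard logistic (sigmoid) function applied to the linear score; a minimal sketch:

```python
import math

def logistic(score):
    # Maps a linear regression score (beta0 + sum of beta_i * x_i)
    # to a probability strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-score))
```

A score of 0 maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0, producing the S shape from 0 to 1.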
Insurance Claim Model

Fraud = 81.824
        -  2.778 * Age             P = 0.789
        - 75.893 * Male            P = 0.758
        +  0.017 * Claim           P = 0.757
        - 36.648 * Tickets         P = 0.824
        +  6.914 * Prior           P = 0.935
        - 29.362 * Attorney Smith  P = 0.776

• Can get a probability by running the score through the logistic formula
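Running a score through the logistic formula might look like this, using the slide's fitted coefficients; the claimant's attribute values are invented for illustration, not taken from the book's data.

```python
import math

# Fitted coefficients from the slide's insurance claim model.
intercept = 81.824
coef = {"age": -2.778, "male": -75.893, "claim": 0.017,
        "tickets": -36.648, "prior": 6.914, "attorney_smith": -29.362}

# A hypothetical claimant (invented values).
claimant = {"age": 30, "male": 1, "claim": 2500,
            "tickets": 0, "prior": 1, "attorney_smith": 0}

# Linear score, then the logistic transform to get a probability.
score = intercept + sum(coef[k] * claimant[k] for k in coef)
fraud_probability = 1.0 / (1.0 + math.exp(-score))
```

For this claimant the linear score is strongly negative, so the model assigns a fraud probability very close to zero.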
Linear Discriminant Analysis
• Group objects into a predetermined set of outcome classes
• Regression is one means of performing discriminant analysis
  – 2 groups: find a cutoff for the regression score
  – More than 2 groups: multiple cutoffs
Centroid Method (NOT regression)
• Binary data
• Divide the training set into two groups by binary outcome
  – Standardize the data to remove scales
• Identify the mean of each independent variable by group (the CENTROID)
• Calculate a distance function
Fraud Data

Age   Claim   Tickets   Prior   Outcome
52    2000    0         1       OK
38    1800    0         0       OK
19    600     2         2       OK
21    5600    1         2       Fraud
41    4200    1         2       Fraud
Standardized & Sorted Fraud Data

Age     Claim   Tickets   Prior   Outcome
1       0.60    1         0.5     0
0.9     0.64    1         1       0
0       0.88    0         0       0
0.633   0.707   0.667     0.500   (centroid, group 0)
0.05    0       1         0       1
1       0.16    1         0       1
0.525   0.080   1.000     0.000   (centroid, group 1)
Distance Calculations

          New    To group 0              To group 1
Age       0.50   (0.633-0.5)^2 = 0.018   (0.525-0.5)^2 = 0.001
Claim     0.30   (0.707-0.3)^2 = 0.166   (0.08-0.3)^2  = 0.048
Tickets   0      (0.667-0)^2   = 0.445   (1-0)^2       = 1.000
Prior     1      (0.5-1)^2     = 0.250   (0-1)^2       = 1.000
Totals           0.879                   2.049

• The new case is closer to the group 0 (OK) centroid, so it is classified as OK
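The table's calculation can be reproduced directly; the squared-distance totals match the slide's values to rounding.

```python
# Group centroids from the standardized fraud data (Age, Claim, Tickets, Prior).
centroid_ok = [0.633, 0.707, 0.667, 0.500]     # group 0 (OK)
centroid_fraud = [0.525, 0.080, 1.000, 0.000]  # group 1 (Fraud)
new_case = [0.50, 0.30, 0.0, 1.0]              # case to classify

def squared_distance(centroid, case):
    # Sum of squared differences across the standardized variables.
    return sum((c - x) ** 2 for c, x in zip(centroid, case))

d_ok = squared_distance(centroid_ok, new_case)        # ≈ 0.879
d_fraud = squared_distance(centroid_fraud, new_case)  # ≈ 2.049
# The new case is closer to the OK centroid, so classify it as OK.
```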
Discriminant Analysis with Regression
Standardized data, binary outcomes

Intercept       0.430   P = 0.670
Age            -0.421   P = 0.671
Gender          0.333   P = 0.733
Claim          -0.648   P = 0.469
Tickets         0.584   P = 0.566
Prior Claims   -1.091   P = 0.399
Attorney        0.573   P = 0.607

• R2 = 0.804
• Cutoff (average of group averages): 0.429
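The cutoff rule (the average of the two groups' average fitted scores) is easy to sketch; the scores below are invented for illustration, not the book's fitted values, so the cutoff differs from the slide's 0.429.

```python
# Regression scores for training cases whose outcomes are known (invented).
ok_scores = [0.05, 0.10, 0.20, 0.15]   # known group 0 (OK)
fraud_scores = [0.80, 0.75, 0.95]      # known group 1 (Fraud)

# Cutoff: midpoint between the two group means of fitted scores.
cutoff = (sum(ok_scores) / len(ok_scores)
          + sum(fraud_scores) / len(fraud_scores)) / 2

def classify(score):
    # Scores at or above the cutoff are classified as fraud (1), else OK (0).
    return 1 if score >= cutoff else 0
```

With more than two groups, the same idea extends to multiple cutoffs between adjacent group means.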
Case: Stepwise Regression
• Stepwise regression
  – Automatic selection of independent variables
    • Look at F scores of simple regressions
    • Add the variable with the greatest F statistic
    • Check partial F scores for adding each variable not in the model
    • Delete variables no longer significant
    • If no external variable is significant, quit
• Considered inferior to selection of variables by experts
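The selection loop above can be sketched as forward selection on partial F scores (the deletion step is omitted; the data and the F threshold of 4.0 are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))   # three candidate independent variables
# Only columns 0 and 1 actually drive y; column 2 is pure noise.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, n)

def sse(cols):
    # SSE of an OLS fit of y on an intercept plus the selected columns.
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

selected, remaining = [], [0, 1, 2]
while remaining:
    base = sse(selected)
    # Partial F statistic for adding each candidate not yet in the model.
    f_scores = {c: (base - sse(selected + [c]))
                   / (sse(selected + [c]) / (n - len(selected) - 2))
                for c in remaining}
    best = max(f_scores, key=f_scores.get)
    if f_scores[best] < 4.0:   # roughly the F(1, n-k) critical value at 0.05
        break
    selected.append(best)
    remaining.remove(best)
# selected now holds the variables that survived the partial F test.
```

On this synthetic data the two genuine predictors are picked up, which illustrates the mechanism; the slide's caution stands, since with many candidate variables the same loop will happily admit spurious ones.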
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
• Data on 244,000 credit card accounts
  – 12-month period
  – 1 percent default rate
  – Cost of granting a loan that defaults: almost $5,000
  – Cost of denying a loan that would have paid: about $50
Data Treatment
• Divided observations into 5 groups
  – Used one for training
  – Any smaller would have had problems due to insufficient default cases
  – Used 80% of the data for detailed testing
• Regression performed better than the C5 model
  – Even though C5 used costs and regression didn't
Summary
• Regression is a basic classical model
  – Many forms
• Logistic regression very useful in data mining
  – Often have binary outcomes
  – Can also use on categorical data
• Can use for discriminant analysis
  – To classify