Regression Models
• Fit data
  – Time-series data: Forecast
  – Other data: Predict
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
Use in Data Mining
• One of the major analytic models
  – Linear regression
    • The standard – ordinary least squares (OLS) regression
    • Can use for discriminant analysis
    • Can apply stepwise regression
  – Nonlinear regression
    • More complex (but less reliable) data fitting
  – Logistic regression
    • When data are categorical (usually binary)
OLS (Ordinary Least Squares) Model

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

where
  Y is the dependent variable
  β0 is the intercept term
  β1 ... βn are the coefficients for the independent variables
  ε is the error term
OLS Regression
• Uses intercept (β0) and slope coefficients (βi) to minimize squared error terms over all i observations
• Fits the data with a linear model
• Time-series data:
  – Observations over past periods
  – Best-fit line (in terms of minimizing the sum of squared errors)
Regression Output (page 101)

            Coefficient   t       P
Intercept   0.642         0.286   0.776
Week        5.086         53.27   0

R2 = 0.987

Requests = 0.642 + 5.086 * Week
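The fitted line above can be recovered with a minimal OLS sketch. The weekly data here are simulated around the slide's fitted line; the book's raw data on page 101 are not reproduced, so the recovered coefficients will only be close, not exact.

```python
import numpy as np

# Simulated weekly request counts around Requests = 0.642 + 5.086*Week
# (invented data for illustration, not the book's).
rng = np.random.default_rng(0)
weeks = np.arange(1.0, 53.0)                      # one year of weekly observations
requests = 0.642 + 5.086 * weeks + rng.normal(0, 3, weeks.size)

# Ordinary least squares: solve for the intercept and slope that
# minimize the sum of squared errors.
X = np.column_stack([np.ones_like(weeks), weeks])
(intercept, slope), *_ = np.linalg.lstsq(X, requests, rcond=None)
print(f"Requests = {intercept:.3f} + {slope:.3f}*Week")
```

With this much data and little noise, the estimated slope lands very close to the true 5.086; the intercept is estimated less precisely, as its large P-value on the slide suggests.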
[Figure: Time-Series Forecast – actual Requests and the Regression Model line plotted against Week (0 to 60), with Requests ranging from 0 to 300]
Regression Tests
• FIT:
  – SSE – sum of squared errors
    • Synonym: SSR – sum of squared residuals
  – R2 – proportion of variance explained by the model
  – Adjusted R2 – adjusts the calculation to penalize for the number of independent variables
• Significance:
  – F-test – test of overall model significance
  – t-test – test of significant difference between a model coefficient and zero
  – P – probability that the coefficient is zero
    • (or at least on the other side of zero from the coefficient)
Regression Model Tests
• SSE (sum of squared errors)
  – For each observation, subtract the model value from the observed value, square the difference, and total over all observations
  – By itself means nothing
  – Can compare across models (lower is better)
  – Can use to evaluate the proportion of variance in the data explained by the model
• R2
  – Ratio of the explained (regression) sum of squares to the total sum of squares (SST)
    • SST = explained sum of squares + SSE
  – 0 ≤ R2 ≤ 1
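The SSE/SST decomposition above can be computed directly; the observed values and model predictions below are toy numbers invented for illustration, not the book's.

```python
import numpy as np

# Toy observed values and matching model predictions (invented).
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])      # observed
y_hat = np.array([2.1, 4.0, 6.0, 8.0, 10.1])  # model predictions

sse = np.sum((y - y_hat) ** 2)     # sum of squared errors (residuals)
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - sse / sst          # proportion of variance explained, in [0, 1]
```

Because SSE depends on the scale of the data, it only supports comparisons between models of the same data; R2 normalizes it against SST so the result is interpretable on its own.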
Multiple Regression
• Can include more than one independent variable
  – Trade-off:
    • Too many variables – many spurious variables, overlapping information
    • Too few variables – miss important content
  – Adding variables will always increase R2
  – Adjusted R2 penalizes for additional independent variables
Example: Hiring Data
• Dependent variable – Sales
• Independent variables:
  – Years of Education
  – College GPA
  – Age
  – Gender
  – College Degree
Regression Model

Sales = 269025
        - 17148 * YrsEd    P = 0.175
        -  7172 * GPA      P = 0.812
        +  4331 * Age      P = 0.116
        - 23581 * Male     P = 0.266
        + 31001 * Degree   P = 0.450

R2 = 0.252   Adj R2 = -0.015
• Weak model – no independent variable significant at 0.10
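The drop from R2 = 0.252 to a negative adjusted R2 follows from the usual penalty formula. The slide does not state the sample size, but its R2 / adjusted-R2 pairs are consistent with about 20 observations, which is assumed below.

```python
def adjusted_r2(r2, n, k):
    """Penalize R^2 for the k independent variables, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Assuming n = 20 observations (an inference, not stated on the slide):
# with k = 5 variables, R^2 = 0.252 drops below zero once penalized.
print(adjusted_r2(0.252, 20, 5))   # ≈ -0.015
```

The same assumed n = 20 reproduces the improved model's figures on the next slide (R2 = 0.218 with k = 3 gives adjusted R2 ≈ 0.070), which is why the three-variable model is preferred despite its lower raw R2.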
Improved Regression Model

Sales = 173284
        -  9991 * YrsEd   P = 0.098*
        +  3537 * Age     P = 0.141
        - 18730 * Male    P = 0.328

R2 = 0.218   Adj R2 = 0.070
Logistic Regression
• Data often ordinal or nominal
• Regression based on continuous numbers is not appropriate
  – Need dummy variables
• Binary – either are or are not
  – LOGISTIC REGRESSION (probability of either 1 or 0)
• Two or more categories
  – DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
Logistic Regression
• For dependent variables that are nominal or ordinal
• Probability of acceptance of case i to class j
• Sigmoidal function
  – (in English, an S curve from 0 to 1)

Pj = 1 / (1 + e^-(β0 + Σ βi xi))
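The S curve above is the standard logistic (sigmoid) function applied to the linear score; a minimal sketch:

```python
import math

def logistic(score):
    # Maps a linear regression score (beta0 + sum of beta_i * x_i)
    # to a probability strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-score))
```

A score of 0 maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0, producing the S shape from 0 to 1.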
Insurance Claim Model

Fraud = 81.824
        -  2.778 * Age             P = 0.789
        - 75.893 * Male            P = 0.758
        +  0.017 * Claim           P = 0.757
        - 36.648 * Tickets         P = 0.824
        +  6.914 * Prior           P = 0.935
        - 29.362 * Attorney Smith  P = 0.776

• Can get a probability by running the score through the logistic formula
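Running a score through the logistic formula might look like this, using the slide's fitted coefficients; the claimant's attribute values are invented for illustration, not taken from the book's data.

```python
import math

# Fitted coefficients from the slide's insurance claim model.
intercept = 81.824
coef = {"age": -2.778, "male": -75.893, "claim": 0.017,
        "tickets": -36.648, "prior": 6.914, "attorney_smith": -29.362}

# A hypothetical claimant (invented values).
claimant = {"age": 30, "male": 1, "claim": 2500,
            "tickets": 0, "prior": 1, "attorney_smith": 0}

# Linear score, then the logistic transform to get a probability.
score = intercept + sum(coef[k] * claimant[k] for k in coef)
fraud_probability = 1.0 / (1.0 + math.exp(-score))
```

For this claimant the linear score is strongly negative, so the model assigns a fraud probability very close to zero.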
Linear Discriminant Analysis
• Group objects into a predetermined set of outcome classes
• Regression is one means of performing discriminant analysis
  – 2 groups: find a cutoff for the regression score
  – More than 2 groups: multiple cutoffs
Centroid Method (NOT regression)
• Binary data
• Divide the training set into two groups by binary outcome
  – Standardize the data to remove scales
• Identify the mean of each independent variable by group (the CENTROID)
• Calculate a distance function
Fraud Data

Age   Claim   Tickets   Prior   Outcome
52    2000    0         1       OK
38    1800    0         0       OK
19    600     2         2       OK
21    5600    1         2       Fraud
41    4200    1         2       Fraud
Standardized & Sorted Fraud Data

Age     Claim   Tickets   Prior   Outcome
1       0.60    1         0.5     0
0.9     0.64    1         1       0
0       0.88    0         0       0
0.633   0.707   0.667     0.500   (centroid, group 0)
0.05    0       1         0       1
1       0.16    1         0       1
0.525   0.080   1.000     0.000   (centroid, group 1)
Distance Calculations

          New    To group 0              To group 1
Age       0.50   (0.633-0.5)^2 = 0.018   (0.525-0.5)^2 = 0.001
Claim     0.30   (0.707-0.3)^2 = 0.166   (0.08-0.3)^2  = 0.048
Tickets   0      (0.667-0)^2   = 0.445   (1-0)^2       = 1.000
Prior     1      (0.5-1)^2     = 0.250   (0-1)^2       = 1.000
Totals           0.879                   2.049

• The new case is closer to the group 0 (OK) centroid, so it is classified as OK
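The table's calculation can be reproduced directly; the squared-distance totals match the slide's values to rounding.

```python
# Group centroids from the standardized fraud data (Age, Claim, Tickets, Prior).
centroid_ok = [0.633, 0.707, 0.667, 0.500]     # group 0 (OK)
centroid_fraud = [0.525, 0.080, 1.000, 0.000]  # group 1 (Fraud)
new_case = [0.50, 0.30, 0.0, 1.0]              # case to classify

def squared_distance(centroid, case):
    # Sum of squared differences across the standardized variables.
    return sum((c - x) ** 2 for c, x in zip(centroid, case))

d_ok = squared_distance(centroid_ok, new_case)        # ≈ 0.879
d_fraud = squared_distance(centroid_fraud, new_case)  # ≈ 2.049
# The new case is closer to the OK centroid, so classify it as OK.
```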
Discriminant Analysis with Regression
Standardized data, binary outcomes

Intercept       0.430   P = 0.670
Age            -0.421   P = 0.671
Gender          0.333   P = 0.733
Claim          -0.648   P = 0.469
Tickets         0.584   P = 0.566
Prior Claims   -1.091   P = 0.399
Attorney        0.573   P = 0.607

• R2 = 0.804
• Cutoff (average of group averages): 0.429
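The cutoff rule (the average of the two groups' average fitted scores) is easy to sketch; the scores below are invented for illustration, not the book's fitted values, so the cutoff differs from the slide's 0.429.

```python
# Regression scores for training cases whose outcomes are known (invented).
ok_scores = [0.05, 0.10, 0.20, 0.15]   # known group 0 (OK)
fraud_scores = [0.80, 0.75, 0.95]      # known group 1 (Fraud)

# Cutoff: midpoint between the two group means of fitted scores.
cutoff = (sum(ok_scores) / len(ok_scores)
          + sum(fraud_scores) / len(fraud_scores)) / 2

def classify(score):
    # Scores at or above the cutoff are classified as fraud (1), else OK (0).
    return 1 if score >= cutoff else 0
```

With more than two groups, the same idea extends to multiple cutoffs between adjacent group means.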
Case: Stepwise Regression
• Stepwise regression
  – Automatic selection of independent variables
    • Look at F scores of simple regressions
    • Add the variable with the greatest F statistic
    • Check partial F scores for adding each variable not in the model
    • Delete variables no longer significant
    • If no external variable is significant, quit
• Considered inferior to selection of variables by experts
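The selection loop above can be sketched as forward selection on partial F scores (the deletion step is omitted; the data and the F threshold of 4.0 are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))   # three candidate independent variables
# Only columns 0 and 1 actually drive y; column 2 is pure noise.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, n)

def sse(cols):
    # SSE of an OLS fit of y on an intercept plus the selected columns.
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

selected, remaining = [], [0, 1, 2]
while remaining:
    base = sse(selected)
    # Partial F statistic for adding each candidate not yet in the model.
    f_scores = {c: (base - sse(selected + [c]))
                   / (sse(selected + [c]) / (n - len(selected) - 2))
                for c in remaining}
    best = max(f_scores, key=f_scores.get)
    if f_scores[best] < 4.0:   # roughly the F(1, n-k) critical value at 0.05
        break
    selected.append(best)
    remaining.remove(best)
# selected now holds the variables that survived the partial F test.
```

On this synthetic data the two genuine predictors are picked up, which illustrates the mechanism; the slide's caution stands, since with many candidate variables the same loop will happily admit spurious ones.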
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
• Data on 244,000 credit card accounts
  – 12-month period
  – 1 percent default rate
  – Cost of granting a loan that defaults: almost $5,000
  – Cost of denying a loan that would have paid: about $50
Data Treatment
• Divided observations into 5 groups
  – Used one for training
  – Any smaller would have had problems due to insufficient default cases
  – Used 80% of the data for detailed testing
• Regression performed better than the C5 model
  – Even though C5 used costs and regression didn't
Summary
• Regression is a basic classical model
  – Many forms
• Logistic regression very useful in data mining
  – Often have binary outcomes
  – Can also use on categorical data
• Can use for discriminant analysis
  – To classify