
Upload: ebtgf

Post on 24-Oct-2014


SW388R7 Data Analysis & Computers II

Slide 1

Discriminant Analysis – Basic Relationships

Discriminant Functions and Scores

Describing Relationships

Classification Accuracy

Sample Problems


Slide 2

Discriminant analysis

Discriminant analysis is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables.

Discriminant analysis attempts to use the independent variables to distinguish among the groups or categories of the dependent variable.

The usefulness of a discriminant model is based upon its accuracy rate, or ability to predict the known group memberships in the categories of the dependent variable.


Slide 3

Discriminant scores

Discriminant analysis works by creating a new variable called the discriminant function score which is used to predict to which group a case belongs.

Discriminant function scores are computed similarly to factor scores, i.e. using eigenvalues. The computations find the coefficients for the independent variables that maximize the measure of distance between the groups defined by the dependent variable.

The discriminant function is similar to a regression equation in which the independent variables are multiplied by coefficients and summed to produce a score.
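As a minimal sketch of this idea, the weighted sum can be written out directly. The coefficients, constant, and case values below are hypothetical and chosen only for illustration; they are not taken from any SPSS output in these slides.

```python
def discriminant_score(values, coefficients, constant=0.0):
    """Return constant + sum(coefficient * value) for a single case."""
    if len(values) != len(coefficients):
        raise ValueError("values and coefficients must have the same length")
    return constant + sum(c * v for c, v in zip(coefficients, values))


# Hypothetical coefficients and one hypothetical case (illustration only).
coefficients = [0.5, -0.25]
case = [4.0, 2.0]
score = discriminant_score(case, coefficients, constant=-1.0)
print(score)  # -1.0 + 0.5*4.0 - 0.25*2.0 = 0.5
```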


Slide 4

Discriminant functions

Conceptually, we can think of the discriminant function or equation as defining the boundary between groups.

Discriminant scores are standardized, so that a score on one side of the boundary (a standard score less than zero) predicts that the case is a member of one group, and a score on the other side of the boundary (a positive standard score) predicts that it is a member of the other group.


Slide 5

Number of functions

If the dependent variable defines two groups, one statistically significant discriminant function is required to distinguish the groups; if the dependent variable defines three groups, two statistically significant discriminant functions are required to distinguish among the three groups; etc.

If a discriminant function is able to distinguish among groups, it must have a strong relationship to at least one of the independent variables.

The number of possible discriminant functions in an analysis is limited to the smaller of the number of independent variables or one less than the number of groups defined by the dependent variable.
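The limit on the number of functions is a one-line rule. The helper name below is hypothetical; the sketch simply applies the rule stated above to a three-group and a two-group analysis with four independent variables.

```python
def max_discriminant_functions(n_groups, n_independent_vars):
    """Smaller of (number of groups - 1) and the number of independent variables."""
    return min(n_groups - 1, n_independent_vars)


print(max_discriminant_functions(3, 4))  # 2
print(max_discriminant_functions(2, 4))  # 1
```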


Slide 6

Overall test of relationship

The overall test of relationship among the independent variables and groups defined by the dependent variable is a series of tests that each of the functions needed to distinguish among the groups is statistically significant.

In some analyses, we might discover that two or more of the groups defined by the dependent variable cannot be distinguished using the available independent variables. While it is reasonable to interpret a solution in which there are fewer significant discriminant functions than the maximum number possible, our problems will require that all of the possible discriminant functions be significant.


Slide 7

Interpreting the relationship between independent and dependent variables

The interpretative statement about the relationship between the independent variable and the dependent variable is a statement like: cases in group A tended to have higher scores on variable X than cases in group B or group C.

This interpretation is complicated by the fact that the relationship is not direct, but operates through the discriminant function.

Dependent variable groups are distinguished by scores on discriminant functions, not on values of independent variables. The scores on functions are based on the values of the independent variables that are multiplied by the function coefficients.


Slide 8

Groups, functions, and variables

To interpret the relationship between an independent variable and the dependent variable, we must first identify how the discriminant functions separate the groups, and then identify the role of the independent variable in each function.

SPSS provides a table called "Functions at Group Centroids" (multivariate means) that indicates which groups are separated by which functions.

SPSS provides another table called the "Structure Matrix" which, like its counterpart in factor analysis, identifies the loading, or correlation, between each independent variable and each function. This tells us which variables to interpret for each function. Each variable is interpreted on the function that it loads most highly on.


Slide 9

Functions at Group Centroids

WELFARE            Function 1   Function 2
1 TOO LITTLE          -.220        .235
2 ABOUT RIGHT          .446       -.031
3 TOO MUCH            -.311       -.362

Unstandardized canonical discriminant functions evaluated at group means

In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (positive value of 0.446) from survey respondents who thought we spend too much money (negative value of -0.311) or too little money (negative value of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.


Slide 10

Structure Matrix

                                      Function 1   Function 2
HIGHEST YEAR OF SCHOOL COMPLETED         .687*        .136
NUMBER OF HOURS WORKED LAST WEEK        -.582*        .345
R SELF-EMP OR WORKS FOR SOMEBODY         .223         .889*
RESPONDENTS INCOME(a)                    .101         .292*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished survey respondents who thought we spend about the right amount of money on welfare from survey respondents who thought we spend too much or too little money on welfare, were highest year of school completed (r=0.687) and number of hours worked in the past week (r=-0.582).

Based on the structure matrix, the predictor variable strongly associated with discriminant function 2, which distinguished survey respondents who thought we spend too little money on welfare from survey respondents who thought we spend too much money on welfare, was self-employment (r=0.889).

We do not interpret loadings in the structure matrix unless their absolute value is 0.30 or higher.
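The selection rule can be sketched in a few lines: assign each variable to the function on which its absolute loading is largest, and keep it only if that loading reaches 0.30. The loadings are the ones shown in the structure matrix on this slide; the shortened variable names and the helper name are assumptions made for this illustration.

```python
# Loadings (function 1, function 2) copied from the structure matrix above.
structure_matrix = {
    "highest year of school completed": (0.687, 0.136),
    "number of hours worked last week": (-0.582, 0.345),
    "self-employed or works for somebody": (0.223, 0.889),
    "respondent's income": (0.101, 0.292),
}


def variables_to_interpret(matrix, cutoff=0.30):
    """Map each interpretable variable to the function it loads on most highly."""
    result = {}
    for name, loadings in matrix.items():
        # Index of the function with the largest absolute loading.
        best = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
        if abs(loadings[best]) >= cutoff:
            result[name] = best + 1  # functions are numbered from 1
    return result


for name, function in variables_to_interpret(structure_matrix).items():
    print(f"{name}: interpret on function {function}")
```

Income is dropped because its highest loading (0.292) falls below the cutoff, and hours worked is interpreted on function 1 even though its loading on function 2 (0.345) also exceeds 0.30.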


Slide 11

Group Statistics

                                                                          Valid N (listwise)
WELFARE          Variable                            Mean   Std. Deviation   Unweighted   Weighted
1 TOO LITTLE     NUMBER OF HOURS WORKED LAST WEEK    43.96      13.240            56        56.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    13.73       2.401            56        56.000
                 R SELF-EMP OR WORKS FOR SOMEBODY     1.93        .260            56        56.000
                 RESPONDENTS INCOME                  13.70       5.034            56        56.000
2 ABOUT RIGHT    NUMBER OF HOURS WORKED LAST WEEK    37.90      13.235            50        50.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    14.78       2.558            50        50.000
                 R SELF-EMP OR WORKS FOR SOMEBODY     1.90        .303            50        50.000
                 RESPONDENTS INCOME                  14.00       5.503            50        50.000
3 TOO MUCH       NUMBER OF HOURS WORKED LAST WEEK    42.03      10.456            32        32.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    13.38       2.524            32        32.000
                 R SELF-EMP OR WORKS FOR SOMEBODY     1.75        .440            32        32.000
                 RESPONDENTS INCOME                  14.75       5.304            32        32.000
Total            NUMBER OF HOURS WORKED LAST WEEK    41.32      12.846           138       138.000

The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).

This enables us to make the statement: "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare."


Slide 12

Which independent variables to interpret

In a simultaneous discriminant analysis, in which all independent variables are entered together, we only interpret the relationships for independent variables that have a loading of 0.30 or higher on one or more discriminant functions. A variable can have a high loading on more than one function, which complicates the interpretation. We will interpret the variable for the function on which it has the highest loading.

In a stepwise discriminant analysis, we limit the interpretation of relationships between independent variables and groups defined by the dependent variable to those independent variables that met the statistical test for inclusion in the analysis.


Slide 13

Discriminant analysis and classification

Discriminant analysis consists of two stages: in the first stage, the discriminant functions are derived; in the second stage, the discriminant functions are used to classify the cases.

While discriminant analysis does compute correlation measures to estimate the strength of the relationship, these correlations measure the relationship between the independent variables and the discriminant scores.

A more useful measure to assess the utility of a discriminant model is classification accuracy, which compares predicted group membership based on the discriminant model to the actual, known group membership which is the value for the dependent variable.


Slide 14

Evaluating usefulness for discriminant models

The benchmark that we will use to characterize a discriminant model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.

Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the squares.


Slide 15

Comparing accuracy rates

To characterize our model as useful, we compare the cross-validated accuracy rate produced by SPSS to 25% more than the proportional by chance accuracy.

The cross-validated accuracy rate is a one-at-a-time hold out method that classifies each case based on a discriminant solution for all of the other cases in the analysis. It is a more realistic estimate of the accuracy rate we should expect in the population because discriminant analysis inflates accuracy rates when the cases classified are the same cases used to derive the discriminant functions.
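The hold-out logic can be illustrated with a deliberately simplified stand-in for discriminant analysis: one predictor, two groups, and a rule that assigns each case to the nearer group mean. This is a sketch under those assumptions, not what SPSS computes, and the data values are hypothetical.

```python
def nearest_mean_classifier(train):
    """Fit on (value, group) pairs; return a function that predicts a group."""
    sums, counts = {}, {}
    for value, group in train:
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    means = {g: sums[g] / counts[g] for g in sums}
    return lambda value: min(means, key=lambda g: abs(value - means[g]))


def resubstitution_accuracy(cases):
    """Classify the same cases that were used to derive the rule."""
    predict = nearest_mean_classifier(cases)
    return sum(int(predict(v) == g) for v, g in cases) / len(cases)


def leave_one_out_accuracy(cases):
    """Classify each case with a rule derived from all of the other cases."""
    hits = 0
    for i, (value, group) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]
        predict = nearest_mean_classifier(training)
        hits += int(predict(value) == group)
    return hits / len(cases)


# Hypothetical data: two cases (5.0 and 5.2) sit close to the boundary.
cases = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (5.0, "A"),
         (5.2, "B"), (7.0, "B"), (8.0, "B"), (9.0, "B")]

print(resubstitution_accuracy(cases))  # 1.0
print(leave_one_out_accuracy(cases))   # 0.75
```

In this toy data the two boundary cases are classified correctly only when they help define their own group mean, which is the kind of inflation the cross-validated rate is meant to remove.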

Cross-validated accuracy rates are not produced by SPSS when separate covariance matrices are used in the classification, which we address more next week.


Slide 16

Computing by chance accuracy

The proportion of cases in each group defined by the dependent variable is reported in the table "Prior Probabilities for Groups".

Prior Probabilities for Groups

                           Cases Used in Analysis
WELFARE          Prior     Unweighted   Weighted
1 TOO LITTLE      .406          56        56.000
2 ABOUT RIGHT     .362          50        50.000
3 TOO MUCH        .232          32        32.000
Total            1.000         138       138.000

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350).

A 25% increase over this would require that our cross-validated accuracy be 43.7% (1.25 x 35.0% = 43.7%).
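The arithmetic on this slide can be reproduced in a few lines. The prior probabilities come from the table above; the variable names are chosen only for this sketch.

```python
# Prior probabilities from the "Prior Probabilities for Groups" table.
priors = [0.406, 0.362, 0.232]

# Proportional by chance accuracy: sum of the squared group proportions.
chance_accuracy = sum(p ** 2 for p in priors)

# Benchmark for a useful model: 25% better than chance.
criterion = 1.25 * chance_accuracy

print(round(chance_accuracy, 3))   # 0.35
print(round(criterion * 100, 1))   # 43.7
```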


Slide 17

Classification Results (b, c)

                                                       Predicted Group Membership
                               WELFARE            1 TOO LITTLE   2 ABOUT RIGHT   3 TOO MUCH    Total
Original             Count     1 TOO LITTLE            43             15              6          64
                               2 ABOUT RIGHT           26             30              6          62
                               3 TOO MUCH              17             10              9          36
                               Ungrouped cases          3              3              2           8
                     %         1 TOO LITTLE           67.2           23.4            9.4       100.0
                               2 ABOUT RIGHT          41.9           48.4            9.7       100.0
                               3 TOO MUCH             47.2           27.8           25.0       100.0
                               Ungrouped cases        37.5           37.5           25.0       100.0
Cross-validated (a)  Count     1 TOO LITTLE            43             15              6          64
                               2 ABOUT RIGHT           26             30              6          62
                               3 TOO MUCH              17             11              8          36
                     %         1 TOO LITTLE           67.2           23.4            9.4       100.0
                               2 ABOUT RIGHT          41.9           48.4            9.7       100.0
                               3 TOO MUCH             47.2           30.6           22.2       100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

b. 50.6% of original grouped cases correctly classified.

c. 50.0% of cross-validated grouped cases correctly classified.

Comparing the cross-validated accuracy rate

SPSS reports the cross-validated accuracy rate in the footnotes to the table "Classification Results." The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than the proportional by chance accuracy criterion of 43.7%.
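The accuracy rates in the footnotes can be checked from the counts in the classification table: the correctly classified cases lie on the diagonal. The counts are taken from the table on this slide (grouped cases only); the function name is an assumption for this sketch.

```python
# Rows are actual groups, columns are predicted groups (grouped cases only).
original = [[43, 15, 6],
            [26, 30, 6],
            [17, 10, 9]]
cross_validated = [[43, 15, 6],
                   [26, 30, 6],
                   [17, 11, 8]]


def accuracy_rate(table):
    """Share of cases on the diagonal, i.e. predicted group equals actual group."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


criterion = 0.437  # 25% above the proportional by chance accuracy rate

print(round(accuracy_rate(original) * 100, 1))         # 50.6
print(round(accuracy_rate(cross_validated) * 100, 1))  # 50.0
print(accuracy_rate(cross_validated) >= criterion)     # True
```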


Slide 18

Problem 1

1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year.

Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic


Slide 19

Dissecting problem 1 - 1


For these problems, we will assume that there is no problem with missing data, violation of assumptions, or outliers.

In this problem, we are told to use 0.05 as alpha for the discriminant analysis.


Slide 20

Dissecting problem 1 - 2


When a problem states that a list of independent variables can distinguish among groups, we do a discriminant analysis entering all of the variables simultaneously.

The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98].

The variable used to define groups is the dependent variable (DV): "seen x-rated movie in last year" [xmovie].


Slide 21

Dissecting problem 1 - 3


The problem identifies two groups for the dependent variable:

• survey respondents who had seen an x-rated movie in the last year
• survey respondents who had not seen an x-rated movie in the last year

To distinguish between two groups, the analysis will be required to find one statistically significant discriminant function.


Slide 22

Dissecting problem 1 - 4


The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for age will be lower for respondents who had seen an x-rated movie in the last year.

In order for the problem statement to be true, we must have enough statistically significant functions to distinguish among the groups, the classification accuracy rate must be substantially better than could be obtained by chance alone, and each significant relationship must be interpreted correctly.


Slide 23

LEVEL OF MEASUREMENT - 1


Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous. "Seen x-rated movie in last year" [xmovie] is a dichotomous variable, which satisfies the level of measurement requirement for the dependent variable.

It contains two categories: survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year.


Slide 24

LEVEL OF MEASUREMENT - 2


"Income" [rincom98] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

"Age" [age] and "highest year of school completed" [educ] are interval level variables, which satisfies the level of measurement requirements for discriminant analysis.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in discriminant analysis.


Slide 25

Request simultaneous discriminant analysis

Select the Classify | Discriminant… command from the Analyze menu.


Slide 26

Selecting the dependent variable

First, highlight the dependent variable xmovie in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.


Slide 27

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range… button.


Slide 28

Completing the range of group values

The value labels for xmovie show two categories:

1 = YES
2 = NO

The range of values that we need to enter goes from 1 as the minimum to 2 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 2 in the Maximum text box.

Third, click on the Continue button to close the dialog box.


Slide 29

Selecting the independent variables

Move the independent variables listed in the problem to the Independents list box.


Slide 30

Specifying the method for including variables

SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem states that there is a relationship without requesting the best predictors, we accept the default to Enter independents together.


Slide 31

Requesting statistics for the output

Click on the Statistics… button to select statistics we will need for the analysis.


Slide 32

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box’s M checkbox. Box’s M statistic evaluates conformity to the assumption of homogeneity of group variances (equality of the group covariance matrices).

Fourth, click on the Continue button to close the dialog box.


Slide 33

Specifying details for classification

Click on the Classify… button to specify details for the classification phase of the analysis.


Slide 34

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.


Slide 35

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.


Slide 36

Details for classification - 3

Fifth, accept the default of the Within-groups option button on the Use Covariance Matrix panel. The covariance matrices are the measure of the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box’s M), our option is to use the Separate-groups covariance matrices in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.


Slide 37

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.


Slide 38

Analysis Case Processing Summary

Unweighted Cases                                                      N     Percent
Valid                                                                119      44.1
Excluded   Missing or out-of-range group codes                        49      18.1
           At least one missing discriminating variable               66      24.4
           Both missing or out-of-range group codes and
           at least one missing discriminating variable               36      13.3
           Total                                                     151      55.9
Total                                                                270     100.0

Sample size – ratio of cases to variables

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 119 valid cases and 4 independent variables. The ratio of cases to independent variables is 29.75 to 1, which satisfies the minimum requirement. In addition, the ratio of 29.75 to 1 satisfies the preferred ratio of 20 to 1.
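The ratio check is simple division against the two thresholds named above. The counts come from this slide; the function name is hypothetical.

```python
def case_to_variable_ratio(n_valid_cases, n_independent_vars):
    """Number of valid cases per independent variable."""
    return n_valid_cases / n_independent_vars


ratio = case_to_variable_ratio(119, 4)
meets_minimum = ratio >= 5     # minimum ratio of 5 to 1
meets_preferred = ratio >= 20  # preferred ratio of 20 to 1

print(ratio, meets_minimum, meets_preferred)  # 29.75 True True
```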


Slide 39

Prior Probabilities for Groups

                     Cases Used in Analysis
XMOVIE     Prior     Unweighted   Weighted
1           .311          37        37.000
2           .689          82        82.000
Total      1.000         119       119.000

Sample size – minimum group size

If the sample size did not initially satisfy the minimum requirements, discriminant analysis is not appropriate.

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably should be 20 or more.

The number of cases in the smallest group in this problem is 37, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.
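The two group-size rules can be sketched as a small Python helper. The name `group_size_check` is hypothetical; the group sizes are taken from the table of prior probabilities.

```python
# Minimum group size check, following the two rules stated on the slide:
#   required:  smallest group > number of independent variables
#   preferred: smallest group >= 20 cases
def group_size_check(group_sizes, n_predictors, preferred=20):
    smallest = min(group_sizes)
    return smallest > n_predictors, smallest >= preferred

# Groups of 37 and 82 cases, 4 independent variables.
meets_required, meets_preferred_size = group_size_check([37, 82], 4)
print(meets_required, meets_preferred_size)
```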

Slide 40

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.

In this analysis there were 2 groups defined by seen x-rated movie in last year and 4 independent variables, so the maximum possible number of discriminant functions was 1.
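As a quick illustration of the rule, a one-line Python function (the name is ours) reproduces the count for this problem:

```python
# Maximum number of discriminant functions:
# the smaller of (number of groups - 1) and the number of independent variables.
def max_discriminant_functions(n_groups: int, n_predictors: int) -> int:
    return min(n_groups - 1, n_predictors)

max_functions = max_discriminant_functions(n_groups=2, n_predictors=4)
print(max_functions)  # 1
```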

Slide 41

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda, which tested functions for statistical significance, the direct analysis identified 1 discriminant function that was statistically significant. The Wilks' lambda statistic for the test of function 1 (chi-square=24.159) had a probability of <0.001, which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 1 discriminant function.
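The reported probability can be checked by hand. The sketch below assumes the chi-square test has 4 degrees of freedom (number of independent variables times one less than the number of groups); the slide does not state the degrees of freedom, so treat that value as our assumption. For even degrees of freedom the upper-tail probability has a closed form, so only the standard library is needed.

```python
import math

def chi2_upper_tail_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable, valid for even df only.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# chi-square = 24.159 from the slide; df = 4 is an assumption (4 predictors, 2 groups).
p_value = chi2_upper_tail_even_df(24.159, 4)
print(p_value < 0.001)
```

Under that assumption the probability is well below 0.001, which agrees with the value SPSS reports.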

Slide 42

Functions at Group Centroids

XMOVIE     Function 1
1               -.714
2                .322

Unstandardized canonical discriminant functions evaluated at group means

Independent variables and group membership:

relationship of functions to groups

In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Each function divides the groups into two subgroups by assigning negative values to one subgroup and positive values to the other subgroup. Function 1 separates survey respondents who had seen an x-rated movie in the last year (-.714) from survey respondents who had not seen an x-rated movie in the last year (.322).

Slide 43

Structure Matrix

Variable     Function 1
SEX                .770
AGE                .467
EDUC               .118
RINCOM98           .044

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

Independent variables and group membership:

predictor loadings on functions

Based on the structure matrix, the predictor variables strongly associated with discriminant function 1 which distinguished between survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year were age (r=0.467) and sex (r=0.770).

We do not interpret loadings in the structure matrix unless they are 0.30 or higher.
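The 0.30 rule can be applied mechanically. In this Python sketch the loadings are copied from the structure matrix; comparing on absolute value is our assumption, made because loadings can be negative and the table is ordered by absolute size.

```python
# Keep only the predictors whose structure-matrix loading is 0.30 or higher
# in absolute value (the absolute-value comparison is an assumption).
loadings = {"SEX": 0.770, "AGE": 0.467, "EDUC": 0.118, "RINCOM98": 0.044}
THRESHOLD = 0.30

interpretable = {name: r for name, r in loadings.items() if abs(r) >= THRESHOLD}
print(sorted(interpretable))
```

Only sex and age pass the threshold, which matches the two predictors interpreted for function 1.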

Slide 44

Group Statistics

                                               Valid N (listwise)
XMOVIE   Variable     Mean   Std. Deviation   Unweighted   Weighted
1        AGE         37.24           10.838           37     37.000
         EDUC        13.86            2.720           37     37.000
         SEX          1.27             .450           37     37.000
         RINCOM98    13.76            5.209           37     37.000
2        AGE         42.70           11.461           82     82.000
         EDUC        14.18            2.534           82     82.000
         SEX          1.65             .481           82     82.000
         RINCOM98    14.00            5.308           82     82.000
Total    AGE         41.00           11.508          119    119.000
         EDUC        14.08            2.586          119    119.000
         SEX          1.53             .501          119    119.000
         RINCOM98    13.92            5.256          119    119.000

Independent variables and group membership:

predictors associated with first function - 1

The average age for survey respondents who had seen an x-rated movie in the last year (mean=37.24) was lower than the average age for survey respondents who had not seen an x-rated movie in the last year (mean=42.70).

This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year."

Slide 45

Group Statistics

                                               Valid N (listwise)
XMOVIE   Variable     Mean   Std. Deviation   Unweighted   Weighted
1        AGE         37.24           10.838           37     37.000
         EDUC        13.86            2.720           37     37.000
         SEX          1.27             .450           37     37.000
         RINCOM98    13.76            5.209           37     37.000
2        AGE         42.70           11.461           82     82.000
         EDUC        14.18            2.534           82     82.000
         SEX          1.65             .481           82     82.000
         RINCOM98    14.00            5.308           82     82.000
Total    AGE         41.00           11.508          119    119.000
         EDUC        14.08            2.586          119    119.000
         SEX          1.53             .501          119    119.000
         RINCOM98    13.92            5.256          119    119.000

Independent variables and group membership:

predictors associated with first function - 2

Since sex is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to male and 2 corresponds to female. The lower mean for survey respondents who had seen an x-rated movie in the last year (mean=1.27), when compared to the mean for survey respondents who had not seen an x-rated movie in the last year (mean=1.65), implies that the group contained more survey respondents who were male and fewer survey respondents who were female.

This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year."
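Because sex is coded 1 for male and 2 for female, the group mean minus 1 is the proportion of females in the group. A short Python sketch (the helper name `share_female` is ours) converts the two reported means into proportions:

```python
# With sex coded 1 = male and 2 = female, (mean - 1) is the proportion female.
def share_female(mean_sex: float) -> float:
    return mean_sex - 1.0

groups = [("had seen an x-rated movie", 1.27),
          ("had not seen an x-rated movie", 1.65)]

for label, mean_sex in groups:
    female = share_female(mean_sex)
    print(f"{label}: {1 - female:.0%} male, {female:.0%} female")
```

On these means, roughly 73% of the first group and 35% of the second group were male, which is the comparison the interpretation relies on.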

Slide 46

Prior Probabilities for Groups

                     Cases Used in Analysis
XMOVIE     Prior     Unweighted    Weighted
1           .311             37      37.000
2           .689             82      82.000
Total      1.000            119     119.000

CLASSIFICATION USING THE DISCRIMINANT MODEL:

by chance accuracy rate

The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.311² + 0.689² = 0.571).
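The computation can be sketched in Python as a sum of squared priors; the function name is our own.

```python
# Proportional by chance accuracy: square each group's prior probability
# (its proportion of cases) and sum the squares.
def proportional_chance_accuracy(priors):
    return sum(p ** 2 for p in priors)

chance_rate = proportional_chance_accuracy([0.311, 0.689])
print(f"{chance_rate:.3f}")  # 0.571
```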

Slide 47

Classification Results (b, c)

                                                 Predicted Group Membership
                              XMOVIE                   1         2     Total
Original              Count   1                       15        22        37
                              2                       12        70        82
                              Ungrouped cases         13        36        49
                      %       1                     40.5      59.5     100.0
                              2                     14.6      85.4     100.0
                              Ungrouped cases       26.5      73.5     100.0
Cross-validated (a)   Count   1                       15        22        37
                              2                       12        70        82
                      %       1                     40.5      59.5     100.0
                              2                     14.6      85.4     100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

b. 71.4% of original grouped cases correctly classified.

c. 71.4% of cross-validated grouped cases correctly classified.

CLASSIFICATION USING THE DISCRIMINANT MODEL:

criteria for classification accuracy

The cross-validated accuracy rate computed by SPSS was 71.4%, which was greater than or equal to the proportional by chance accuracy criterion of 71.4% (1.25 x 57.1% = 71.4%).

The criterion for classification accuracy is satisfied.
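The comparison can be reproduced from the cross-validated counts in the classification table. This Python sketch rounds both sides to one decimal place, as the slide does; the two values are nearly equal, so the outcome is sensitive to that rounding. The variable names are ours.

```python
# Cross-validated counts: actual group -> predicted group -> number of cases.
cross_validated = {1: {1: 15, 2: 22},
                   2: {1: 12, 2: 70}}

correct = sum(cross_validated[g][g] for g in cross_validated)        # diagonal
total = sum(sum(row.values()) for row in cross_validated.values())   # all cases

accuracy_pct = round(100 * correct / total, 1)

# Criterion: 25% above the proportional by chance accuracy rate of 57.1%.
chance_pct = 57.1
criterion_pct = round(1.25 * chance_pct, 1)

meets_criterion = accuracy_pct >= criterion_pct
print(accuracy_pct, criterion_pct, meets_criterion)
```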

Slide 48

Answering the question in problem 1 - 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year.

Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

We found one statistically significant discriminant function, making it possible to distinguish between the two groups defined by the dependent variable.

Moreover, the cross-validated classification accuracy met the by chance accuracy criterion, supporting the utility of the model.

Slide 49

Answering the question in problem 1 - 2

We verified that each statement about the relationship between predictors and groups was correct.

The answer to the question is true with caution.

A caution is added because of the inclusion of ordinal level variables.

Slide 50

Problem 2

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer.

Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Slide 51

Dissecting problem 2 - 1

When a problem asks us to identify the best or most useful predictors from a list of independent variables, we do stepwise discriminant analysis.

The variables listed first in the problem statement are the independent variables (IVs): "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend].

The variable used to define groups is the dependent variable (DV): "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect]

Slide 52

Dissecting problem 2 - 2

The problem identifies two groups for the dependent variable:

•survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby
•survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

To distinguish between two groups, the analysis must find one statistically significant discriminant function.

The importance of predictors is based upon the stepwise addition of variables to the analysis.

Slide 53

Dissecting problem 2 - 3

The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for frequency of prayer will be lower for respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby compared to survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

In a stepwise analysis, we only interpret the independent variables that are entered in the stepwise analysis.

For a statement based on a stepwise analysis to be true, we must have enough statistically significant functions to distinguish among the groups, the order of entry must be correct, and each significant relationship must be interpreted correctly.

Slide 54

LEVEL OF MEASUREMENT - 1

Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.

"Attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is a nominal level variable, which satisfies the level of measurement requirement.

Slide 55

LEVEL OF MEASUREMENT - 2

"Respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

Slide 56

Request stepwise discriminant analysis

Select the Classify | Discriminant… command from the Analyze menu.

Slide 57

Selecting the dependent variable

First, highlight the dependent variable abdefect in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.

Slide 58

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range… button.

Slide 59

Completing the range of group values

The value labels for abdefect show two categories:

1 = YES
2 = NO

The range of values that we need to enter runs from 1 as the minimum to 2 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 2 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

Slide 60

Selecting the independent variables

Move the independent variables listed in the problem to the Independents list box.

Slide 61

Specifying the method for including variables

SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem calls for identifying the best predictors, we click on the option button to Use stepwise method.

Slide 62

Requesting statistics for the output

Click on the Statistics… button to select statistics we will need for the analysis.

Slide 63

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box’s M checkbox. Box’s M statistic evaluates conformity to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

Slide 64

Specifying details for the stepwise method

Click on the Method… button to specify the specific statistical criteria to use for including variables.

Slide 65

Details for the stepwise method

First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table when a new variable is added.

Third, click on the option button Use probability of F so that we can incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The Removal value is twice as large as the Entry value.

Fifth, click on the Continue button to close the dialog box.

Slide 66

Specifying details for classification

Click on the Classify… button to specify details for the classification phase of the analysis.

Slide 67

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.

Slide 68

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.
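The hold-one-out idea can be illustrated outside SPSS. The Python sketch below is not the SPSS procedure: it swaps in a simple nearest-group-mean classifier on a single score in place of the discriminant functions, and the data are hypothetical toy values, not cases from GSS2000.sav. It only shows the loop structure, in which each case is classified by a rule derived from all of the other cases.

```python
from statistics import mean

def nearest_mean_label(x, training):
    """Stand-in classifier: assign x to the group whose training mean is closest."""
    groups = {}
    for value, label in training:
        groups.setdefault(label, []).append(value)
    return min(groups, key=lambda label: abs(x - mean(groups[label])))

def leave_one_out_accuracy(cases):
    hits = 0
    for i, (x, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]   # every case except case i
        if nearest_mean_label(x, training) == label:
            hits += 1
    return hits / len(cases)

# Hypothetical toy data: (score, group) pairs invented for illustration.
toy_cases = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (3.2, "A"),
             (3.0, "B"), (4.0, "B"), (4.5, "B"), (5.0, "B")]

loo_accuracy = leave_one_out_accuracy(toy_cases)
print(loo_accuracy)
```

Because the held-out case never contributes to the rule that classifies it, the resulting accuracy is less optimistic than classifying the same cases used to derive the rule.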

Slide 69

Details for classification - 3

Fifth, accept the default Within-groups option button on the Use Covariance Matrix panel. The covariance matrices measure the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box’s M), our option is to use the Separate-groups covariance matrices in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.

Slide 70

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.

Slide 71

Analysis Case Processing Summary

Unweighted Cases                                                N    Percent
Valid                                                          77       28.5
Excluded  Missing or out-of-range group codes                  41       15.2
          At least one missing discriminating variable        105       38.9
          Both missing or out-of-range group codes and
          at least one missing discriminating variable         47       17.4
          Total                                               193       71.5
Total                                                         270      100.0

Sample size – ratio of cases to variables

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 77 valid cases and 3 independent variables. The ratio of cases to independent variables is 25.67 to 1, which satisfies the minimum requirement. In addition, the ratio of 25.67 to 1 satisfies the preferred ratio of 20 to 1.

Slide 72

Prior Probabilities for Groups

STRONG CHANCE OF               Cases Used in Analysis
SERIOUS DEFECT       Prior     Unweighted    Weighted
1                     .831             64      64.000
2                     .169             13      13.000
Total                1.000             77      77.000

Sample size – minimum group size

If the sample size did not initially satisfy the minimum requirements, discriminant analysis is not appropriate.

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and the smallest group should preferably contain 20 or more cases.

The number of cases in the smallest group in this problem is 13, which is larger than the number of independent variables (3), satisfying the minimum requirement. However, the number of cases in the smallest group is less than the preferred minimum of 20 cases. A caution should be added to the interpretation of the analysis.

Slide 73

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.

In this analysis there were 2 groups defined by attitude toward abortion when there is a strong chance of serious defect in the baby and 3 independent variables, so the maximum possible number of discriminant functions was 1.

Slide 74

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda, which tested functions for statistical significance, the stepwise analysis identified 1 discriminant function that was statistically significant. The Wilks' lambda statistic for the test of function 1 (chi-square=3.887) had a probability of 0.049, which was less than or equal to the level of significance of 0.05.

The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 1 discriminant function.
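The reported probability of 0.049 can be checked with the standard library alone. This Python sketch assumes the test has 1 degree of freedom (one variable entered by the stepwise method, two groups); the slide does not state the degrees of freedom, so that value is our assumption. With 1 degree of freedom, the upper-tail chi-square probability equals the complementary error function of the square root of half the statistic.

```python
import math

def chi2_upper_tail_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# chi-square = 3.887 from the slide; df = 1 is an assumption.
p_value_stepwise = chi2_upper_tail_df1(3.887)
print(f"{p_value_stepwise:.3f}")
print(p_value_stepwise <= 0.05)
```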

Slide 75

Functions at Group Centroids

STRONG CHANCE OF
SERIOUS DEFECT       Function 1
1                          .103
2                         -.507

Unstandardized canonical discriminant functions evaluated at group means

Independent variables and group membership:

relationship of functions to groups

In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Each function divides the groups into two subgroups by assigning negative values to one subgroup and positive values to the other subgroup. Function 1 separates survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (-.507) from survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (.103).

Page 76: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 76

Variables Entered/Removed (a,b,c,d)

                                Min. D Squared               Exact F
Step  Entered                   Statistic  Between Groups    Statistic  df1  df2     Sig.
1     HOW OFTEN DOES R PRAY     .372       1 and 2           4.017      1    75.000  .049

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 6.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.

Independent variables and group membership:

which predictors to interpret

When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed.

The stepwise method of variable selection identified 1 variable that satisfied the level of significance of 0.05. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was:

•frequency of prayer.

Had we used simultaneous entry of all variables, we would not have imposed this limitation.

Page 77: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 77

Structure Matrix

Variable      Function 1
PRAY           1.000
ATTEND (a)     -.511
FUND (a)        .336

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
a. This variable not used in the analysis.

Independent variables and group membership:

predictor loadings on functions

Based on the structure matrix, the predictor variable strongly associated with discriminant function 1 which distinguished between survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby and survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby was frequency of prayer (r=1.000).

The correlation of 1.0 is an artifact of having only one statistically significant variable.

While we would normally interpret loadings in the structure matrix if they are 0.30 or higher, when we do stepwise analysis, we limit ourselves to the variables that were statistically significant.

Page 78: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 78

Group Statistics

                                              Valid N (listwise)
ABDEFECT  Variable  Mean   Std. Deviation     Unweighted  Weighted
1         ATTEND    3.05   2.627              64          64.000
          PRAY      3.05   1.608              64          64.000
          FUND      2.03    .776              64          64.000
2         ATTEND    4.23   2.948              13          13.000
          PRAY      2.08   1.498              13          13.000
          FUND      1.69    .630              13          13.000
Total     ATTEND    3.25   2.701              77          77.000
          PRAY      2.88   1.622              77          77.000
          FUND      1.97    .760              77          77.000

Independent variables and group membership:

predictors associated with first function - 1

The average frequency of prayer for survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (mean=2.08) was lower than the average frequency of prayer for survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby (mean=3.05). Frequency of prayer is an ordinal level variable that is coded so that higher numeric values are associated with survey respondents who prayed less often.

The relationship that "survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby" is supported.

Page 79: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 79

Prior Probabilities for Groups

                    Cases Used in Analysis
ABDEFECT  Prior     Unweighted  Weighted
1          .831     64          64.000
2          .169     13          13.000
Total     1.000     77          77.000

CLASSIFICATION USING THE DISCRIMINANT MODEL:

by chance accuracy rate

The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.831² + 0.169² = 0.719).
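The arithmetic for the by chance accuracy rate can be checked with a short calculation. This is a minimal sketch that starts from the group sizes (64 and 13) rather than the rounded priors; the function name is hypothetical.

```python
def proportional_by_chance(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

rate = proportional_by_chance([64, 13])  # group sizes from the prior probabilities table
criterion = 1.25 * rate                  # accuracy must be 25% higher than chance

print(round(rate, 3), round(criterion, 3))
```

Because the criterion is a simple multiple of the chance rate, a model for groups this unbalanced has to classify almost 90% of cases correctly to be judged useful.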

Page 80: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 80

Classification Results (b,c)

                                               Predicted Group Membership
                          ABDEFECT             1        2        Total
Original             Count  1                  72       0        72
                            2                  15       0        15
                            Ungrouped cases    48       0        48
                     %      1                  100.0    .0       100.0
                            2                  100.0    .0       100.0
                            Ungrouped cases    100.0    .0       100.0
Cross-validated (a)  Count  1                  72       0        72
                            2                  15       0        15
                     %      1                  100.0    .0       100.0
                            2                  100.0    .0       100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 82.8% of original grouped cases correctly classified.
c. 82.8% of cross-validated grouped cases correctly classified.

CLASSIFICATION USING THE DISCRIMINANT MODEL:

criteria for classification accuracy

The cross-validated accuracy rate computed by SPSS was 82.8% which was less than the proportional by chance accuracy criterion of 89.9% (1.25 x 71.9% = 89.9%).

The criterion for classification accuracy is not satisfied.
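The 82.8% figure can be recovered from the cross-validated counts in the classification table, and then compared with the 89.9% criterion. The sketch below assumes the table is stored as a nested dictionary of actual group by predicted group; that layout and the function name are illustrative choices, not SPSS output formats.

```python
def accuracy_rate(confusion):
    """Share of cases whose predicted group equals their actual group."""
    correct = sum(confusion[group][group] for group in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

# Cross-validated counts: actual group -> predicted group -> number of cases
cross_validated = {1: {1: 72, 2: 0}, 2: {1: 15, 2: 0}}

rate = accuracy_rate(cross_validated)
criterion = 0.899  # 1.25 x the proportional by chance accuracy rate
meets_criterion = rate >= criterion

print(round(rate, 3), meets_criterion)
```

Every case was predicted to be in group 1, so the accuracy rate is simply the share of cases that actually belong to group 1.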

Page 81: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 81

Answering the question in problem 2

From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer.

Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

We found one statistically significant discriminant function, making it possible to distinguish between the two groups defined by the dependent variable.

However, the cross-validated classification accuracy was not 25% greater than the by chance accuracy rate, failing to support the utility of the model.

The answer to the question is false.

Page 82: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 82

Problem 3

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Page 83: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 83

Dissecting problem 3 - 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

When a problem asks us to identify the best or most useful predictors from a list of independent variables, we do stepwise discriminant analysis.

The variables listed first in the problem statement are the independent variables (IVs): "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98].

The variable used to define groups is the dependent variable (DV): "opinion about spending on welfare" [natfare].

Page 84: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 84

Dissecting problem 3 - 2

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical relationship.

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

The problem identifies three groups for the dependent variable:

• survey respondents who thought we spend too much money on welfare
• survey respondents who thought we spend about the right amount of money on welfare
• survey respondents who thought we spend too little money on welfare.

To distinguish among three groups, the analysis will be required to find two statistically significant discriminant functions.

The importance of predictors is based upon the stepwise addition of variables to the analysis.

Page 85: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 85

Dissecting problem 3 - 3

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for hours worked in the past week will be lower for respondents who think we spend the right amount of money versus respondents who think we spend too much or too little.

In a stepwise analysis, we only interpret the independent variables that are entered in the stepwise analysis.

In order for the statement in a stepwise problem to be true, we must have enough statistically significant functions to distinguish among the groups, the order of entry must be correct, and each significant relationship must be interpreted correctly.

Page 86: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 86

LEVEL OF MEASUREMENT - 1

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous. "Opinion about spending on welfare" [natfare] is an ordinal level variable, which satisfies the level of measurement requirement.

It contains three categories: survey respondents who thought we spend too much money on welfare, survey respondents who thought we spend about the right amount of money on welfare, and survey respondents who thought we spend too little money on welfare.

Page 87: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 87

LEVEL OF MEASUREMENT - 2

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

"Income" [rincom98] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

"Number of hours worked in the past week" [hrs1] and "highest year of school completed" [educ] are interval level variables, which satisfy the level of measurement requirements for discriminant analysis.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal variable which may be included in discriminant analysis.

Page 88: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 88

The stepwise discriminant analysis

To answer the question, we do a stepwise discriminant analysis with natfare as the dependent variable and hrs1, wrkslf, educ, and rincom98 as the independent variables.

Select the Classify | Discriminant… command from the Analyze menu.

Page 89: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 89

Selecting the dependent variable

First, highlight the dependent variable natfare in the list of variables.

Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.

Page 90: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 90

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range… button.

Page 91: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 91

Completing the range of group values

The value labels for natfare show three categories:

1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH

The range of values that we need to enter goes from 1 as the minimum to 3 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 3 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

Note: if we enter the wrong range of group numbers, e.g., 1 to 2 instead of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.

Page 92: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 92

Specifying the method for including variables

SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.

Since the problem calls for identifying the best predictors, we click on the option button to Use stepwise method.

Page 93: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 93

Requesting statistics for the output

Click on the Statistics… button to select statistics we will need for the analysis.

Page 94: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 94

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.

Third, mark the Box’s M checkbox. Box’s M statistic evaluates conformity to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

Page 95: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 95

Specifying details for the stepwise method

Click on the Method… button to specify the specific statistical criteria to use for including variables.

Page 96: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 96

Details for the stepwise method

First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table when a new variable is added.

Third, click on the option button Use probability of F so that we can incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The Removal value is twice as large as the entry value.

Fifth, click on the Continue button to close the dialog box.

Page 97: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 97

Specifying details for classification

Click on the Classify… button to specify details for the classification phase of the analysis.

Page 98: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 98

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.

Page 99: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 99

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS to include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.
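The hold-one-out logic described here can be sketched in a few lines. The example below is not the SPSS procedure: as a stand-in for the discriminant functions it assigns each case to the group with the nearest mean on a single predictor, and the data are made-up toy values. Only the loop structure (derive the rule without the case, then classify the case) mirrors the description above.

```python
def classify(x, training):
    """Assign x to the group whose mean in the training cases is closest."""
    means = {}
    for group in {g for _, g in training}:
        values = [v for v, g in training if g == group]
        means[group] = sum(values) / len(values)
    return min(means, key=lambda g: abs(x - means[g]))

def leave_one_out_accuracy(cases):
    """Classify each case with a rule derived from all the other cases."""
    correct = 0
    for i, (x, group) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]  # hold out case i
        if classify(x, training) == group:
            correct += 1
    return correct / len(cases)

# Hypothetical toy data: (predictor value, group label)
cases = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (7, "b"), (8, "b")]
print(leave_one_out_accuracy(cases))
```

In this toy data set the case with value 4 is misclassified when it is held out, so five of the six cases are classified correctly.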

Page 100: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 100

Details for classification - 3

Fifth, accept the default of Within-groups option button on the Use Covariance Matrix panel. The Covariance matrices are the measure of the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box’s M), our option is to use Separate-groups covariance in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.

Seventh, click on the Continue button to close the dialog box.

Page 101: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 101

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant analysis.

Page 102: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 102

Analysis Case Processing Summary

Unweighted Cases                                                    N     Percent
Valid                                                               138   51.1
Excluded  Missing or out-of-range group codes                       7     2.6
          At least one missing discriminating variable              115   42.6
          Both missing or out-of-range group codes and
          at least one missing discriminating variable              10    3.7
          Total                                                     132   48.9
Total                                                               270   100.0

SAMPLE SIZE - 1

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 138 valid cases and 4 independent variables.

The ratio of cases to independent variables is 34.5 to 1, which satisfies the minimum requirement. In addition, the ratio of 34.5 to 1 satisfies the preferred ratio of 20 to 1.

Page 103: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 103

Prior Probabilities for Groups

                            Cases Used in Analysis
WELFARE         Prior       Unweighted  Weighted
1 TOO LITTLE     .409       56          56.000
2 ABOUT RIGHT    .358       49          49.000
3 TOO MUCH       .234       32          32.000
Total           1.000       137         137.000

SAMPLE SIZE - 2

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contain 20 or more cases.

The number of cases in the smallest group in this problem is 32, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.
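Both sample size checks (the ratio of cases to independent variables, and the size of the smallest group) can be expressed as a small helper. This is a sketch of the rules as stated on these slides; the function and key names are hypothetical.

```python
def check_sample_size(n_valid, n_predictors, smallest_group):
    """Apply the minimum and preferred sample size rules for discriminant analysis."""
    ratio = n_valid / n_predictors
    return {
        "ratio": ratio,
        "ratio_minimum_met": ratio >= 5,      # at least 5 cases per independent variable
        "ratio_preferred_met": ratio >= 20,   # preferably 20 cases per independent variable
        "group_minimum_met": smallest_group > n_predictors,
        "group_preferred_met": smallest_group >= 20,
    }

# 138 valid cases, 4 independent variables, smallest group of 32 cases
result = check_sample_size(138, 4, 32)
print(result)
```

A problem with a smallest group of 13 cases and 3 independent variables would pass the group minimum but not the preferred minimum, which is the situation that calls for a caution in the interpretation.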

Page 104: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 104

NUMBER OF DISCRIMINANT FUNCTIONS - 1

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.

In this analysis there were 3 groups defined by opinion about spending on welfare and 4 independent variables, so the maximum possible number of discriminant functions was 2.

Page 105: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 105

NUMBER OF DISCRIMINANT FUNCTIONS - 2

In the table of Wilks' Lambda which tested functions for statistical significance, the stepwise analysis identified 2 discriminant functions that were statistically significant. The Wilks' lambda statistic for the test of functions 1 through 2 (chi-square=21.853) had a probability of 0.001 which was less than or equal to the level of significance of 0.05.

After removing function 1, the Wilks' lambda statistic for the test of function 2 (chi-square=7.074) had a probability of 0.029 which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 2 discriminant functions.
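The probabilities reported with these chi-square statistics can be reproduced with a chi-square upper-tail calculation. The slides do not report degrees of freedom, so the values used below (6 for the test of functions 1 through 2, and 2 for function 2 alone) are assumptions, chosen because they reproduce the reported probabilities. The function is a plain standard-library sketch for integer degrees of freedom, not the routine SPSS uses.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with integer df exceeds x), built up by recurrence on df."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # exact upper tail for df = 1
    else:
        q, k = math.exp(-half), 2              # exact upper tail for df = 2
    while k < df:
        # Step from k to k + 2 degrees of freedom
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

p_functions_1_to_2 = chi2_upper_tail(21.853, 6)  # df = 6 is an assumption
p_function_2 = chi2_upper_tail(7.074, 2)         # df = 2 is an assumption

print(round(p_functions_1_to_2, 3), round(p_function_2, 3))
```

Both probabilities fall below 0.05, matching the conclusion that two functions are statistically significant.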

Page 106: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 106

Independent variables and group membership:

relationship of functions to groups

Functions at Group Centroids

           Function
WELFARE    1        2
1          -.220     .235
2           .446    -.031
3          -.311    -.362

Unstandardized canonical discriminant functions evaluated at group means

In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or too little money (negative value of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.
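One way to see how the two functions work together is to place a case in the two-dimensional space of discriminant scores and find the closest group centroid. The sketch below uses the centroid values from the Functions at Group Centroids output for this problem. It is a simplification: it ignores prior probabilities and group dispersion, which SPSS takes into account when classifying, and the score pairs passed in are made-up examples.

```python
import math

# Group centroids as (function 1, function 2)
CENTROIDS = {
    "too little": (-0.220, 0.235),
    "about right": (0.446, -0.031),
    "too much": (-0.311, -0.362),
}

def nearest_centroid(scores, centroids=CENTROIDS):
    """Return the group whose centroid is closest to a pair of discriminant scores."""
    return min(centroids, key=lambda group: math.dist(scores, centroids[group]))

# A hypothetical case with a high score on function 1
print(nearest_centroid((0.4, 0.0)))
```

A high score on function 1 points to the "about right" group, while the sign of the function 2 score tells the "too little" and "too much" groups apart.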

Page 107: DiscriminantAnalysis_BasicRelationships

SW388R7Data Analysis

& Computers II

Slide 107

Variables Entered/Removed (a,b,c,d)

                                            Min. D Squared               Exact F
Step  Entered                               Statistic  Between Groups    Statistic  df1  df2      Sig.
1     NUMBER OF HOURS WORKED LAST WEEK      .023       1 and 3            .475      1    135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY      .251       1 and 2           3.289      2    134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED      .364       1 and 3           2.433      3    133.000  .068

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.

Independent variables and group membership:

which predictors to interpret

When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed.

We will interpret the impact of the following independent variables on membership in the groups defined by the dependent variable:

• number of hours worked in the past week
• self-employment
• highest year of school completed

Had we used simultaneous entry of all variables, we would not have imposed this limitation.

Slide 108

Structure Matrix

Variable                               Function 1    Function 2
HIGHEST YEAR OF SCHOOL COMPLETED         .687*         .136
NUMBER OF HOURS WORKED LAST WEEK        -.582*         .345
R SELF-EMP OR WORKS FOR SOMEBODY         .223          .889*
RESPONDENTS INCOME (a)                   .101          .292*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

*. Largest absolute correlation between each variable and any discriminant function

a. This variable not used in the analysis.

Independent variables and group membership:

predictor loadings on functions

Discriminant function 1 distinguished survey respondents who thought we spend about the right amount of money on welfare from survey respondents who thought we spend too much or too little money on welfare. Based on the structure matrix, the predictor variables strongly associated with function 1 were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).

Discriminant function 2 distinguished survey respondents who thought we spend too little money on welfare from survey respondents who thought we spend too much money on welfare. Based on the structure matrix, the predictor variable strongly associated with function 2 was self-employment (r=0.889).

We do not interpret loadings in the structure matrix unless they are 0.30 or higher in absolute value.
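The rule for reading the structure matrix can be sketched as follows: assign each variable to the function on which it has its largest absolute loading, then keep it only if that loading reaches the 0.30 cutoff. This is an illustrative sketch rather than SPSS's own procedure; the function name `assign` and the shortened variable labels are hypothetical.

```python
# Loadings from the Structure Matrix: (function 1, function 2).
loadings = {
    "highest year of school completed": (0.687, 0.136),
    "number of hours worked last week": (-0.582, 0.345),
    "self-employment": (0.223, 0.889),
    "respondent's income (not used in the analysis)": (0.101, 0.292),
}
THRESHOLD = 0.30  # cutoff stated on the slide, applied to absolute values

def assign(loadings, threshold=THRESHOLD):
    """Map each variable to (function number, loading) for its largest
    absolute loading, keeping only loadings at or above the threshold."""
    result = {}
    for name, values in loadings.items():
        idx = max(range(len(values)), key=lambda i: abs(values[i]))
        if abs(values[idx]) >= threshold:
            result[name] = (idx + 1, values[idx])
    return result

interpreted = assign(loadings)
for name, (function, r) in interpreted.items():
    print(f"function {function}: {name} (r={r:+.3f})")
```

Income drops out under this rule because its largest loading (0.292) is below the cutoff, in addition to its not having been entered in the stepwise analysis.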

Slide 109

Group Statistics

                                                                              Valid N (listwise)
WELFARE          Variable                            Mean     Std. Deviation  Unweighted  Weighted
1 TOO LITTLE     NUMBER OF HOURS WORKED LAST WEEK    43.96    13.240          56          56.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    13.73    2.401           56          56.000
                 R SELF-EMP OR WORKS FOR SOMEBODY    1.93     .260            56          56.000
                 RESPONDENTS INCOME                  13.70    5.034           56          56.000
2 ABOUT RIGHT    NUMBER OF HOURS WORKED LAST WEEK    37.90    13.235          50          50.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    14.78    2.558           50          50.000
                 R SELF-EMP OR WORKS FOR SOMEBODY    1.90     .303            50          50.000
                 RESPONDENTS INCOME                  14.00    5.503           50          50.000
3 TOO MUCH       NUMBER OF HOURS WORKED LAST WEEK    42.03    10.456          32          32.000
                 HIGHEST YEAR OF SCHOOL COMPLETED    13.38    2.524           32          32.000
                 R SELF-EMP OR WORKS FOR SOMEBODY    1.75     .440            32          32.000
                 RESPONDENTS INCOME                  14.75    5.304           32          32.000
Total            NUMBER OF HOURS WORKED LAST WEEK    41.32    12.846          138         138.000

Independent variables and group membership:

predictors associated with first function - 1

The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average for survey respondents who thought we spend too little money on welfare (mean=43.96) and for survey respondents who thought we spend too much money on welfare (mean=42.03).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too little or too much money on welfare."

Slide 110

Independent variables and group membership:

predictors associated with first function - 2

The average highest year of school completed for survey respondents who thought we spend about the right amount of money on welfare (mean=14.78) was higher than the average for survey respondents who thought we spend too little money on welfare (mean=13.73) and for survey respondents who thought we spend too much money on welfare (mean=13.38).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too little or too much money on welfare."

Slide 111

Independent variables and group membership:

predictors associated with second function

Since self-employment is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding, in which 1 corresponds to self-employed and 2 corresponds to working for someone else. The lower mean for survey respondents who thought we spend too much money on welfare (mean=1.75), when compared to the mean for survey respondents who thought we spend too little money on welfare (mean=1.93), implies that the "too much" group contained a larger share of survey respondents who were self-employed and a smaller share who were working for someone else.

This supports the relationship that "survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare."
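The mean of a variable coded 1 and 2 can be converted into a share, which makes the comparison easier to read. A minimal sketch, assuming the variable takes only the values 1 (self-employed) and 2 (working for someone else) as described above; the function name `share_coded_one` is hypothetical, and the resulting shares are only as precise as the rounded means in the table.

```python
def share_coded_one(mean: float) -> float:
    """Share of cases coded 1 when a variable takes only the values 1 and 2.

    mean = 1*p + 2*(1 - p) = 2 - p, so p = 2 - mean.
    """
    if not 1.0 <= mean <= 2.0:
        raise ValueError("mean of a 1/2-coded variable must lie in [1, 2]")
    return 2.0 - mean

# Self-employment means from the Group Statistics table.
group_means = {"1 TOO LITTLE": 1.93, "2 ABOUT RIGHT": 1.90, "3 TOO MUCH": 1.75}
for group, mean in group_means.items():
    print(f"{group}: about {share_coded_one(mean):.0%} self-employed")
```

Read this way, the means imply roughly 25% self-employed respondents in the "too much" group against roughly 7% in the "too little" group.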

Slide 112

Prior Probabilities for Groups

                           Cases Used in Analysis
WELFARE          Prior     Unweighted   Weighted
1 TOO LITTLE     .406      56           56.000
2 ABOUT RIGHT    .362      50           50.000
3 TOO MUCH       .232      32           32.000
Total            1.000     138          138.000

CLASSIFICATION USING THE DISCRIMINANT MODEL:

by chance accuracy rate

The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate, that is, at least 1.25 times that rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350).
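The same arithmetic can be reproduced from the group sizes in the table of prior probabilities. A minimal sketch; the function name `proportional_chance_rate` is hypothetical.

```python
# Cases used in the analysis, from the Prior Probabilities for Groups table.
group_sizes = {"1 TOO LITTLE": 56, "2 ABOUT RIGHT": 50, "3 TOO MUCH": 32}

def proportional_chance_rate(sizes: dict[str, int]) -> float:
    """Sum of squared group proportions."""
    total = sum(sizes.values())
    return sum((n / total) ** 2 for n in sizes.values())

rate = proportional_chance_rate(group_sizes)
print(f"proportional by chance accuracy rate: {rate:.3f}")
```

Working from the raw counts rather than the rounded priors gives the same rate to three decimal places.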

Slide 113

Classification Results (b,c)

                                             Predicted Group Membership
                          WELFARE            1 TOO LITTLE  2 ABOUT RIGHT  3 TOO MUCH   Total
Original           Count  1 TOO LITTLE       43            15             6            64
                          2 ABOUT RIGHT      26            30             6            62
                          3 TOO MUCH         17            10             9            36
                          Ungrouped cases    3             3              2            8
                   %      1 TOO LITTLE       67.2          23.4           9.4          100.0
                          2 ABOUT RIGHT      41.9          48.4           9.7          100.0
                          3 TOO MUCH         47.2          27.8           25.0         100.0
                          Ungrouped cases    37.5          37.5           25.0         100.0
Cross-validated (a) Count 1 TOO LITTLE       43            15             6            64
                          2 ABOUT RIGHT      26            30             6            62
                          3 TOO MUCH         17            11             8            36
                   %      1 TOO LITTLE       67.2          23.4           9.4          100.0
                          2 ABOUT RIGHT      41.9          48.4           9.7          100.0
                          3 TOO MUCH         47.2          30.6           22.2         100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

b. 50.6% of original grouped cases correctly classified.

c. 50.0% of cross-validated grouped cases correctly classified.

CLASSIFICATION USING THE DISCRIMINANT MODEL:

criteria for classification accuracy

The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criterion of 43.7% (1.25 x 35.0% = 43.7%). The criterion for classification accuracy is satisfied.
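The comparison can be reproduced from the cross-validated counts in the Classification Results table and the prior probabilities. A minimal sketch; the variable names are hypothetical, and the criterion is computed from the unrounded chance rate, which is why it comes out as 43.7% rather than 43.8%.

```python
# Cross-validated counts: rows are actual groups, columns are predicted
# groups, both in the order TOO LITTLE, ABOUT RIGHT, TOO MUCH.
cross_validated = [
    [43, 15, 6],
    [26, 30, 6],
    [17, 11, 8],
]
priors = [0.406, 0.362, 0.232]  # from Prior Probabilities for Groups

# Correct classifications sit on the diagonal of the table.
correct = sum(cross_validated[i][i] for i in range(len(cross_validated)))
total = sum(sum(row) for row in cross_validated)
accuracy = correct / total

chance = sum(p * p for p in priors)  # proportional by chance accuracy rate
criterion = 1.25 * chance            # 25% improvement over chance
meets_criterion = accuracy >= criterion

print(f"accuracy={accuracy:.1%} criterion={criterion:.1%} ok={meets_criterion}")
```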

Slide 114

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.

Answering the question in problem 3 - 1

The stepwise discriminant analysis included the three variables identified as the most useful predictors.

Slide 115

Answering the question in problem 3 - 2

We found two statistically significant discriminant functions, making it possible to distinguish among the three groups defined by the dependent variable.

Moreover, the cross-validated classification accuracy surpassed the by chance accuracy criterion, supporting the utility of the model.

Slide 116

Answering the question in problem 3 - 3

The order of importance matched the order of entry in the table of "Variables Entered/Removed."

Slide 117

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Answering the question in problem 3 - 4

We verified that each statement about the relationship between predictors and groups was correct.

The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.

Slide 118

Steps in discriminant analysis: level of measurement and initial sample size

The following is a guide to the decision process for answering problems about the basic relationships in discriminant analysis:

1. Dependent variable non-metric? Independent variables metric or dichotomous?
   No: Inappropriate application of a statistic.
   Yes: go to the next check.
2. Ratio of cases to independent variables at least 5 to 1?
   No: Inappropriate application of a statistic.
   Yes: go to the next check.
3. Number of cases in smallest group greater than number of independent variables?
   No: Inappropriate application of a statistic.
   Yes: continue.

Slide 119

Steps in discriminant analysis: usable discriminant model

1. Run discriminant analysis, using method for including variables identified in the research question.
2. Sufficient statistically significant functions to distinguish DV groups?
   No: False.
   Yes: continue.

Slide 120

Steps in discriminant analysis: relationships between IV's and DV

1. Stepwise method of entry used to include independent variables?
   No: go to check 3.
   Yes: go to check 2.
2. Entry order of variables interpreted correctly?
   No: False.
   Yes: go to check 3.
3. Relationships between individual IVs and DV groups interpreted correctly?
   No: False.
   Yes: continue.

Slide 121

Steps in discriminant analysis: classification accuracy

1. Cross-validated accuracy is 25% higher than proportional by chance accuracy rate?
   No: False.
   Yes: continue.

Slide 122

Steps in discriminant analysis: adding cautions to solution

1. DV is non-metric level and IVs are interval level or dichotomous (not ordinal)?
   No: True with caution.
   Yes: go to the next check.
2. Satisfies preferred ratio of cases to IV's of 20 to 1?
   No: True with caution.
   Yes: go to the next check.
3. Satisfies preferred DV group minimum size of 20 cases?
   No: True with caution.
   Yes: True.
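The whole decision process can be condensed into one function. This is a sketch of one reading of the flowcharts on the preceding slides, not an official scoring rule; the function name `answer` and all parameter names are hypothetical, and each parameter stands for the yes/no outcome of one check.

```python
def answer(
    *,
    levels_ok: bool,                   # DV non-metric, IVs metric or dichotomous
    ratio_at_least_5_to_1: bool,       # cases per independent variable
    smallest_group_exceeds_ivs: bool,  # smallest group size > number of IVs
    enough_significant_functions: bool,
    stepwise: bool,
    entry_order_correct: bool,         # only consulted when stepwise is True
    relationships_correct: bool,
    accuracy_meets_criterion: bool,    # cross-validated >= 1.25 x chance rate
    no_ordinal_ivs: bool,
    ratio_at_least_20_to_1: bool,
    smallest_group_at_least_20: bool,
) -> str:
    """Walk the checks in slide order and return the first verdict reached."""
    if not (levels_ok and ratio_at_least_5_to_1 and smallest_group_exceeds_ivs):
        return "Inappropriate application of a statistic"
    if not enough_significant_functions:
        return "False"
    if stepwise and not entry_order_correct:
        return "False"
    if not relationships_correct or not accuracy_meets_criterion:
        return "False"
    if not (no_ordinal_ivs and ratio_at_least_20_to_1
            and smallest_group_at_least_20):
        return "True with caution"
    return "True"

# Problem 3 as worked above: every check passes, but ordinal-level
# variables are included among the predictors.
problem_3 = answer(
    levels_ok=True, ratio_at_least_5_to_1=True,
    smallest_group_exceeds_ivs=True, enough_significant_functions=True,
    stepwise=True, entry_order_correct=True, relationships_correct=True,
    accuracy_meets_criterion=True, no_ordinal_ivs=False,
    ratio_at_least_20_to_1=True, smallest_group_at_least_20=True,
)
print(problem_3)
```

Returning at the first failed check mirrors the flowcharts, where a failure at any stage ends the evaluation with that stage's verdict.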