principal component analysis

69
SW388R7 Data Analysis & Computers II Slide 1 Principal component analysis Principal component analysis Strategy for solving problems Sample problems Steps in principal component analysis

Upload: kaja-kuhar

Post on 21-May-2015

470 views

Category:

Data & Analytics


0 download

DESCRIPTION

Guide for principal component analysis (SPSS)

TRANSCRIPT

Page 1: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 1

Principal component analysis

Principal component analysis

Strategy for solving problems

Sample problems

Steps in principal component analysis

Page 2: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 2

Principal components factor analysis

Obtaining a factor solution through principal components analysis is an iterative process that usually requires repeating the SPSS factor analysis procedure a number of times to reach a satisfactory solution.

We begin by identifying a group of variables whose variance we believe can be represented more parsimoniously by a smaller set of factors, or components. The end result of the principal components analysis will tell us which variables can be represented by which components, and which variables should be retained as individual variables because the factor solution does not adequately represent their information.

Page 3: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 3

Strategy for solving problems - 1

A principal component factor analysis requires: The variables included must be metric level or

dichotomous (dummy-coded) nominal level The sample size must be greater than 50 (preferably

100) The ratio of cases to variables must be 5 to 1 or larger The correlation matrix for the variables must contain 2 or

more correlations of 0.30 or greater Variables with measures of sampling adequacy less than

0.50 must be removed The overall measure of sampling adequacy is 0.50 or

higher The Bartlett test of sphericity is statistically significant.

The first phase of a principal component analysis is devoted to verifying that we meet these requirements. If we do not meet these requirements, factor analysis is not appropriate.

Page 4: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 4

Strategy for solving problems - 2

The second phase of a principal component factor analysis focuses on deriving a factor model, or pattern of relationships between variables and components, that satisfies the following requirements: The derived components explain 50% or more of the

variance in each of the variables, i.e. have a communality greater than 0.50

None of the variables have loadings, or correlations, of 0.40 or higher for more than one component, i.e. do not have complex structure

None of the components has only one variable in it

To meet these requirements, we remove problematic variables from the analysis and repeat the principal component analysis.

Page 5: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 5

Strategy for solving problems - 3

If, at the conclusion of this process, we have components that have more than one variable loading on them, have components that explain at least 50% of the variance in the included variables, and have components that collectively explain more than 60% of the variance in the set of variables, we can substitute the components for the variables in further analyses.

Variables that were removed in the analysis should be included individually in further analyses.

Substitution of components for individual variables is accomplished by using only the highest loading variable, or by combining the variables loading on each component to create a new variable.

Page 6: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 6

Notes - 1

When evaluating measures of sampling adequacy, communalities, or factor loadings, we ignore the sign of the numeric value and base our decision on the size or magnitude of the value.

The sign of the number indicates the direction of the relationship.

A loading of -0.732 is just as strong as a loading of 0.732. The minus sign indicates an inverse or negative relationship; the absence of a sign is meant to imply a plus sign indicating a direct or positive relationship.

Page 7: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 7

Notes - 2

If there are two or more components in the component matrix, the pattern of loadings is based on the SPSS Rotated Component Matrix. If there is only one component in the solution, the Rotated Component Matrix is not computed, and the pattern of loadings is based on the Component Matrix.

It is possible that the analysis will break down and we will have too few variables in the analysis to support the use of principal component analysis.

Page 8: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 8

Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Answer the question based on the results of a principal component analysis prior to testing for outliers, split sample validation, and a test of reliability. Assume that there is no problematic pattern of missing data. Use a level of significance of 0.05.

Based on the results of a principal component analysis of the 8 variables "claims about environmental threats are exaggerated" [grnexagg], "danger to the environment from modifying genes in crops" [genegen], "America doing enough to protect environment" [amprogrn], "should be international agreements for environment problems" [grnintl], "poorer countries should be expected to do less for the environment" [ldcgrn], "economic progress in America will slow down without more concern for environment" [econgrn], "likelihood of nuclear power station damaging environment in next 5 years" [nukeacc], and "respondent's socioeconomic index" [sei], the information in these variables can be represented with 2 components and 4 individual variables.

Component 1 includes the variables "danger to the environment from modifying genes in crops" [genegen] and "likelihood of nuclear power station damaging environment in next 5 years" [nukeacc]. Component 2 includes the variables "claims about environmental threats are exaggerated" [grnexagg] and "poorer countries should be expected to do less for the environment" [ldcgrn]. The variables "respondent's socioeconomic index" [sei], "economic progress in America will slow down without more concern for environment" [econgrn], "should be international agreements for environment problems" [grnintl], and "America doing enough to protect environment" [amprogrn] were not included on the components and are retained as individual variables.

1. True2. True with caution3. False4. Inappropriate application of a statistic

Page 9: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 9

Computing a principal component analysis

To compute a principal component analysis in SPSS, select the Data Reduction | Factor… command from the Analyze menu.

Page 10: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 10

Add the variables to the analysis

First, move the variables listed in the problem to the Variables list box.

Second, click on the Descriptives… button to specify statistics to include in the output.

Page 11: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 11

Compete the descriptives dialog box

First, mark the Univariate descriptives checkbox to get a tally of valid cases.

Third, mark the Coefficients checkbox to get a correlation matrix, one of the outputs needed to assess the appropriateness of factor analysis for the variables.

Second, keep the Initial solution checkbox to get the statistics needed to determine the number of factors to extract.

Fourth, mark the KMO and Bartlett’s test of sphericity checkbox to get more outputs used to assess the appropriateness of factor analysis for the variables.

Fifth, mark the Anti-image checkbox to get more outputs used to assess the appropriateness of factor analysis for the variables.

Sixth, click on the Continue button.

Page 12: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 12

Select the extraction method

First, click on the Extraction… button to specify statistics to include in the output.

The extraction method refers to the mathematical method that SPSS uses to compute the factors or components.

Page 13: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 13

Compete the extraction dialog box

First, retain the default method Principal components.

Second, click on the Continue button.

Page 14: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 14

Select the rotation method

First, click on the Rotation… button to specify statistics to include in the output.

The rotation method refers to the mathematical method that SPSS rotate the axes in geometric space. This makes it easier to determine which variables are loaded on which components.

Page 15: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 15

Compete the rotation dialog box

First, mark the Varimax method as the type of rotation to used in the analysis.

Second, click on the Continue button.

Page 16: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 16

Complete the request for the analysis

First, click on the OK button to request the output.

Page 17: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 17

Level of measurement requirement

"Respondent's socioeconomic index" [sei] is an interval level variable, which satisfy the level of measurement requirements for a principal component analysis.

"Claims about environmental threats are exaggerated" [grnexagg], "danger to the environment from modifying genes in crops" [genegen], "America doing enough to protect environment" [amprogrn], "should be international agreements for environment problems" [grnintl], "poorer countries should be expected to do less for the environment" [ldcgrn], "economic progress in America will slow down without more concern for environment" [econgrn], and "likelihood of nuclear power station damaging environment in next 5 years" [nukeacc] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for principal component analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

Page 18: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 18

Descriptive Statistics

3.31 .988 67

3.09 .949 67

2.40 .605 67

1.97 .674 67

3.73 .898 67

2.69 .839 67

2.49 .894 67

49.757 18.9651 67

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTS FORENVIRONMENTPROBLEMS

POOR COUNTRIES LESSTHAN RICH FORENVIRONMENT

ECONOMIC PROGRESSDEPENDENT ONENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

RESPONDENT'SSOCIOECONOMIC INDEX

Mean Std. Deviation Analysis N

Sample size requirement:minimum number of cases

The number of valid cases for this set of variables is 67.

While principal component analysis can be conducted on a sample that has fewer than 100 cases, but more than 50 cases, we should be cautious about its interpretation.

Page 19: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 19

Descriptive Statistics

3.31 .988 67

3.09 .949 67

2.40 .605 67

1.97 .674 67

3.73 .898 67

2.69 .839 67

2.49 .894 67

49.757 18.9651 67

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTS FORENVIRONMENTPROBLEMS

POOR COUNTRIES LESSTHAN RICH FORENVIRONMENT

ECONOMIC PROGRESSDEPENDENT ONENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

RESPONDENT'SSOCIOECONOMIC INDEX

Mean Std. Deviation Analysis N

Sample size requirement:ratio of cases to variables

The ratio of cases to variables in a principal component analysis should be at least 5 to 1.

With 67 and 8 variables, the ratio of cases to variables is 8.4 to 1, which exceeds the requirement for the ratio of cases to variables.

Page 20: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 20

Correlation Matrix

1.000 -.240 .394 -.305 .301 -.117 -.126 .069

-.240 1.000 -.301 .146 -.149 .340 .394 .208

.394 -.301 1.000 -.305 .258 -.136 -.401 .054

-.305 .146 -.305 1.000 -.289 .117 .151 .094

.301 -.149 .258 -.289 1.000 -.194 -.003 .114

-.117 .340 -.136 .117 -.194 1.000 .310 -.018

-.126 .394 -.401 .151 -.003 .310 1.000 .341

.069 .208 .054 .094 .114 -.018 .341 1.000

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTS FORENVIRONMENTPROBLEMS

POOR COUNTRIES LESSTHAN RICH FORENVIRONMENT

ECONOMIC PROGRESSDEPENDENT ONENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

RESPONDENT'SSOCIOECONOMIC INDEX

Correlation

ENVIRONMENTAL

THREATSEXAGGER

ATED

HOWDANGEROUSMODIFYINGGENES IN

CROPS

AMERICANDOING

ENOUGH TOPROTECT

ENVIRONMENT

INTLAGREEMEN

TS FORENVIRONME

NTPROBLEMS

POORCOUNTRIESLESS THANRICH FOR

ENVIRONMENT

ECONOMICPROGRESSDEPENDEN

T ONENVIRONME

NT

LIKELIHOODOF NUCLEARMELTDOWNIN 5 YEARS

RESPONDENT'S

SOCIOECONOMIC INDEX

Appropriateness of factor analysis:Presence of substantial correlations

Principal components analysis requires that there be some correlations greater than 0.30 between the variables included in the analysis.

For this set of variables, there are 10 correlations in the matrix greater than 0.30, satisfying this requirement. The correlations greater than 0.30 are highlighted in yellow.

Page 21: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 21

Appropriateness of factor analysis:Sampling adequacy of individual

variables

Principal component analysis requires that the Kaiser-Meyer-Olkin Measure of Sampling Adequacy be greater than 0.50 for each individual variable as well as the set of variables.

The Measure of Sampling Adequacy (MSA) is described at marvelous if it is 0.90 or greater, meritorious if it is in the 0.80's, middling if in the 0.70's, mediocre if in the in the 0.60's , miserable if in the 0.50's, and unacceptable if below 0.50.

There are two anti-image matrices: the anti-image covariance matrix and the anti-image correlation matrix. We are interested in the anti-image correlation matrix.

Page 22: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 22

Appropriateness of factor analysis:Sampling adequacy of individual

variables

On iteration 1, the MSA for the variable "respondent's socioeconomic index" [sei] was 0.410 which was less than 0.50, so it was removed from the analysis.

SPSS locates the Measures of Sampling Adequacy are on the diagonal of the anti-image correlation matrix, highlighted in yellow.

Page 23: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 23

Excluding a variable from the factor analysis

To remove the variable "respondent's socioeconomic index" [sei] from the analysis, click on the Dialog Recall tool button to access the drop down menu.

Page 24: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 24

Repeating the factor analysis

In the drop down menu, select Factor Analysis to reopen the factor analysis dialog box.

Page 25: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 25

Removing the variable from the list of variables

First, highlight the sei variable.

Second, click on the left arrow button to remove the variable from the Variables list box.

Page 26: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 26

Replicating the factor analysis

The dialog recall command opens the dialog box with all of the settings that we had selected the last time we used factor analysis.

To replicate the analysis without the variable that we just removed, click on the OK button.

Page 27: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 27

Appropriateness of factor analysis:Sample adequacy for revised factor

analysis

In the factor analysis with the sei variable removed, we see that the measures of sampling adequacy for the remaining variables are all greater than 0.50.

On iteration 2, the MSA for all of the individual variables still included in the analysis was greater than 0.5, supporting their retention in the analysis.

Page 28: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 28

KMO and Bartlett's Test

.734

69.350

21

.000

Kaiser-Meyer-Olkin Measure of SamplingAdequacy.

Approx. Chi-Square

df

Sig.

Bartlett's Test ofSphericity

Appropriateness of factor analysis:Sample adequacy for set of variables

In addition, the overall MSA for the set of variables included in the analysis was 0.734, which exceeds the minimum requirement of 0.50 for overall MSA.

The seven variables remaining in the analysis satisfy the criteria for appropriateness of factor analysis.

Page 29: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 29

KMO and Bartlett's Test

.642

84.006

28

.000

Kaiser-Meyer-Olkin Measure of SamplingAdequacy.

Approx. Chi-Square

df

Sig.

Bartlett's Test ofSphericity

Appropriateness of factor analysis:Bartlett test of sphericity

Principal component analysis requires that the probability associated with Bartlett's Test of Sphericity be less than the level of significance.

The probability associated with the Bartlett test is <0.001, which satisfies this requirement.

The next step is to determine the number of factors that should be included in the factor solution.

Page 30: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 30

Number of factors to extract:Latent root criterion

Total Variance Explained

2.461 35.164 35.164 2.461 35.164 35.164 1.849

1.229 17.557 52.721 1.229 17.557 52.721 1.842

.889 12.695 65.415

.722 10.313 75.728

.621 8.878 84.606

.572 8.174 92.780

.505 7.220 100.000

Component1

2

3

4

5

6

7

Total % of Variance Cumulative % Total % of Variance Cumulative % Total

Initial Eigenvalues Extraction Sums of Squared Loadings Rotation Sums of Squared Loadings

Extraction Method: Principal Component Analysis.

Using the output from iteration 2, there were 2 eigenvalues greater than 1.0.

The latent root criterion for number of factors to derive would indicate that there were 2 components to be extracted for these variables.

Page 31: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 31

Number of factors to extract: Percentage of variance criterion

Total Variance Explained

2.461 35.164 35.164 2.461 35.164 35.164 1.849

1.229 17.557 52.721 1.229 17.557 52.721 1.842

.889 12.695 65.415

.722 10.313 75.728

.621 8.878 84.606

.572 8.174 92.780

.505 7.220 100.000

Component1

2

3

4

5

6

7

Total % of Variance Cumulative % Total % of Variance Cumulative % Total

Initial Eigenvalues Extraction Sums of Squared Loadings Rotation Sums of Squared Loadings

Extraction Method: Principal Component Analysis. In contrast, the cumulative proportion of variance criteria would require 3 components to satisfy the criterion of explaining 60% or more of the total variance.

A 3 components solution would explain 65.415% of the total variance.Since the SPSS default is to extract

the number of components indicated by the latent root criterion, our initial factor solution was based on the extraction of 2 components.

Page 32: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 32

Communalities

1.000 .526

1.000 .582

1.000 .521

1.000 .491

1.000 .494

1.000 .437

1.000 .640

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTS FORENVIRONMENTPROBLEMS

POOR COUNTRIES LESSTHAN RICH FORENVIRONMENT

ECONOMIC PROGRESSDEPENDENT ONENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

Evaluating communalities

Communalities represent the proportion of the variance in the original variables that is accounted for by the factor solution.

The factor solution should explain at least half of each original variable's variance, so the communality value for each variable should be 0.50 or higher.

Page 33: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 33

Communalities

1.000 .526

1.000 .582

1.000 .521

1.000 .491

1.000 .494

1.000 .437

1.000 .640

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTS FORENVIRONMENTPROBLEMS

POOR COUNTRIES LESSTHAN RICH FORENVIRONMENT

ECONOMIC PROGRESSDEPENDENT ONENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

Communality requiring variable removal

On iteration 2, the communality for the variable "economic progress in America will slow down without more concern for environment" [econgrn] was 0.437 which was less than 0.50.

The variable was removed and the principal component analysis was computed again.

In this iteration, there are actually three variables that have communalities less than 0.50. The variable with the smallest communality is selected for removal.

Page 34: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 34

Repeating the factor analysis

In the drop down menu, select Factor Analysis to reopen the factor analysis dialog box.

Page 35: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 35

Removing the variable from the list of variables

First, highlight the econgrn variable.

Second, click on the left arrow button to remove the variable from the Variables list box.

Page 36: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 36

Replicating the factor analysis

The dialog recall command opens the dialog box with all of the settings that we had selected the last time we used factor analysis.

To replicate the analysis without the variable that we just removed, click on the OK button.

Page 37: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 37

Communalities

1.000 .517

1.000 .573

1.000 .576

1.000 .486

1.000 .573

1.000 .716

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

INTL AGREEMENTSFOR ENVIRONMENTPROBLEMS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

Communality requiring variable removal

On iteration 3, the communality for the variable "should be international agreements for environment problems" [grnintl] was 0.486 which was less than 0.50.

The variable was removed and the principal component analysis was computed again.

Page 38: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 38

Repeating the factor analysis

In the drop down menu, select Factor Analysis to reopen the factor analysis dialog box.

Page 39: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 39

Removing the variable from the list of variables

First, highlight the grnintl variable.

Second, click on the left arrow button to remove the variable from the Variables list box.

Page 40: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 40

Replicating the factor analysis

The dialog recall command opens the dialog box with all of the settings that we had selected the last time we used factor analysis.

To replicate the analysis without the variable that we just removed, click on the OK button.

Page 41: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 41

Communalities

1.000 .591

1.000 .568

1.000 .582

1.000 .679

1.000 .725

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

Communality satisfactory for all variables

Complex structure occurs when one variable has high loadings or correlations (0.40 or greater) on more than one component. If a variable has complex structure, it should be removed from the analysis.

Variables are only checked for complex structure if there is more than one component in the solution. Variables that load on only one component are described as having simple structure.

Once any variables with communalities less than 0.50 have been removed from the analysis, the pattern of factor loadings should be examined to identify variables that have complex structure.

Page 42: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 42

Rotated Component Matrixa

-.265 .722

.732 -.182

-.581 .495

.092 .819

.847 .087

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

AMERICAN DOINGENOUGH TO PROTECTENVIRONMENT

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

1 2

Component

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.

Rotation converged in 3 iterations.a.

Identifying complex structure

On iteration 4, the variable "America doing enough to protect environment" [amprogrn] was found to have complex structure.

Specifically, the variable had a loading of -0.581 on component 1 and a loading of 0.495 on component 2.

The variable should be removed and the principal component analysis should be repeated.

If only one component has been extracted, each variable can only load on that one factor, so complex structure is not an issue.

Page 43: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 43

Repeating the factor analysis

In the drop down menu, select Factor Analysis to reopen the factor analysis dialog box.

Page 44: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 44

Removing the variable from the list of variables

First, highlight the amprogrn variable.

Second, click on the left arrow button to remove the variable from the Variables list box.

Page 45: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 45

Replicating the factor analysis

The dialog recall command opens the dialog box with all of the settings that we had selected the last time we used factor analysis.

To replicate the analysis without the variable that we just removed, click on the OK button.

Page 46: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 46

Rotated Component Matrixa

-.207 .756

.801 -.229

.051 .830

.861 .059

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

1 2

Component

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.

Rotation converged in 3 iterations.a.

Checking for single-variable components

On iteration 5, none of the variables demonstrated complex structure.

It is not necessary to remove any additional variables because of complex structure.

After variables have been removed for low communalities and complex structure, the factor solution is examined to remove any components that have only a single variable loading on them.

If a component has only a single variable loading on it, the variable should be removed from the next iteration of the principal component analysis.

Page 47: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 47

Variable loadings on components

Rotated Component Matrixa

-.207 .756

.801 -.229

.051 .830

.861 .059

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

1 2

Component

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.

Rotation converged in 3 iterations.a.

On iteration 5, the 2 components in the analysis had more than one variable loading on each of them.

Communalities

1.000 .615

1.000 .694

1.000 .691

1.000 .744

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

The communalities for all of the variables included on the components were greater than 0.50 and all variables had simple structure.

The principal component analysis has been completed.

Page 48: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 48

Rotated Component Matrixa

-.207 .756

.801 -.229

.051 .830

.861 .059

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

1 2

Component

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.

Rotation converged in 3 iterations.a.

Interpreting the principal components

The information in 4 of the variables can be represented by 2 components.

Component 1 includes the variables "danger to the environment from modifying genes in crops" [genegen] and "likelihood of nuclear power station damaging environment in next 5 years" [nukeacc]. We can substitute one component variable for this combination of variables in further analyses.

Component 2 includes the variables "claims about environmental threats are exaggerated" [grnexagg] and "poorer countries should be expected to do less for the environment" [ldcgrn]. We can substitute one component variable for this combination of variables in further analyses.

Page 49: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 49

Communalities

1.000 .615

1.000 .694

1.000 .691

1.000 .744

ENVIRONMENTALTHREATSEXAGGERATED

HOW DANGEROUSMODIFYING GENES INCROPS

POOR COUNTRIESLESS THAN RICH FORENVIRONMENT

LIKELIHOOD OFNUCLEAR MELTDOWNIN 5 YEARS

Initial Extraction

Extraction Method: Principal Component Analysis.

Variance explained in individual variables

The components explain at least 50% of the variance in each of the variables included in the final analysis.

Page 50: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 50

Total Variance Explained

1.626 40.651 40.651 1.626 40.651 40.651 1.428

1.119 27.968 68.619 1.119 27.968 68.619 1.317

.694 17.341 85.960

.562 14.040 100.000

Component1

2

3

4

Total % of Variance Cumulative % Total % of Variance Cumulative % Total

Initial Eigenvalues Extraction Sums of Squared Loadings Rotation Sums of Squared Loadings

Extraction Method: Principal Component Analysis.

Total variance explained

The 2 components explain 68.619% of the total variance in the variables which are included on the components.

Page 51: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 51

Total Variance Explained

1.626 40.651 40.651 1.626 40.651 40.651 1.428

1.119 27.968 68.619 1.119 27.968 68.619 1.317

.694 17.341 85.960

.562 14.040 100.000

Component1

2

3

4

Total % of Variance Cumulative % Total % of Variance Cumulative % Total

Initial Eigenvalues Extraction Sums of Squared Loadings Rotation Sums of Squared Loadings

Extraction Method: Principal Component Analysis.

Answering the problem question

The answer to the original question is true with caution.

Component 1 includes the variables "danger to the environment from modifying genes in crops" [genegen] and "likelihood of nuclear power station damaging environment in next 5 years" [nukeacc]. We can substitute one component variable for this combination of variables in further analyses.

Component 2 includes the variables "claims about environmental threats are exaggerated" [grnexagg] and "poorer countries should be expected to do less for the environment" [ldcgrn]. We can substitute one component variable for this combination of variables in further analyses.

The components explain at least 50% of the variance in each of the variables included in the final analysis.

The components explain 68.619% of the total variance in the variables which are included on the components.

A caution is added to our findings because of the inclusion of ordinal level variables in the analysis and the limited sample available for analysis.

Page 52: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 52

Problem 2

2. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Answer the question based on the results of a principal component analysis prior to testing for outliers, split sample validation, and a test of reliability. Assume that there is no problematic pattern of missing data. Use a level of significance of 0.05.

A principal component analysis of the variables "number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], "age" [age], and "highest academic degree" [degree] does not result in any usable components that reduce the number of variable needed to represent the information in the original variables.

1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Page 53: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 53

Computing a principal component analysis

To compute a principal component analysis in SPSS, select the Data Reduction | Factor… command from the Analyze menu.

Page 54: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 54

Add the variables to the analysis

First, move the variables listed in the problem to the Variables list box.

Second, click on the Descriptives… button to specify statistics to include in the output.

Page 55: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 55

Compete the descriptives dialog box

First, mark the Univariate descriptives checkbox to get a tally of valid cases.

Third, mark the Coefficients checkbox to get a correlation matrix, one of the outputs needed to assess the appropriateness of factor analysis for the variables.

Second, keep the Initial solution checkbox to get the statistics needed to determine the number of factors to extract.

Fourth, mark the KMO and Bartlett’s test of sphericity checkbox to get more outputs used to assess the appropriateness of factor analysis for the variables.

Fifth, mark the Anti-image checkbox to get more outputs used to assess the appropriateness of factor analysis for the variables.

Sixth, click on the Continue button.

Page 56: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 56

Select the extraction method

First, click on the Extraction… button to specify statistics to include in the output.

Page 57: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 57

Compete the extraction dialog box

First, retain the default method Principal components.

Second, click on the Continue button.

Page 58: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 58

Select the rotation method

First, click on the Rotation… button to specify statistics to include in the output.

Page 59: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 59

Compete the rotation dialog box

First, mark the Varimax method as the type of rotation to used in the analysis.

Second, click on the Continue button.

Page 60: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 60

Complete the request for the analysis

First, click on the OK button to request the output.

Page 61: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 61

Level of measurement requirement

"Number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], and "age" [age] are interval level variables, which satisfy the level of measurement requirements for a principal component analysis.

"Highest academic degree" [degree] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for principal component analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.

Page 62: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 62

Descriptive Statistics

40.94 12.624 172

45.37 14.218 172

41.81 11.837 172

1.70 1.185 172

NUMBER OF HOURSWORKED LAST WEEK

RS OCCUPATIONALPRESTIGE SCORE (1980)

AGE OF RESPONDENT

RS HIGHEST DEGREE

Mean Std. Deviation Analysis N

Sample size requirement:minimum number of cases

The number of valid cases for this set of variables is 172.

The preferred minimum sample size requirement of 100 valid cases is satisfied.

Page 63: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 63

Descriptive Statistics

40.94 12.624 172

45.37 14.218 172

41.81 11.837 172

1.70 1.185 172

NUMBER OF HOURSWORKED LAST WEEK

RS OCCUPATIONALPRESTIGE SCORE (1980)

AGE OF RESPONDENT

RS HIGHEST DEGREE

Mean Std. Deviation Analysis N

Sample size requirement:ratio of cases to variables

The ratio of cases to variables in a principal component analysis should be at least 5 to 1.

With 172 and 4 variables, the ratio of cases to variables is 43.0 to 1, which exceeds the requirement for the ratio of cases to variables.

Page 64: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 64

Correlation Matrix

1.000 .160 .021 .174

.160 1.000 .060 .525

.021 .060 1.000 .068

.174 .525 .068 1.000

NUMBER OF HOURSWORKED LAST WEEK

RS OCCUPATIONALPRESTIGE SCORE (1980)

AGE OF RESPONDENT

RS HIGHEST DEGREE

Correlation

NUMBER OFHOURS

WORKEDLAST WEEK

RSOCCUPATIONAL

PRESTIGE SCORE

(1980)

AGE OFRESPON

DENTRS HIGHEST

DEGREE

Appropriateness of factor analysis:Presence of substantial correlations

Principal components analysis requires that there be some correlations greater than 0.30 between the variables included in the analysis.

For this set of variables, there is 1 correlation in the matrix greater than 0.30, which does not satisfy this requirement.

Since we do not satisfy a basic requirement for principal component factor analysis, the answer to the question is inappropriate application of a statistic

Page 65: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 65

Steps in principal component analysis - 1

The following is a guide to the decision process for answering problems about principal component analysis:

Is the number of valid cases 50 or more?

Yes

No

Incorrect application of a statistic

Yes

NoAre the variables included in the analysis metric?

Is the ratio of cases to variables at least 5 to 1?

Yes

No

Incorrect application of a statistic

Incorrect application of a statistic

Page 66: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 66

Steps in principal component analysis - 2

Is the measure of sampling adequacy larger than 0.50 for each variable?

Yes

No

Incorrect application of a statistic

Yes

NoAre there two or more correlations that are 0.30 or greater?

Overall measure of sampling adequacy greater than 0.50?

Yes

NoIncorrect application of a statistic

Remove variable with lowest MSA and repeat analysis

Page 67: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 67

Steps in principal component analysis - 3

Incorrect application of a statistic

Yes

NoProbability for Bartlett test of sphericity less than level of significance?

Communality for each variable greater than 0.50?

Yes

No Remove variable with lowest communality and repeat analysis

Page 68: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 68

Steps in principal component analysis - 4

Does all variables show simple structure (only 1 loading > 0.40)?

Yes

No

Do all of the components have more than one variable loading on it?

Yes

No

Remove variable with complex structure and repeat analysis

FalseNoAre the number of

components and pattern of loadings correct?

Remove single variable loading on component and repeat analysis

Yes

Page 69: Principal component analysis

SW388R7Data Analysis

& Computers II

Slide 69

Steps in principal component analysis - 5

Are all of the metric variables included in the final analysis interval level?

Yes

No

Is the sample size for the final analysis include 100 cases or more?

Yes

No

True with cautionNoIs the cumulative

proportion of variance for variables 60% or higher?

Yes

True with caution

True with caution

True