Exploratory Data Analysis Using Math 1.7 Data of CARSU


A thesis by an undergraduate mathematics student of Caraga State University, in partial fulfillment of the requirements for the degree of Bachelor of Science in Mathematics, major in Applied Statistics.


CHAPTER 1

INTRODUCTION

Many scientific studies are characterized by the fact that "numerous variables are used to characterize objects" [1]. Examples are studies in which questionnaires consisting of many questions (variables) are used, and studies in which mental ability is tested via several subtests, such as skills tests and logical reasoning tests [2]. Because of the large number of variables in play, such studies can become rather complicated: as more and more variables are added, more and more of them overlap. For situations such as these, Exploratory Factor Analysis (EFA) was developed. Broadly speaking, factor analysis provides the tools for analyzing the structure of the interrelationships among a large number of variables by defining sets of highly interrelated variables, known as factors, which are assumed to represent dimensions within the data and to partially or completely replace the original set of variables [5].

This paper explores the use of EFA as a variable-reduction multivariate technique.

Further, it assesses results on different data formats.

The goal of this paper is to discuss common practice in studies using EFA and to provide practical information on best practices in its use. In particular, we address three issues:

(1) to determine which type of data format is consistent for the EFA technique;

(2) to evaluate which type of factor rotation is the most appropriate for EFA; and

(3) to assess the results of EFA under split-sampling.


1.1 Preliminaries

A hypothesis is a tentative theory that aims to explain facts about the real world. In statistics, a statistical hypothesis is a conjecture about a population parameter; this conjecture may or may not be true. A parameter is a characteristic or measure obtained by using all the data values of a specific population. Statistical hypotheses come in two forms: when the statement indicates that there is no difference between two parameters, the hypothesis is a null hypothesis, denoted by H0; when the statement indicates that there is a difference between two parameters, the hypothesis is an alternative hypothesis, denoted by H1 [8].

Let n be the number of observations for two variables x and y. As defined in [7], the correlation between x and y, denoted by r, is given by

$$ r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} $$

and is a single number that describes the strength of the relationship between x and y. A zero correlation indicates no

relationship between x and y. When x and y have a positive correlation, x and y move in the

same direction, i.e., as x increases, y also increases or the other way around. On the other hand,

when x and y have a negative correlation, x and y move in the opposite direction. That is, when

x increases, y decreases, or vice versa. In addition, a partial correlation is the relationship between two variables when the effects of one or more other related variables are removed. A correlation matrix is a symmetric matrix showing the intercorrelations among all variables. Its diagonal has a uniform value of 1.000, which is the correlation of each variable with itself. The number of correlations (m) in a correlation matrix is given by $m = (n^2 - n)/2$, where n is the number of variables.
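To make the two definitions concrete, here is a minimal Python sketch (NumPy assumed available; the function names are our own) that computes r from the formula above and the correlation count m:

```python
import numpy as np

def pearson_r(x, y):
    """Correlation r of paired observations x and y, per the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

def num_correlations(n_vars):
    """Number of distinct correlations m = (n^2 - n) / 2 for n variables."""
    return (n_vars ** 2 - n_vars) // 2

print(num_correlations(26))  # 325, matching Table 1 of Appendix B
```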

To exemplify the concepts of correlation and partial correlation, consider the 26 variables of the prelim examination results in Table 1 and Table 2 of Appendix B. Table 1 is the correlation matrix of the 26 variables; it contains 325 correlations, and on its diagonal is a uniform value of 1.00, the correlation of each variable with itself. Moreover, variables X16 (adding unlike terms) and X18 (incorrect application of DPMA) have a correlation of 0.96, which implies that X16 and X18 are highly and positively correlated (i.e., X16 and X18 move in the same direction: when X16 increases, X18 also increases, and when X16 decreases, X18 also decreases). Table 2, on the other hand, presents the partial correlations of the variables.


When a study has been conducted, one of the goals of the researcher is to describe the characteristics of the data set. The first attempt is usually at the descriptive measures of the data, that is, the measures of central tendency, which serve to locate the center of the data set, and the measures of dispersion.

Let x1, x2, ..., xn be the n observations. The measures of central tendency include the mean, median, and mode. The median of n observations, according to [11], can be defined as the "middlemost" value once the data are arranged according to size. More precisely, if n is odd, the median is the value of the observation numbered $(n+1)/2$; if n is even, the median is defined as the average of the observations numbered $n/2$ and $(n+2)/2$.

On the other hand, one measure of the dispersion of the data is the sample variance, denoted by $s^2$ and given by

$$ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1} $$

[10, 11, 12].
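A small Python sketch of both definitions (the function names are ours; NumPy assumed):

```python
import numpy as np

def median(data):
    """Median per the definition above: the middlemost sorted value."""
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]              # observation (n+1)/2
    return (xs[n // 2 - 1] + xs[n // 2]) / 2.0   # mean of obs n/2 and (n+2)/2

def sample_variance(data):
    """Sample variance s^2 per the computational formula above."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    return (np.sum(x ** 2) - np.sum(x) ** 2 / n) / (n - 1)
```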

According to [5], when a variable x is correlated with a variable y, x shares variance with y, and the amount of shared variance is simply the squared correlation. So, from the correlation example, X16 and X18, which have a correlation of 0.96, share 92 percent of their variance (0.96² ≈ .92).

Moreover, common variance is defined as that variance in a variable that is shared with

all other variables in the analysis. This variance is accounted for or shared based on a variable’s

correlations with all other variables in the analysis. A variable’s communality is the estimate of

its shared variance among the variables.


1.2 Factor Analysis Decision Process [5]

The ultimate goal of any multivariate technique is to obtain reliable results and an informative interpretation of the data. To achieve this goal, factor analysis follows a seven-stage model-building paradigm. Figure 1 shows the seven stages of EFA.

As shown in the figure below, the starting point of EFA is the research problem. In this stage the researcher decides, based on the objective, what type of factor analysis to employ. If the objective is to test a prespecified structure among the variables, confirmatory factor analysis, carried out through structural equation modeling, is employed. In contrast, if the study is exploratory, whether the objective is to summarize the data, i.e., to identify latent dimensions within the data, or to extend that summarization to data reduction by deriving estimates such as factor scores and composite summated scales, then exploratory factor analysis is appropriate.

As an illustrative example of the EFA decision process, we consider the application of EFA to the data of the prelim departmental examination results of the Math 1.7 students. There are 26 metric variables in rubric form, representing the weaknesses found for each student in the problem-solving part of the exam. Since the objective of the analysis is to reduce the 26 variables, exploratory factor analysis will be used.


The next stage involves decisions on the design of the factor analysis. When variables are to be grouped, R-type factor analysis is utilized; if the researcher instead decides to group the respondents, some form of factor analysis of respondents, such as Q-type factor analysis or cluster analysis, is used.

Continuing the illustrative example, the design of our factor analysis will be R-type, since we intend to group variables.

[Figure 1: Stages of EFA. A flowchart from the research problem through seven stages: Stage 1, select the type of factor analysis: confirmatory (leading to structural equation modeling) or exploratory; Stage 2, research design: grouping cases or variables; Stage 3, assumptions; Stage 4, selecting a factor method: common factor analysis or principal component analysis; Stage 5, selecting a rotational method (orthogonal or oblique), specifying the factor matrix, and interpreting the rotated factor matrix, with factor model respecification (yes/no) where needed; Stage 6, validation; Stage 7, computation of factor scores.]

The third stage of EFA focuses on checking the assumptions. The first things to consider are the selection of variables and the appropriateness of the sample size. Metric variables are the most appropriate in EFA, since the typical correlation matrix of variables can be used directly; if a nonmetric variable must be included in the analysis, one approach is to define dummy variables (coded 0-1) to represent the categories of the nonmetric variable. If all the variables are dummy variables, then specialized forms of factor analysis, such as Boolean factor analysis, are more appropriate. Regarding the sample size question, the researcher generally would not factor analyze a sample of fewer than 50 observations; preferably the sample size should be 100 or larger. As a general rule, the minimum is to have at least five times as many observations as the number of variables to be analyzed, and a more acceptable sample would have a 10:1 ratio.

Another method of determining the appropriateness of factor analysis examines the entire correlation matrix. The Bartlett test of sphericity tests the null hypothesis that the correlation matrix is an identity matrix. For factor analysis to work, we need this test to be significant (i.e., p-value < .05), so that the correlation matrix is not an identity matrix and there are some relationships between the variables we hope to include in the analysis.
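As an illustration, the chi-square form of Bartlett's sphericity test can be sketched in Python as follows; this is a common textbook formulation, not the thesis's SPSS output (NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 says the correlation matrix
    of the data X (observations x variables) is an identity matrix."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Test statistic based on the log-determinant of R.
    stat = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2.0
    p_value = chi2.sf(stat, df)    # significant (< .05) is what we want
    return stat, p_value
```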

A third measure to quantify the degree of intercorrelations among the variables and the

appropriateness of factor analysis is the measure of sampling adequacy (MSA). This index ranges

from 0 to 1, reaching 1 when each variable is perfectly predicted without error by the other

variables. As a rule of thumb, MSA below .50 is unacceptable.

In addition to a visual examination of a variable's correlations with the other variables in the analysis, the MSA guidelines can be extended to individual variables. The researcher should examine the MSA value of each variable and exclude those falling in the unacceptable range: first delete the variable with the lowest MSA, then recalculate the factor analysis, and continue deleting the variable with the lowest MSA value under .50 until all remaining variables have acceptable MSA values.
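The per-variable MSA (the Kaiser-Meyer-Olkin measure) and the deletion loop just described can be sketched as follows; the anti-image partial correlations come from the inverse of the correlation matrix (the function names and cutoff default are ours):

```python
import numpy as np

def msa_per_variable(R):
    """MSA (KMO) for each variable from the correlation matrix R."""
    S = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(S), np.diag(S)))
    Q = -S / d                           # anti-image partial correlations
    np.fill_diagonal(Q, 0.0)
    R0 = R.copy()
    np.fill_diagonal(R0, 0.0)
    r2, q2 = (R0 ** 2).sum(axis=0), (Q ** 2).sum(axis=0)
    return r2 / (r2 + q2)

def drop_low_msa(R, names, cutoff=0.50):
    """Repeatedly delete the variable with the lowest MSA under the cutoff."""
    names = list(names)
    while True:
        msa = msa_per_variable(R)
        worst = int(np.argmin(msa))
        if msa[worst] >= cutoff:
            return R, names              # all variables are now acceptable
        keep = [i for i in range(len(names)) if i != worst]
        R = R[np.ix_(keep, keep)]
        names = [names[i] for i in keep]
```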

In addition, the researcher must also ensure that the data matrix has sufficient correlations

to justify the application of factor analysis. If visual inspection of the correlation matrix reveals

no substantial number of correlations greater than .30, then factor analysis is probably

inappropriate.

For checking the assumptions in our example, refer to the tables in Appendix B. We have a sample size of 864, which falls within acceptable limits, and the 26 variables are in rubric form, which is metric. Inspection of the correlation matrix in Table 1 reveals the presence of correlations greater than .300 (shaded in color), which provides an adequate basis for proceeding to an empirical examination of adequacy for factor analysis, both on an overall basis and for each variable. The Bartlett test (p-value = .000, highly significant) finds that the correlations, taken collectively, are nonzero, which implies that the variables in the analysis are interrelated. Clearly, our correlation matrix shows that the correlations of the variables are not zero. Checking the assumptions further, Table 3 presents the MSA values and the variables deleted from the analysis. Here the overall MSA value of .466 does not fall in the acceptable range. Examining the individual variables' MSA values, which are the numbers on the diagonal of Table 2, identifies fifteen variables (X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15) with MSA values under .50. Because X19 has the lowest MSA value, it is omitted first in the attempt to attain a set of variables that exceeds the minimum acceptable MSA level. Recalculating the MSA values finds another variable with an MSA value below .50 (listed in Table 2), so it is also deleted from the analysis. This process of recalculating the MSA values continues until all variables meet the acceptable MSA level, leaving an overall MSA value of .594. As observed, the deletion of those variables increases the overall MSA of the analysis. As a result, Table 3 contains the correlation matrix for the revised set of variables, that is, with the variables whose MSA fell below .50 deleted. As with the full set of variables, the Bartlett test shows nonzero correlations, and the correlation matrix shows the presence of correlations greater than .30 (in color). Finally, examination of the partial correlations in Table 4 shows only two with values greater than .50 (X5-X4, X12-X16; values in color), another indicator of the strength of the interrelationships among the variables in the reduced set. As shown, the reduced set of variables meets the fundamental requirements, is appropriate for factor analysis, and the analysis can proceed to the next stages.

The next stage involves decisions concerning the method of extracting the factors and the selection of the factors to be retained. The researcher can choose between two similar, yet distinct, extraction methods, namely common factor analysis (CFA) and principal component analysis (PCA). The decision on which method to use is based on the objectives of the factor analysis.

PCA is most appropriate when data reduction is the primary concern, focusing on the minimum number of factors needed to account for the maximum portion of the total variance represented in the original set of variables. When PCA is the extraction method, an extracted linear combination of highly interrelated variables is called a component.

In contrast, CFA is most appropriate when the primary objective is to identify the latent dimensions or constructs represented in the original variables. When CFA is used, an extracted linear combination of variables is called a factor.

Both factor analysis methods are interested in the best linear combination of

variables–best in the sense that the particular combination of original variables accounts for more

of the variance in the data as a whole than any other linear combination of variables. Therefore,

the first factor may be viewed as a single best summary of linear relationships exhibited in the

data. The second factor is defined as the second-best linear combination of the variables, subject

to the constraint that it is orthogonal to the first factor. To be orthogonal to the first factor, the

second factor must be derived from the variance remaining after the first factor has been

extracted. Thus, the second factor may be defined as the linear combination of variables that

accounts for the most variance that is still unexplained after the effect of the first factor has been

removed from the data. The process continues extracting factors accounting for smaller and

smaller amounts of variance until all of the variance is explained.

In deciding when to stop factoring, the researcher generally begins with some predetermined criteria, such as a general number of factors plus some thresholds of practical relevance (e.g., a required percentage of variance explained). No exact quantitative basis for deciding the number of factors to extract has been developed; however, several stopping criteria for the number of factors to extract are currently in use.

The latent root criterion is the most commonly used technique and applies to either PCA or CFA. The rationale of this criterion is that any individual factor should account for at least the variance of a single variable if it is to be retained for interpretation. Thus, only factors having latent roots, or eigenvalues, greater than 1 are considered significant.

Furthermore, the percentage of variance criterion is an approach based on achieving a

specified cumulative percentage of total variance extracted by successive factors. The purpose is

to ensure practical significance for the derived factors by ensuring that they explain at least a

specified amount of variance. No absolute threshold has been adopted for all applications.

However, in the social sciences, where information is often less precise, it is not uncommon to

consider a solution that accounts for 60 percent of the total variance.

Another technique for factor retention is the scree test criterion, a graphical approach in which the latent roots are plotted against the number of factors in their order of extraction; the shape of the resulting curve is used to evaluate the cutoff point.
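A short sketch of the numeric side of these stopping rules (latent root and percentage of variance); the eigenvalues it returns are also the input for a scree plot (names are ours; NumPy assumed):

```python
import numpy as np

def retention_summary(R):
    """Eigenvalues of the correlation matrix R, each component's share of
    the total variance, the cumulative share, and the latent root count."""
    eig = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending eigenvalues
    pct = 100.0 * eig / eig.sum()                # % of variance per factor
    cum = np.cumsum(pct)                         # cumulative % of variance
    n_latent_root = int(np.sum(eig > 1.0))       # eigenvalue > 1 rule
    return eig, pct, cum, n_latent_root

# Plotting eig against the component number 1..p gives the scree test curve.
```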

In our example, the primary objective is variable reduction, so PCA is the most appropriate tool for factor extraction. Table 5 contains the information on the 14 possible factors and their relative explanatory power as expressed by their eigenvalues. Applying the latent root criterion of retaining factors with eigenvalues greater than 1.0, five factors would be retained. However, those five factors represent only 59.495 percent of the variance of the 14 variables, which is not sufficient in terms of total variance explained. The scree test in Figure 1 of Appendix B, however, indicates that six factors may be appropriate for retention, accompanied by 66.531 percent of the total variance explained, which meets the requirement. Although the eigenvalue of the sixth factor falls short of the latent root criterion value of 1.0, it is quite close to 1, so that criterion alone need not preclude its inclusion. Combining all these criteria leads to the conclusion to retain six factors for further analysis.

The fifth stage of EFA is factor interpretation. After defining the number of factors to be retained for interpretation, the researcher examines the unrotated factor loading matrix, which contains the factor loadings of each variable on each factor. A factor loading is the correlation between a variable and a factor. Through the factor loadings of each variable on the factors, the researcher can assess whether each variable's communality falls within acceptable limits. The communality is equal to the total sum of the squared factor loadings of a variable across the factors, and it is specified that at least one-half of the variance of each variable must be taken into account. The size of the communality is a useful index for assessing how much variance in a particular variable is accounted for by the factor solution: higher communality values indicate that a large amount of the variance in a variable has been extracted by the factor solution, while small communalities show that a substantial portion of the variable's variance is not accounted for by the factors. Although no statistical guidelines indicate exactly what is "large" or "small", practical considerations dictate a lower level of .50 for communalities in this analysis. Using this guideline, any variable with a communality of less than .50 is regarded as insufficiently explained.

When a variable x has a communality below .500, x is deleted and the analysis is recalculated until all remaining variables have communalities greater than .500.
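Computationally, the communality check reduces to row sums of squared loadings, as in this hedged sketch (the function names and cutoff default are ours):

```python
import numpy as np

def communalities(loadings):
    """Communality of each variable: sum of its squared factor loadings."""
    L = np.asarray(loadings, dtype=float)
    return (L ** 2).sum(axis=1)

def low_communality(loadings, names, cutoff=0.50):
    """Variables below the .50 communality guideline: candidates for
    deletion before the analysis is recalculated."""
    return [n for n, h in zip(names, communalities(loadings)) if h < cutoff]
```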

The total amount of variance explained by either a single factor or the overall factor solution can

be compared to the overall variation in the set of variables as represented by the trace of the factor


matrix. The trace is the total variance to be explained and is equal to the sum of the eigenvalues of the

variable set. In component analysis, the trace is equal to the number of variables because each variable

has a possible eigenvalue of 1.0. By adding the percentages of trace for each of the factors (or dividing

the total eigenvalues of the factors by the trace), we obtain the total percentage of trace extracted for the

factor solution. This total is used as an index to determine how well a particular factor solution accounts

for what all the variables together represent. If the variables are all very different from one another, this

index will be low. If the variables fall into one or more highly redundant or related groups, and if the

extracted factors account for all the groups, the index will approach 100 percent.

When all the variables have communalities at acceptable levels, the researcher then examines the significant factor loadings in the unrotated factor matrix.

The process of interpretation would be greatly simplified if each variable had only one significant loading. When a variable is found to have more than one significant loading, it is said to cross-load, and the difficulty arises when a variable has several significant loadings. If a variable persists in cross-loading, it becomes a candidate for deletion.

In most cases, the majority of the variables in the unrotated factor matrix cross-load. Because of this, the interpretation is difficult and rather meaningless, and we rotate the factors in the hope of finding a more simplified structure.

In factor analysis there are two types of factor rotation: orthogonal factor rotation and oblique factor rotation. The first type is subject to the constraint that the axes of rotation are maintained at 90 degrees, which ensures that whenever EFA uses orthogonal rotation the extracted components are uncorrelated; in other words, the variables on one component cannot be explained by the variables on the other components. The second type is similar, except that oblique rotations allow correlated factors instead of maintaining independence between the rotated factors, so some variables on one component can be explained by variables on another component.

In practice, the objective of all rotation methods is to simplify the rows and columns of the factor matrix to facilitate interpretation. In a factor matrix, columns represent factors and each row corresponds to a variable's loadings across the factors. By simplifying the rows we mean making as many values in each row as close to zero as possible (i.e., maximizing a variable's loading on a single factor); by simplifying the columns we mean making as many values in each column as close to zero as possible (i.e., making the number of high loadings as few as possible).

Three major orthogonal approaches have been developed. The first is QUARTIMAX rotation, whose ultimate goal is to simplify the rows of the factor matrix. In contrast to QUARTIMAX, the VARIMAX criterion centers on simplifying the columns of the factor matrix. Finally, the EQUIMAX approach is a compromise between QUARTIMAX and VARIMAX.

On the other side, Direct Oblimin is the default oblique factor rotation in SPSS.

No specific rules have been developed to guide the researcher in selecting a particular

orthogonal or oblique rotational technique. The choice should be made on the basis of the

particular needs of a given research problem.

It is further worth noting that when the objective of the study is data reduction, orthogonal factor rotation using the VARIMAX approach is more appropriate.
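For reference, the VARIMAX criterion can be implemented compactly with the standard SVD-based algorithm; this is a generic sketch, not the exact routine SPSS runs:

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a loading matrix L (variables x factors):
    iteratively seeks the orthogonal rotation T maximizing the variance
    of the squared loadings within each column."""
    p, k = L.shape
    T = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        LR = L @ T
        B = L.T @ (LR ** 3 - (gamma / p) * LR @ np.diag((LR ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(B)
        T = U @ Vt                     # nearest orthogonal rotation
        if s.sum() - obj < tol:        # converged: criterion stopped rising
            break
        obj = s.sum()
    return L @ T
```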


As a final process, the researcher evaluates the rotated factor loadings of each variable in order to determine that variable's role and contribution in determining the factor structure. A .30 loading translates to approximately 10 percent explanation, and a .50 loading denotes that 25 percent of the variance is accounted for by the factor. The loading must exceed .70 for the factor to account for 50 percent of the variance of a variable. Thus, the larger the absolute size of the factor loading, the more important the loading is in interpreting the factor matrix.

Assuming the objective of obtaining a power level of 80 percent, the use of a .05 significance level, and the proposed inflation of the standard errors of factor loadings, Table 1 contains the sample sizes necessary for each factor loading value to be considered significant.

Table 1: Guidelines for Identifying Significant Factor Loadings Based on Sample Size

When an acceptable factor solution has been obtained in which all variables have a

significant loading on a factor, the researcher attempts to assign more meaning to the pattern of

factor loadings. Variables with higher loadings are considered more important and have greater

influence on the name or label selected to represent a factor. This label is not derived or assigned

by the factor analysis computer program; rather, the label is intuitively developed by the


researcher based on its appropriateness for representing the underlying dimensions of a

particular factor.

For our example, Table 6 contains the unrotated factor loading matrix of the six components retained. The seventh column provides summary statistics detailing how well each variable is explained by the six components. For instance, the communality figure of .548 for variable X15 indicates that it has less in common with the other variables in the analysis than does variable X16, which has a communality of .900. Both variables, however, still share more than one-half of their variance with the six components. On the other hand, two variables with communalities below .500 appear (X9 and X12); these will be eliminated.

With X9 and X12 eliminated, Table 7 displays the factor loading matrix of the revised set of variables with five components extracted. In that table another two variables (X5 and X14) were observed to have communalities below .500, requiring their deletion from the analysis. Table 8 portrays the revised factor loading matrix with X5 and X14 deleted; examination of the table reveals no variable with a communality below .50, so the analysis proceeds to the next stage.

In Table 8, the first row of numbers at the bottom of each column gives the column sums of squared factor loadings (eigenvalues) and indicates the relative importance of each factor in accounting for the variance associated with the set of variables. Note that the sums of squares for the five factors are 2.40, 1.79, 1.53, 1.11, and 1.08, respectively. As expected, the factor solution extracts the factors in order of importance, with factor 1 accounting for the most variance, factor 2 slightly less, and so on through all five factors. The total of the eigenvalues, 7.91, represents the total amount of variance extracted by the factor solution. The percentages of trace explained by each of the five factors (21.82%, 16.27%, 13.91%, 10.09%, and 9.82%, respectively) are shown as the last row of values in Table 8. The index for the overall solution shows that 71.91 percent of the total variance is represented by the


information contained in the factor matrix of the five-factor solution. The index for this solution is therefore high, and the variables are in fact highly related to one another.

Given the sample size of 864, factor loadings of .30 and higher are considered significant for interpretative purposes. The shaded numbers in Table 8 indicate the significant loadings. Here, 10 of the 11 variables have cross-loadings; thus, we need to rotate the factors to obtain a more simplified structure.

The VARIMAX-rotated component analysis factor matrix is shown in Table 10. Note

that the total amount of variance extracted is the same in the rotated solution as it was in the

unrotated one, 71.91 percent. Also, the communalities for each variable do not change when a

rotation technique is applied. Still, two differences do emerge. First, the variance is redistributed

so that the factor-loading pattern and the percentage of variance for each of the factors are

slightly different. Specifically, in the VARIMAX-rotated factor solution, the first factor

accounts for 20.09 percent of the variance, compared to 21.82 percent in the unrotated solution.

Likewise, the other factors also change, the largest change being the fourth factor, increasing

from 10.09 percent in the unrotated solution to 11.55 percent in the rotated solution. Thus, the

explanatory power shifted slightly to a more even distribution because of the rotation. Second,

the interpretation of the factor matrix is simplified.

Having defined the various elements of the rotated factor matrix, let us examine the pattern of significant factor loadings in the hope of finding a simplified structure.

In the rotated factor solution, each of the variables has a significant loading on only one factor, except for X1, X11, and X15. Moreover, the variables with no cross-loadings exhibit factor loadings above .50, meaning that more than one-half of their variance is accounted for by the loading on a


single factor. With all of the communalities of sufficient size to warrant inclusion, the only remaining decision is to delete the cross-loading variables X1, X11, and X15.

With X1, X11, and X15 deleted, Table 11 (left side) displays the revised factor-loading matrix. The matrix shows a simplified structure of components, but assessing the communalities of the individual variables reveals X3 and X7 with communalities below .50. Finally, deleting those two variables leaves us with six variables in the analysis. As we see, the factor loadings for the six variables remain almost identical, exhibiting both the same pattern and almost the same values for the loadings. The amount of explained variance increases to 82 percent. With the simplified pattern of loadings, all communalities above 50 percent, and an overall level of explained variance that is high enough, the six-variable, three-factor solution is accepted.

The sixth stage involves assessing the degree of generalizability of the results to the population. The most direct method of validating the results is to split the original data set into samples and assess the replicability of the results. Factor stability is the primary concern in assessing the robustness of the solution across the samples.
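A minimal sketch of the split-sample validation described here, assuming the observations are rows of a NumPy array (the 60%-40% split fraction and the seed are illustrative):

```python
import numpy as np

def split_sample(X, frac=0.60, seed=0):
    """Randomly split the rows of X into two subsamples (e.g. 60%-40%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(frac * len(X))
    return X[idx[:cut]], X[idx[cut:]]

# Re-running the extraction and rotation on each subsample and comparing
# the variable composition of each component gauges factor stability.
```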

1.3 Additional Uses of Factor Analysis Results

This stage includes two options: (a) Selecting the variable with the highest factor loading

as a surrogate representative for a particular factor dimension; (b) Replacing the original set of

variables with an entirely new, smaller set of variables created either from summated scales or

factor scores.

If the researcher’s objective is simply to identify appropriate variables for subsequent

application with other statistical techniques, the researcher has the option of examining the


factor matrix and selecting the variable with the highest factor loading on each factor to act as a

surrogate variable that is representative of that factor.
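Selecting surrogates is a one-liner once the rotated loadings are in hand, as in this sketch (the function name is ours):

```python
import numpy as np

def surrogate_variables(loadings, names):
    """For each factor, the variable with the highest absolute loading."""
    L = np.abs(np.asarray(loadings, dtype=float))
    return [names[i] for i in L.argmax(axis=0)]
```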

1.4 Statistical Packages

1.4.1 SPSS [9]

SPSS (Statistical Package for the Social Sciences) is a statistical analysis and data management software package. SPSS can take data from almost any type of file and use them to generate tabulated reports, charts, plots of distributions and trends, and descriptive statistics, and to conduct complex statistical analyses.

Moreover, SPSS is a powerful, user-friendly package for the manipulation and statistical analysis of data. It is widely used in the social sciences, such as psychology, sociology, and psychiatry, containing as it does an extensive range of both univariate and multivariate procedures.


CHAPTER 2

METHODOLOGY

2.1 The Data

The data used in the empirical analysis were obtained from the results of the Prelim, Midterm, and Final Departmental Examinations of the students taking Math 1.7 at Caraga State University (CSU)-Main Campus, Ampayon, Butuan City during the first (1st) semester of SY 2012-2013. There were 864 students who took the Prelim Departmental Examination, 500 the Midterm, and 565 the Final Examination.

The data focus on the unmastered competencies (weaknesses) of the students in problem solving. Manipulating the data of each departmental examination yields 26 metric variables for the prelim, 29 for the midterm, and 28 for the final. Refer to Appendix C for the list of variables and their corresponding deficiency indicators.

2.2 Data Format


To achieve the specified objectives of this study, we use three methods of quantifying our data. Table 2.2.1 presents each data format with its corresponding description and illustration.

Table 2.2.1: Description and Illustration of Each Data Format Used for EFA

Rubric-Score
  Description: RS = Weight - Score
  Illustration: The level of weakness of each student is taken as the weight of each problem item minus the student's corresponding score.

Likert Scale
  Description: Equally-spaced 5-category scale from 1 to 5 on each item
  Illustration: The total weights of each weakness are summed and used to derive the equally-spaced interval scale from 1 to 5.

Thurstone
  Description: Median-based rating (0-5) of each weakness according to how difficult each problem is
  Illustration: Three persons rate each weakness, and the median rating is taken as the standard score.

2.3 Computing in SPSS

The following procedures were utilized in the analysis of the data using SPSS.

2.3.1 Starting up SPSS and Data Entry


As shown in Figure 2.3.1.1:

1. In the computer's search box for programs and files, type SPSS and click the program.

2. From Excel, where the raw data are stored, copy the data and paste them into the SPSS spreadsheet data editor (DATA VIEW).

3. Click VARIABLE VIEW, found just below the spreadsheet data editor. Then label, arrange, and fix all variables so they are ready for the empirical data analysis.

2.4 Data Analysis

As shown in Figure 2.4.1, the data were analyzed using the following steps:

1. After fixing all variables in the data, click ANALYZE in the SPSS menu bar. Find the DATA REDUCTION item and select FACTOR.

2. The dialog shown in Figure 2.3.1(b) will pop up; check DESCRIPTIVES and all the boxes offered for the STATISTICS and CORRELATION MATRIX output. Then click Continue.

3. Click EXTRACTION, to the left of Descriptives. Choose PRINCIPAL COMPONENTS as the extraction method and check CORRELATION MATRIX, UNROTATED FACTOR SOLUTION, and SCREE PLOT. Also choose EXTRACT EIGENVALUES GREATER THAN 1 and use the default maximum number of iterations for convergence, which is 25. Then click Continue again.

4. To the left of Extraction, click ROTATION, choose VARIMAX, and check ROTATED SOLUTION and LOADING PLOTS. Again use the default maximum iterations for convergence of 25. Click Continue.

5. To continue, click SCORES. Under factor scores, check SAVE AS VARIABLES, choose BARTLETT, and check DISPLAY FACTOR SCORE COEFFICIENT MATRIX. Again click Continue.

6. Finally, click OPTIONS. Under missing values, choose EXCLUDE CASES PAIRWISE. Check SORTED BY SIZE and suppress absolute values less than the significant loading corresponding to the sample size of the data analyzed.

7. The OUTPUT NAVIGATOR appears and displays the desired statistical results and tables from the analysis performed.

8. Repeat the same procedure for each remaining data set.
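The thesis performs these steps in the SPSS GUI; for readers without SPSS, roughly the same pipeline can be sketched in Python with pandas and the third-party factor_analyzer package (both assumed installed; the file name math17_prelim.xlsx, the sheet layout, and n_factors=6 are placeholders, the last chosen per the retention criteria of Chapter 1):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

# Step 2 of 2.3.1: load the raw data (here from a placeholder Excel file).
data = pd.read_excel("math17_prelim.xlsx")

# Assumption checks: Bartlett's sphericity test and per-variable/overall MSA.
chi2_stat, p_value = calculate_bartlett_sphericity(data)
kmo_per_var, kmo_overall = calculate_kmo(data)

# Principal-component extraction with VARIMAX rotation (steps 3-4 above).
fa = FactorAnalyzer(n_factors=6, rotation="varimax", method="principal")
fa.fit(data)

eigenvalues, _ = fa.get_eigenvalues()   # latent root criterion / scree input
loadings = fa.loadings_                 # rotated factor loading matrix
scores = fa.transform(data)             # factor scores, as in step 5
```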

[Figure 2.3.1.1: Starting up SPSS and Data Entry, panels (a)-(c)]

[Figure 2.4.1: The Data Analysis, panels (e)-(l)]


CHAPTER 3

RESULTS AND DISCUSSION

This chapter presents the results on the numerical experiment with Exploratory Factor

Analysis (EFA) using empirical data. The empirical data were obtained from the problem solving


part of the departmental examinations of College Algebra (Math 1.7) implemented in Caraga

State University (CSU) in the first semester of the school year 2012-2013.

The first section presents the checking of the assumptions. In particular, the MSA values and the Bartlett test p-values are inspected: variables with MSA values less than .500 are deleted and the analysis is recalculated repeatedly until all remaining variables have MSA values in the acceptable range.

The second section then shows the exploratory factor analysis of the three different data formats across the departmental examinations. The number of components (Co) and the percentage of variance explained (%VE) are the criteria for choosing the most efficient data format, that is, the one on which EFA is consistent.

The third part presents the comparison of the two types of EFA factor rotation, namely orthogonal and oblique. The rotations are compared on the data format chosen in the second section, and the details of the consistencies and inconsistencies of the factor loadings are used to examine which type of rotation is the most appropriate.

Finally, the last section assesses the validation of EFA by split-sampling (60%-40%, 50%-50%) on the rubric data format. The variable composition of each component and the variance explained serve as the basis for judging the stability of the factors.

3.1 The Data Analysis

Table 3.1.2: Overall MSA and Individual Variables’ MSA


Iteration  Bartlett test  Prelim                        Midterm                       Final
           (p-value)      MSA    Vars with MSA < .5     MSA    Vars with MSA < .5     MSA    Vars with MSA < .5
1          .000           .466   X19 = .131             .543   X24 = .399             .544   X11 = .450
2          .000           .526   X22 = .288             .554   X18 = .441             .547   X15 = .458
3          .000           .542   X23 = .366             .582   X20 = .427             .552   X14 = .446
4          .000           .554   X6 & X21 = .391        .594   X22 = .442             .555   X24 = .457
5          .000           .565   X20 = .434             .598   X21 & X25 = .462       .560   X4 = .480
6          .000           .569   X10 = .418             .606   X17 & X14 = .496       .563   None
7          .000           .576   X13 & X24 = .452       .625   X13 = .400
8          .000           .585   X2 = .446              .631   X19 = .474
9          .000           .591   X17 = .477             .641   X27 = .477
10         .000           .594   None                   .645   X11 = .481
11         .000                                         .655   None

The same process of checking the assumptions was applied to the midterm and final departmental examination data, as shown in Table 3.1.2.

3.2 Data Format and Factor Analysis

Applying the criteria above, Table 3.2.2 displays the percentage of variance explained by the extracted components for the three data formats of our data set, namely rubric, Likert, and Thurstone, across the departmental examinations.


Examination of the table reveals that all data formats consistently extract a percentage of variance greater than 60% across data sources.

More specifically, in the prelim data the rubric format extracts the fewest components with the highest percentage of variance explained, while the Likert and Thurstone formats, respectively, extract the fewest components with the least variance explained and the highest variance explained with the most components. This result leads us to choose rubric in both the midterm and final departmental examinations as well.

Based on these findings, rubric is the data format best suited for the application of EFA.

Table 3.2.2: Percentage of Variance Explained (%VE) by the Extracted Components (Co) for Each Data Format of the Unmastered Competencies of College Algebra Across Departmental Examinations

              Rubric          Likert          Thurstone
Data source   Co   %VE        Co   %VE        Co   %VE
Prelim        1    17.651     1    16.306     1    12.307
              2    14.063     2    13.337     2    10.526
              3    11.062     3     8.251     3     9.752
              4     8.452     4     7.368     4     8.214
              5     8.266     5     7.244     5     6.822
              6     7.036     6     6.821     6     6.486
                              7     6.305     7     5.713
                                              8     5.429
Total %VE          66.531          65.631          65.248

Midterm       1    14.419     1    14.325     1    11.201
              2     9.027     2     9.344     2     8.353
              3     8.573     3     8.927     3     7.665
              4     7.353     4     8.175     4     7.474
              5     6.802     5     7.354     5     6.537
              6     6.328     6     6.647     6     6.387
              7     6.083     7     6.100     7     5.909
              8     5.838                     8     5.725
                                              9     5.688
Total %VE          64.442          60.872          64.939

Final         1     9.613     1    11.944     1     8.991
              2     8.658     2    10.482     2     7.477
              3     7.358     3     8.198     3     6.554
              4     6.301     4     6.807     4     6.301
              5     5.631     5     6.241     5     5.377
              6     5.213     6     6.228     6     5.272
              7     4.983     7     5.796     7     4.920
              8     4.840     8     5.714     8     4.714
              9     4.609                     9     4.631
              10    4.437                     10    4.298
                                              11    4.156
Total %VE          61.664          61.410          62.690

3.3 Factor Rotations of EFA


With the above procedures for component rotation, Table 3.3.6 compares the two types of factor rotation, orthogonal versus oblique. We observe that the variable compositions of the first component under both types of rotation are similar: in the prelim, X16 and X18 make up component 1; X10 and X12 in the midterm; and in the final, component 1 is composed of X22 and X3.

More specifically, in the prelim, orthogonal rotation yields the three common components with relatively higher factor loadings than oblique rotation, while extracting more components overall. In the midterm, the same number of components is extracted, but X10 under oblique rotation has a higher factor loading than the first variable of component 1 under orthogonal rotation. However, we notice that oblique rotation does not give consistent factor loadings: X28 has a factor loading of .793 on component 4, which is higher than the .787 of X19 on component 2. This does not give a well-defined factor-loading pattern for our analysis. The same observation is visible in the final results, where inconsistency of factor loadings is present under both types of rotation; moreover, orthogonal rotation extracts more components than oblique rotation.

Thus, the results imply that orthogonal rotation is the more appropriate type of component rotation for EFA.

Table 3.3.6: Orthogonal vs. Oblique Rotation Across Data Sources

Prelim
Co   Orthogonal rotation      Oblique rotation
     Var    FL                Var    FL
1    X16    .982              X18    .973
     X18    .981              X16    .973
2    X4     .900              X4     .875
     X8     .887              X8     .777
                              X9     .670
3    X25    .828              X25    .776
     X26    .794              X26    .741
4    X15    .707
     X3     .602
5    X5     .766

Midterm
Co   Orthogonal rotation      Oblique rotation
     Var    FL                Var    FL
1    X12    .809              X10    .819
     X10    .804              X12    .807
2    X26    .753              X19    .787
     X23    .706              X2     .667
3    X28    .766              X27    .714
     X29    .721              X11    .709
4    X7     .744              X28    .793
     X5     .732              X29    .708
5    X15    .740              X3     .778
     X3     .738              X16    .714

Final
Co   Orthogonal rotation      Oblique rotation
     Var    FL                Var    FL
1    X3     .962              X3     .968
     X22    .961              X22    .967
2    X23    .812              X16    .835
     X16    .731              X23    .831
     X20    .633
3    X21    .827              X8     .764
     X7     .670              X10    .703
4    X10    .737              X21    .778
     X8     .715              X25    -.610
5    X5     .788              X19    -.685
     X2     -.668             X2     -.681
6    X13    .938
7    X28    .921


3.4 On Split-Sampling of Factor Analysis

Cross-examination of the contents of Table 3.4.1 reveals that, across the different data sources, the extracted components of the rubric format are stable. That is, the variable composition of each component in the 100% data analysis remains stable across the 60%-40% split-samples. In the prelim, the composition of variables in each component is strongly stable across our split-samples. In the midterm, the factors are stable but the variable compositions interchange, except on component 4. The same observation holds for components 3 and 4 in the final, while for component 2 only the 60% sample gives the same variable composition. However, the stability of the factors is not affected, as shown by the shading in Table 3.4.1. Hence, the use of EFA on rubric data is validated.

Table 3.4.1: Validation of EFA on Split-Samples Across Data Sources

Prelim Departmental Examination (n=864)
Co   100%        60%         40%
1    X16, X18    X18, X16    X18, X16
2    X4, X8      X4, X8      X8, X4
3    X25, X26    X25, X26    X25, X26
4    X14, X21
5    X3, X15

Midterm Departmental Examination (n=500)
Co   100%        60%         40%
1    X12, X10    X9, X24     X10, X12, X13
2    X26, X23    X12, X10    X29, X28
3    X28, X29    X16, X3
4    X7, X5      X7, X5
5    X15, X3     X26, X23

Final Departmental Examination (n=565)
Co   100%             60%         40%
1    X3, X22          X22, X3     X3, X22
2    X23, X16, X20    X16, X23    X18, X19
3    X21, X7          X10, X8     X15, X24
4    X10, X8          X27, X6     X14, X5
5    X5, X2           X21, X5     X7, X26
6    X13              X25         X17
7    X28

Note: Shaded parts in the original table imply factor structure stability.


Finally, Table 3.4.2 presents the reduced variables with their corresponding components and deficiency indicators. The first variable in each component is the surrogate variable, which can be used for further analysis.

Table 3.4.2: Extracted Components with Their Variable Compositions and Deficiency Indicators for the Rubric Data in Math 1.7 Across Departmental Examinations

Prelim Departmental Examination
Component                                Variable and Deficiency Indicator
1 Addition knowledge                     X16  Adding unlike terms
                                         X18  Incorrect application of DPMA
2 Set operations                         X4   Not able to identify element/s in the complement
                                         X8   Not able to identify element/s in the union
3 Incorrect concepts                     X25  Incorrect graphing in complex plane
                                         X26  Incorrect squaring in polynomials

Midterm Departmental Examination
Component                                Variable and Deficiency Indicator
1 Signed number operations               X12  Operations on integers
                                         X10  Carelessness
2 Division operation deficiency          X26  Identifying conjugate
                                         X23  Finding initial quotient
3 Mathematical expressions unmastered    X28  Expression on integer
                                         X29  Operations on radical expression
4 Exponential problems                   X7   Cancellation of unlike terms
                                         X5   Operations on exponents
5 Factoring problems                     X15  Use of grouping sign
                                         X3   Factoring sum of cubes/factoring trinomial

Final Departmental Examination
Component                                Variable and Deficiency Indicator
1 Simplification unmastered              X3   Answer not in simplest form
                                         X22  Finding values of x and y in an equation
2 Graphing problems                      X23  Identification of vertex
                                         X16  Plotting the points in the graph
                                         X20  Finding the intersection
3 No idea                                X21  No solution shown
                                         X7   Carelessness
4 Carelessness                           X10  Copying the given problem
                                         X8   Correct solution, wrong final answer
5 Factoring deficiency                   X5   Factoring
                                         X2   Formula
6 LCD unmastered                         X13  Identification of LCD
7 Unit knowledge                         X28  No units


Chapter 4

SUMMARY AND RECOMMENDATIONS

4.1 Summary of Findings

1. Comparison of EFA results across the three data formats, namely rubric, Likert, and Thurstone, shows that EFA is robust on the rubric format.

2. Comparison of the two types of factor rotation, orthogonal versus oblique, reveals that orthogonal rotation yields more consistent EFA results.

3. Validation of EFA on the rubric format by split-sampling demonstrates the stability of the factor structure.

4.2 Recommendations

Based on the findings of the study, the following are recommended for subsequent investigation:

1. Using the same data and quantification, apply EFA with common factor analysis (CFA) as the extraction method and EQUIMAX as the orthogonal rotation method.

2. Compare EFA results when using CFA with EQUIMAX orthogonal rotation versus CFA with OBLIMIN oblique rotation.

3. For EFA with CFA using EQUIMAX orthogonal rotation, verify the conjecture, observed for EFA with PCA and VARIMAX orthogonal rotation, that the number of components extracted is directly proportional to the percentage of variance explained.

4. Determine which factor rotation is most efficient for EFA when CFA is used as the extraction method.


REFERENCES CITED

[1] Rietveld, T., & van Hout, R. (1993). Statistical Techniques for the Study of Language and Language Behaviour. Berlin/New York: Mouton de Gruyter.

[2] Darlington, R. B. (2004). Factor Analysis. http://comp9.psych.cornell.edu/Darlington/factor.htm (accessed 08 November 2012).

[3] Habing, B. (2003). Exploratory Factor Analysis. http://www.stat.sc.edu/~habing/courses/530EFA.pdf (accessed 08 November 2012).

[4] Diamantopoulos, A., & Winklhofer, H. M. (2001). Index Construction with Formative Indicators: An Alternative to Scale Development. Journal of Marketing Research, 38(May), 269-277.

[5] Hair, J. F., Jr., & Black, W. C. (2004). Multivariate Data Analysis, pp. 90-151.

[6] Marsh, H. W., & Jackson, S. (1999). Flow Experience in Sport: Construct Validation of Multidimensional Hierarchical State and Trait Responses. Structural Equation Modeling, 6(4), 343-371.

[7] Maison, U. Correlations in Statistics. http://www.socialresearchmethods.net/kb/statcorr.php

[8] Codaste, I. B. (2010). On Some Tests for Normality: Numerical and Graphical Application, pp. 4-8.

[9] Landau, S., & Everitt, B. S. (2003). A Handbook of Statistical Analyses Using SPSS.

[10] Panik, M. J. (2005). Advanced Statistics from an Elementary Point of View.

[11] Miller and Freund's Probability and Statistics for Engineers.

[12] Data Quality Assessment: Statistical Methods for Practitioners.


APPENDICES

Appendix A:

Table 1: Variables and Their Description/Deficiency Indicator in the Prelim Departmental Examination

Variable   Deficiency Indicator
X1    No idea/no answer
X2    Careless in plotting elements in Venn diagram
X3    Incorrect plotting of elements in the Venn diagram
X4    Not able to identify element/s in the complement
X5    Not able to identify element/s in the intersection
X6    Not able to write proper notation
X7    Careless in handling signs (+/-)
X8    Not able to identify element/s in the union
X9    Not able to identify element/s in the set difference
X10   Missing some necessary step/s in the solution
X11   Not able to simplify combination of set operations
X12   Misinterpretation of problem
X13   Applying DPMA, not the technique
X14   Misinterpretation of an expression as equation
X15   Not able to simplify operations with complex numbers
X16   Adding unlike terms
X17   Not able to divide terms in long division with polynomials
X18   Incorrect application of DPMA
X19   Incorrect transposition
X20   Not able to simplify algebraic expression
X21   Not able to give the final answer but went through correct solution
X22   Not able to identify conjugate
X23   Not able to simplify the final answer
X24   Careless in writing terms
X25   Incorrect graphing in complex plane
X26   Incorrect squaring of polynomials


Table 2: Variables and Their Description/Deficiency Indicator in the Midterm Departmental Examination

Variable   Deficiency Indicator
X1    No idea/no answer
X2    Dividing terms/polynomials sum
X3    Factoring sum of cubes/factoring trinomial
X4    Assigning identification of LCD
X5    Operations on exponents
X6    Not simplifying final answer
X7    Cancellation of unlike terms
X8    Operations on unlike terms/adding terms of polynomials
X9    Solution not shown
X10   Carelessness
X11   Inclusion of the coefficient in the variable exponent
X12   Operations on integers
X13   Laws of exponents
X14   Misuse of notations
X15   Use of grouping signs
X16   Dividing exponential expressions with the same base
X17   Dividing exponential expressions
X18   Operations on fractions
X19   Combining unlike bases in an exponential expression
X20   Identifying coefficients on the dividend for synthetic division
X21   Subtracting the 2nd row from the 1st in synthetic division
X22   Wrong identification of divisor
X23   Finding initial quotient
X24   Expressing the divisor for synthetic division as (x - r)
X25   Identifying terms on the dividend in long division
X26   Identifying conjugate
X27   Multiplying sums of radicals of index 2
X28   Expression on integer
X29   Operations on radical expressions


Table 3: Variables and Their Description/Deficiency Indicator in the Final Departmental Examination

Variable   Deficiency Indicator
X1    No idea/no answer
X2    Formula
X3    Answer not in simplest form
X4    Identification of coefficient
X5    Factoring
X6    Operations on integers
X7    Carelessness
X8    Correct solution, wrong final answer
X9    Operations on polynomials
X10   Copying the given problem
X11   Operations on radical equations
X12   Not using the equality sign
X13   Identification of LCD
X14   Identification of replacement set
X15   Not simplifying final answer
X16   Plotting the points in the graph
X17   Use of method to apply
X18   No graph shown
X19   Solution set
X20   Finding the intersection
X21   No solution shown
X22   Finding values of x and y in an equation
X23   Identification of vertex
X24   Misuse of equation in finding the vertex
X25   Comprehension
X26   No solution in finding consecutive integers
X27   Use of formula
X28   No units
