Basics of Structural Equation Modeling
Dr. Sean P. Mackinnon
Virtually every model you've already run using the ordinary least squares approach (linear regression; based on sums of squares) can also be estimated with SEM
The difference is primarily in how the parameters and SEs are calculated (SEM uses maximum likelihood estimation instead of sums of squares)
First, let’s get used to the notation of SEM diagrams
Correlation Coefficient
[Diagram: Depression <--> Anxiety, .50]
Rectangles indicate observed variables
Double-headed arrows indicate covariances
(so if standardized variables are used, it's a Pearson r)
Linear Regression
[Diagram: Depression --> Anxiety, .50]
Single headed arrows are paths
In this example, depression is the IV and anxiety is the DV
IVs = exogenous variables (no arrows pointing at them)
DVs = endogenous variables (one or more arrows pointing at them)
Variances and Residual Variances
[Diagram: Depression --> Anxiety, .50, with variance and residual variance terms shown]
Exogenous variables also have a variance as a parameter
Endogenous variables have residual variance as a parameter (i.e., error; the portion of variance unexplained by model)
These are rarely drawn out explicitly in the diagrams, but worth remembering for later when we’re counting parameters and for more advanced applications.
Multiple Regression
[Diagram: Perfectionism, Depression, and SES predicting Anxiety (R² = .40); three paths and three predictor covariances shown: .26, -.11, .25, .09, .30, .01]
The correlations among the IVs are specified in SPSS too
You just don't get the output for them
R² values are often placed in the top-right corner of DVs
Moderation
[Diagram: Perfectionism, Stress, and the Perfectionism * Stress interaction predicting Depression (R² = .40); three paths and three predictor covariances shown: .26, -.11, .25, .09, .30, .01]
Moderation is specified the same way as multiple regression
The only difference is that one of the predictors is an interaction term (Perfectionism * Stress)
Mediation
[Diagram: Perfectionism --> Conflict (a-path), Conflict --> Depression (b-path), Perfectionism --> Depression (c′-path)]
Instead of a two-step process, it's done all in one single analysis
If you want to get the c-path, run one more linear regression without the conflict variable included
Usually you'd use bootstrapping to test the indirect effect (a*b) in SEM
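The indirect-effect bootstrap described above can be illustrated outside of SEM software. This is a minimal Python sketch with simulated (made-up) data, not the slides' R/lavaan workflow: fit the a-path and b-path by OLS in each bootstrap resample, and take percentiles of the a*b products.

```python
# Sketch: percentile bootstrap of the indirect effect a*b in a simple
# mediation model, using simulated data (true a ~ .5, true b ~ .4).
import numpy as np

rng = np.random.default_rng(0)
n = 200
perfectionism = rng.normal(size=n)
conflict = 0.5 * perfectionism + rng.normal(size=n)
depression = 0.4 * conflict + 0.1 * perfectionism + rng.normal(size=n)

def indirect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                   # slope of m on x (a-path)
    design = np.column_stack([np.ones(len(x)), m, x])
    b = np.linalg.lstsq(design, y, rcond=None)[0][1]  # slope of y on m, controlling x (b-path)
    return a * b

boot = []
for _ in range(500):
    idx = rng.integers(0, n, n)                  # resample cases with replacement
    boot.append(indirect(perfectionism[idx], conflict[idx], depression[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(lo, hi)  # 95% bootstrap CI for the indirect effect
```

If the CI excludes zero, the indirect effect is considered significant. SEM software automates this resampling over the full model.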
Independent t-test
[Diagram: Sex --> Anxiety, B = 1.25]
Sex is coded as 0 (women) or 1 (men)
Use unstandardized coefficients
The value of the intercept is the mean for women
The value of the intercept + slope is the mean for men
If the p-value for the slope is < .05, the means are different
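The intercept/slope logic above is easy to verify numerically. A minimal Python sketch with made-up anxiety scores (the slides use SEM software; this is just the arithmetic):

```python
# Sketch: a 0/1 dummy-coded predictor in OLS recovers the group means.
import numpy as np

anxiety_women = np.array([4.0, 5.0, 6.0])   # sex = 0
anxiety_men   = np.array([6.0, 7.0, 8.0])   # sex = 1

y = np.concatenate([anxiety_women, anxiety_men])
sex = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(y), sex])  # intercept column + dummy

# Unstandardized OLS coefficients
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

print(intercept)          # mean for women: 5.0
print(intercept + slope)  # mean for men: 7.0
```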
One-Way ANOVA (3 groups)
[Diagram: Treatment 1 (dummy) and Treatment 2 (dummy) predicting Anxiety]
Original variable: 1 = Control group; 2 = Treatment 1; 3 = Treatment 2
Treatment 1 (dummy): 1 = Treatment 1, 0 = other groups
Treatment 2 (dummy): 1 = Treatment 2, 0 = other groups
Similar to the t-test, you can get means for each group
This kind of dummy coding compares treatments to the control group
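The same check works for three groups. A Python sketch with made-up scores: with the dummy coding above, the intercept is the control-group mean and each dummy's slope is that treatment's difference from control.

```python
# Sketch: dummy coding a 3-level group variable (1=Control, 2=Tx1, 3=Tx2)
# so each slope compares a treatment to the control group.
import numpy as np

group = np.array([1, 1, 2, 2, 3, 3])
anxiety = np.array([8.0, 10.0, 5.0, 7.0, 3.0, 5.0])

tx1 = (group == 2).astype(float)
tx2 = (group == 3).astype(float)
X = np.column_stack([np.ones(len(group)), tx1, tx2])

b = np.linalg.lstsq(X, anxiety, rcond=None)[0]

print(b[0])         # control mean: 9.0
print(b[0] + b[1])  # Treatment 1 mean: 6.0
print(b[0] + b[2])  # Treatment 2 mean: 4.0
```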
SEM can also address more complicated questions
Path Analysis
Complex relationships between variables can be used to test theory
Mackinnon et al. (2011)
Confirmatory Factor Analysis
[Diagram: latent Negative Affect with indicators Anger, Shame, and Sadness]
Ovals represent latent variables
Paths are factor loadings in this diagram
Conceptually, this is like an EFA except you have an idea ahead of time about what items should comprise the latent variable
(and we can test hypotheses!)
Structural Equation Modeling
Like path analysis, except it looks at relationships among latent variables
Useful, because it accounts for the unreliability of measurement, so it offers less biased parameter estimates
Also lets you test virtually any theory you might have
Mackinnon et al. (2012)
Rules for Building Models
• Every path, correlation, and variance is a parameter
• The number of parameters cannot exceed the number of data points
– If it does, your model is under-identified and can't be estimated using SEM
• Data points are calculated by: p(p + 1) / 2
– Where p = the number of observed variables
– Ex. with 3 variables: 3(4) / 2 = 6
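The counting rule above is just arithmetic; a tiny Python sketch makes it concrete (the formula counts the unique variances and covariances among p observed variables):

```python
# Sketch: data points available for a model with p observed variables,
# per the p(p + 1) / 2 rule, plus a simple identification check.
def data_points(p):
    return p * (p + 1) // 2

def is_identified(p, n_parameters):
    # Necessary (not sufficient) condition: parameters <= data points
    return n_parameters <= data_points(p)

print(data_points(3))            # 6
print(data_points(4))            # 10
print(is_identified(3, 7))       # False: the 7-parameter CFA can't be estimated
```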
A just-identified or “saturated” model
[Diagram: Perfectionism, Anxiety, Depression, and SES, all intercorrelated]
In this case, 4 variables: 4(5) / 2 = 10 possible data points
Ten parameters: 4 variances + 6 covariances
So really, it's a model where everything is related to everything else! Not very parsimonious
Another just-identified model
[Diagram: Perfectionism, Depression, and Anxiety, with two paths and one covariance]
In this case, 3 variables: 3(4) / 2 = 6 possible data points
Six parameters: 3 variances, 1 covariance, 2 paths
Note that the variances for endogenous variables will be residual variances (the parts unexplained by the predictors)
More Parsimonious Models
Just identified models are interesting, but often not parsimonious (i.e., everything is related to everything)
Are there paths or covariances in your model that you can remove, but still end up with a well-fitting model?
Path analysis and SEM can answer these questions. When we fit models with fewer parameters than data points, we can see if the model is still a good “fit” with some paths omitted
An identified mediation model
[Diagram: Perfectionism --> Conflict (a-path), Conflict --> Depression (b-path); the c′-path from Perfectionism to Depression is fixed to zero]
In this case, 3 variables: 3(4) / 2 = 6 possible data points
Five parameters: 3 variances, 2 paths (the path fixed to zero is not estimated)
Can we remove the c′ path from this mediation model? This model is more parsimonious, so it would be preferred if it fits. Fit indices judge the adequacy of this model.
Model Fit
Fit refers to the ability of a model to reproduce the data (i.e., usually the variance-covariance matrix).

Predicted by model:
                 1     2     3
1. Perfect      2.6
2. Conflict     .40   5.2
3. Depression   0     .32   3.5

Actually observed in your data:
                 1     2     3
1. Perfect      2.5
2. Conflict     .39   5.3
3. Depression   .03   .40   3.1
So, in SEM we compare these matrices (model-created vs. actually observed in your data), and see how discrepant they are. If they are basically identical, the model “fits well”
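That comparison can be sketched directly. A Python illustration using the two (made-up) 3x3 matrices from the slide; this just shows the idea of a residual matrix, not the actual ML discrepancy function SEM minimizes:

```python
# Sketch: residuals between a model-implied and an observed covariance matrix.
import numpy as np

predicted = np.array([[2.6, .40, 0.0],
                      [.40, 5.2, .32],
                      [0.0, .32, 3.5]])
observed  = np.array([[2.5, .39, .03],
                      [.39, 5.3, .40],
                      [.03, .40, 3.1]])

residuals = observed - predicted
print(np.abs(residuals).max())  # largest absolute discrepancy: 0.4
```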
Model Fit: χ²
We condense these matrix comparisons into a SINGLE NUMBER:
Chi-square (χ²), with df = (data points) - (estimated parameters)
It tests the null hypothesis that the model fits the data well (i.e., the model covariance matrix is very similar to the observed covariance matrix)
Thus, non-significant chi-squares are better!
Problems with χ2
Simulation studies show that the chi-square is TOO sensitive: it rejects models far more often than it should.
More importantly, it is tied to sample size. As sample size increases, the likelihood of a significant chi-square increases.
Thus, there is a very high Type I error rate (rejecting well-fitting models), and it gets worse as sample size increases. So we need alternative fit measures that account for this.
Incremental Fit Indices
Incremental fit indices compare your model to the fit of the baseline or “null” model:
[Diagram: Perfectionism, Conflict, and Depression with all paths and covariances fixed to zero]
The null model fixes all covariances and paths to zero, so every variable is unrelated
Technically it is the most parsimonious model, but not a useful one
Incremental Fit Indices
Comparative Fit Index (CFI)
CFI = [d(Null Model) - d(Proposed Model)] / d(Null Model)
where d = χ² - df, and df are the degrees of freedom of the model. If the index is greater than one it is set to one, and if less than zero it is set to zero.
Values range from 0 (no fit) to 1.0 (perfect fit)
http://davidakenny.net/cm/fit.htm
Tucker-Lewis Index
The Tucker-Lewis Index (TLI) assigns a penalty for model complexity (it prefers more parsimonious models).
TLI = [χ²/df(Null Model) - χ²/df(Proposed Model)] / [χ²/df(Null Model) - 1]
Values range from 0 (no fit) to 1.0 (perfect fit)
The TLI is more conservative and will almost always reject more models than the CFI
http://davidakenny.net/cm/fit.htm
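Both incremental indices are simple functions of the null and proposed models' chi-squares. A Python sketch with made-up chi-square values (χ² = 200, df = 6 for the null model; χ² = 8, df = 4 for the proposed model):

```python
# Sketch: CFI and TLI computed from null-model and proposed-model chi-squares.
def cfi(chi2_null, df_null, chi2_model, df_model):
    d_null = chi2_null - df_null
    d_model = chi2_model - df_model
    value = (d_null - d_model) / d_null
    return min(max(value, 0.0), 1.0)  # clamp to the [0, 1] range

def tli(chi2_null, df_null, chi2_model, df_model):
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)

print(round(cfi(200, 6, 8, 4), 3))  # 0.979
print(round(tli(200, 6, 8, 4), 3))  # 0.969
```

Note the TLI value here is slightly lower than the CFI, consistent with it being the more conservative index.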
Parsimonious Indices
Root Mean Square Error of Approximation (RMSEA)
Similar to the others, except that it doesn't actually compare to the null model, and (like the TLI) it penalizes more complex models:
RMSEA = √(χ² - df) / √[df(N - 1)]
You can also calculate a 90% CI for the RMSEA
http://davidakenny.net/cm/fit.htm
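The RMSEA point estimate follows directly from the formula above. A Python sketch with the same made-up values (χ² = 8, df = 4, N = 200); the 90% CI requires the noncentral chi-square distribution, so it's omitted here:

```python
# Sketch: RMSEA point estimate, sqrt(chi2 - df) / sqrt(df * (N - 1)).
import math

def rmsea(chi2, df, n):
    # When chi2 < df, the numerator is set to zero (RMSEA = 0)
    return math.sqrt(max(chi2 - df, 0.0)) / math.sqrt(df * (n - 1))

print(round(rmsea(8, 4, 200), 3))  # 0.071
print(rmsea(3, 4, 200))            # 0.0 (chi2 below df)
```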
Absolute Indices
Standardized Root Mean Square Residual (SRMR)
The formula is kind of complicated, so conceptual understanding is better. This one uses the residuals.
The SRMR is an absolute measure of fit and is defined as the standardized difference between the observed correlation matrix and the predicted correlation matrix.
A value of 0 = perfect fit (i.e., residuals of zero)
The SRMR has no penalty for model complexity.
http://davidakenny.net/cm/fit.htm
Fit Indices Cut-offs
• χ²
– ideally non-significant, p > .01 or even p > .001
• CFI and TLI
– Ideally greater than .95
• RMSEA
– Ideally less than .06
– Ideally, the 90% CI for RMSEA doesn't contain .08 or higher
• SRMR
– Ideally less than .08
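The cut-offs above can be bundled into a simple screening check. A Python sketch (the thresholds are the ones listed on this slide; in practice, fit judgments weigh the indices together rather than as a strict pass/fail):

```python
# Sketch: screening a set of fit indices against the slide's cut-offs.
def fit_ok(cfi, tli, rmsea, srmr):
    return cfi > .95 and tli > .95 and rmsea < .06 and srmr < .08

print(fit_ok(cfi=.97, tli=.96, rmsea=.05, srmr=.06))  # True
print(fit_ok(cfi=.90, tli=.88, rmsea=.09, srmr=.10))  # False
```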
Citations for papers:
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). New York, NY: Guilford.
Hooper, D., Coughlan, J., & Mullen, M. (2008). Structural equation modelling: guidelines for determining model fit. Electronic Journal of Business Research Methods, 6, 53-60.
A problem with latent variables
In this case, 3 observed variables: 3(4) / 2 = 6 possible data points
Seven parameters: 3 residual variances for the observed variables, 1 variance for the LATENT variable, 3 paths (factor loadings)
This model can't be estimated!
Also, the latent variable has no metric (what does a "1" on this latent variable even mean?)
[Diagram: latent Negative Affect with indicators Anger, Shame, and Sadness]
A problem with latent variables
A solution: fix the variance of the latent variable to 1. This frees up one parameter.
The latent variable becomes standardized, with a mean of zero and a standard deviation of 1.
(Actually, all along we've been constraining the means to be zero to simplify the math, a "saturated mean structure". Usually we don't care about the means for our theory, so they aren't explicitly modeled.)
[Diagram: the variance of latent Negative Affect is constrained to 1.0; indicators Anger, Shame, Sadness]
A problem with latent variables
An alternate solution: fix one of the factor loadings (typically the one expected to have the largest loading) to 1. This also frees up one parameter.
The latent variable will have the same variance as the observed variable whose loading was constrained to 1.0
Either solution works, and the choice won't affect the fit indices
[Diagram: one factor loading of Negative Affect (e.g., Anger's) is constrained to 1.0; indicators Anger, Shame, Sadness]
Let’s try a sample analysis in R
A confirmatory factor analysis with 12 items and 1 latent variable (general self-esteem).
Install packages you’ll need
#For converting an SPSS file for R
install.packages("foreign", dependencies = TRUE)

#For running structural equation modeling
install.packages("lavaan", dependencies = TRUE)
You only need to do this once ever (not every time you load R)
Get the SPSS file into R
#Load the foreign package
library(foreign)

#Set working directory to where the dataset is located. This is also where you'll save files.
#I'd create a new folder for this somewhere on your computer
setwd("C:/Users/Sean Mackinnon/Desktop/R Analyses")

#Read the datafile into R. This datafile will henceforth be called "lab9data" when working in R
lab9data <- read.spss("A4.selfesteem.sav", use.value.labels = TRUE, to.data.frame = TRUE)
Specify the model
#Load the lavaan package (only need to do this once per time you open R)
library(lavaan)

#Specify the model you're testing and call it "se.g.model1" (you could call it anything)
#By default, lavaan will constrain the first indicator's loading to be 1.0
se.g.model1 <- 'se_g =~ se3 + se16r + se29 + se42r + se55 + se68 + se81r + se94 + se107r + se120r + se131 + se135r'
Fit the model
#Fit the model and call the fitted object "fit" (or anything you want)
#estimator = "MLR" is a robust estimator. I recommend always using this instead of the default.
#missing = "ML" handles missing data using a full information maximum likelihood method
#fixed.x = TRUE is optional. I include it because I want results to be similar to Mplus,
#which is another program I use often. See the lavaan documentation for more info.
fit <- cfa(se.g.model1, data = lab9data, estimator = "MLR", missing = "ML", fixed.x = TRUE)
Request Output
#request the summary statistics to interpret
#In this case, I request fit indices and standardized values in addition to default output
summary(fit, fit.measures = TRUE, standardized = TRUE)