Lab 02 — Path analysis & mediation (observed variables)
Goals
In this lab you will:
- Fit a simple mediation model as a system of regressions in lavaan
- Compute direct, indirect, and total effects via defined parameters
- Compare delta-method vs bootstrap inference for the indirect effect
- Test a theoretically motivated constraint (full mediation: \((c' = 0)\))
- Extend to covariates and parallel mediators
- (Optional, advanced) See a concrete example of equivalent just-identified models
Setup
Show code
library(lavaan)
Part A — Simulate a mediation dataset (so we know the “truth”)
We start with a controlled world:
\[ \begin{aligned} M &= aX + \varepsilon_M \\ Y &= c'X + bM + \varepsilon_Y \end{aligned} \qquad \text{Indirect}=ab,\ \text{Total}=c'+ab \]
We generate data with:
- \((a = 0.50)\), \((b = 0.60)\), \((c' = 0.10)\)
- \((\varepsilon_M, \varepsilon_Y)\) independent, mean 0
Show code
set.seed(1234)  # seed assumed for reproducibility (Parts B/C use 1235, 1236)
N <- 400
a <- 0.50
b <- 0.60
cprime <- 0.10
X <- rnorm(N, 0, 1)
eM <- rnorm(N, 0, 1)
eY <- rnorm(N, 0, 1)
M <- a*X + eM
Y <- cprime*X + b*M + eY
dat <- data.frame(X, M, Y)
summary(dat)
       X                  M                  Y
Min. :-3.39606 Min. :-3.3910 Min. :-4.0705
1st Qu.:-0.64903 1st Qu.:-0.7711 1st Qu.:-0.8969
Median :-0.02071 Median :-0.0589 Median :-0.0280
Mean : 0.00775 Mean :-0.0580 Mean :-0.0693
3rd Qu.: 0.65036 3rd Qu.: 0.6764 3rd Qu.: 0.7699
Max. : 3.04377 Max. : 3.5447 Max. : 3.1250
Exercise 1 — Fit mediation in lavaan (delta method)
1a) Write the model
- Label the paths \((a)\), \((b)\), \((c')\)
- Define ind := a*b and tot := cprime + a*b
Show code
mod_med <- '
M ~ a*X
Y ~ cprime*X + b*M
ind := a*b
tot := cprime + (a*b)
'
1b) Fit + inspect
Show code
fit_med <- sem(mod_med, data = dat, meanstructure = TRUE)
summary(fit_med, standardized = TRUE)
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 7
Number of observations 400
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
M ~
X (a) 0.463 0.049 9.521 0.000 0.463 0.430
Y ~
X (cprm) -0.010 0.052 -0.185 0.853 -0.010 -0.008
M (b) 0.659 0.048 13.604 0.000 0.659 0.603
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M -0.062 0.049 -1.248 0.212 -0.062 -0.056
.Y -0.031 0.048 -0.648 0.517 -0.031 -0.026
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M 0.973 0.069 14.142 0.000 0.973 0.815
.Y 0.914 0.065 14.142 0.000 0.914 0.641
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind 0.305 0.039 7.800 0.000 0.305 0.259
tot 0.296 0.057 5.185 0.000 0.296 0.251
1c) Extract only the pieces you need
Show code
pe <- parameterEstimates(fit_med, standardized = TRUE)
pe[pe$op %in% c("~",":="), c("lhs","op","rhs","label","est","se","z","pvalue","std.all")]
   lhs op rhs label est se z pvalue std.all
1 M ~ X a 0.463 0.049 9.521 0.000 0.430
2 Y ~ X cprime -0.010 0.052 -0.185 0.853 -0.008
3 Y ~ M b 0.659 0.048 13.604 0.000 0.603
10 ind := a*b ind 0.305 0.039 7.800 0.000 0.259
11 tot := cprime+(a*b) tot 0.296 0.057 5.185 0.000 0.251
Questions (answer in words)
- Are the estimates close to the generating values \((a=0.50)\), \((b=0.60)\), \((c'=0.10)\)?
- What is your estimate of the indirect effect \((ab)\)? Is it close to \((0.30)\)?
- What does lavaan use (by default) to get the SE for ind := a*b?
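For reference, the default SE for a defined parameter comes from the (first-order) delta method. For a product of two estimates it is

\[ \mathrm{SE}(\widehat{ab}) \approx \sqrt{\hat b^2\,\mathrm{Var}(\hat a) + \hat a^2\,\mathrm{Var}(\hat b) + 2\hat a\hat b\,\mathrm{Cov}(\hat a,\hat b)} \]

You can reproduce it by hand from the fitted object (a sketch; assumes fit_med from 1b, and that coef()/vcov() name labeled parameters by their labels a and b, which is lavaan's behavior for labeled paths):

```r
# Delta-method SE for ind = a*b, computed by hand from fit_med
V  <- vcov(fit_med)           # covariance matrix of the free parameters
ah <- coef(fit_med)["a"]
bh <- coef(fit_med)["b"]
sqrt(bh^2 * V["a", "a"] + ah^2 * V["b", "b"] + 2 * ah * bh * V["a", "b"])
# should match the Std.Err lavaan reports for ind (about 0.039 here)
```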
Exercise 2 — Bootstrap inference for the indirect effect
Indirect effects are products; their sampling distribution is often skewed. Let’s bootstrap.
Show code
fit_med_boot <- sem(
mod_med, data = dat,
se = "bootstrap", bootstrap = 2000,
meanstructure = TRUE
)
peb <- parameterEstimates(fit_med_boot, ci = TRUE, level = .95,
standardized = TRUE)
peb[peb$op %in% c("~",":="), c("lhs","op","rhs","label","est","se","ci.lower","ci.upper","std.all")]
   lhs op rhs label est se ci.lower ci.upper std.all
1 M ~ X a 0.463 0.047 0.376 0.561 0.430
2 Y ~ X cprime -0.010 0.058 -0.125 0.105 -0.008
3 Y ~ M b 0.659 0.048 0.563 0.753 0.603
10 ind := a*b ind 0.305 0.039 0.237 0.387 0.259
11 tot := cprime+(a*b) tot 0.296 0.060 0.176 0.414 0.251
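From the same bootstrapped fit you can also request bias-corrected percentile intervals; boot.ci.type is an argument of parameterEstimates() (the default is "perc", with "bca.simple" among the alternatives):

```r
# Bias-corrected (simple) bootstrap CI for the indirect effect
peb2 <- parameterEstimates(fit_med_boot, level = .95, boot.ci.type = "bca.simple")
peb2[peb2$label == "ind", c("label", "est", "ci.lower", "ci.upper")]
```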
Questions
- Compare the SE for ind under the delta method (Exercise 1) vs the bootstrap (Exercise 2). Are they similar?
- Is the bootstrap CI symmetric around the estimate? Should you expect symmetry?
- In real datasets, when would you strongly prefer bootstrap for indirect effects?
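The skewness claim above is easy to check without lavaan: draw \((\hat a)\) and \((\hat b)\) from their approximate sampling distributions and look at the product. The means and SDs below are illustrative values near the Exercise 1 estimates, not from the lab:

```r
# Monte Carlo look at the sampling distribution of a product of estimates
set.seed(99)
ab_strong <- rnorm(1e4, 0.46, 0.05) * rnorm(1e4, 0.66, 0.05)  # strong paths: near-normal
ab_weak   <- rnorm(1e4, 0.10, 0.10) * rnorm(1e4, 0.10, 0.10)  # weak paths: clearly skewed
hist(ab_strong, breaks = 50, main = "a*b, strong paths")
hist(ab_weak,   breaks = 50, main = "a*b, weak paths")
```

The weaker the paths relative to their SEs, the worse the normal approximation behind the delta method tends to be.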
Exercise 3 — Test “full mediation” \((c' = 0)\)
Here we compare:
- Partial mediation: \((c')\) free (your current model)
- Full mediation: constrain \((c' = 0)\)
Show code
mod_full <- '
M ~ a*X
Y ~ 0*X + b*M # cprime fixed to 0
ind := a*b
tot := (a*b)
'
fit_full <- sem(mod_full, data = dat, meanstructure = TRUE)
Compare nested models (LRT)
Show code
anova(fit_med, fit_full)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_med 0 2237 2265 0.00
fit_full 1 2235 2259 0.03 0.0343 0 1 0.85
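The same hypothesis can also be tested without fitting the constrained model, via a Wald test on fit_med (lavTestWald() ships with lavaan and is asymptotically equivalent to the LRT above):

```r
# Wald test of the full-mediation constraint on the unconstrained fit
lavTestWald(fit_med, constraints = "cprime == 0")
```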
Interpretation questions
- What does a significant χ² difference imply here?
- In this simulated world (true \((c'=0.10)\)), should you expect full mediation to be rejected as \((N)\) grows? Why?
- In your substantive area, what theoretical arguments would justify fixing \((c'=0)\) (if any)?
Part B — Add a covariate (confounding intuition)
Now we simulate a covariate \((Z)\) that affects both \((M)\) and \((Y)\). Think of \((Z)\) as a plausible confounder.
\[ \begin{aligned} M &= aX + dZ + \varepsilon_M\\ Y &= c'X + bM + eZ + \varepsilon_Y \end{aligned} \]
Show code
set.seed(1235)
Z <- rnorm(N, 0, 1)
d <- 0.60
e <- 0.50
M2 <- a*X + d*Z + rnorm(N, 0, 1)
Y2 <- cprime*X + b*M2 + e*Z + rnorm(N, 0, 1)
dat2 <- data.frame(X, Z, M = M2, Y = Y2)
Exercise 4 — Mediation with covariates
4a) Fit without Z (misspecified on purpose)
Show code
fit_noZ <- sem(mod_med, data = dat2, meanstructure = TRUE)
parameterEstimates(fit_noZ)[parameterEstimates(fit_noZ)$op %in% c("~",":="),
c("lhs","op","rhs","label","est","se","pvalue")] lhs op rhs label est se pvalue
1 M ~ X a 0.513 0.057 0.000
2 Y ~ X cprime -0.020 0.063 0.758
3 Y ~ M b 0.846 0.050 0.000
10 ind := a*b ind 0.434 0.055 0.000
11 tot := cprime+(a*b) tot 0.415 0.075 0.000
4b) Fit with Z included
Show code
mod_withZ <- '
M ~ a*X + d*Z
Y ~ cprime*X + b*M + e*Z
ind := a*b
tot := cprime + (a*b)
'
fit_withZ <- sem(mod_withZ, data = dat2, meanstructure = TRUE)
parameterEstimates(fit_withZ)[parameterEstimates(fit_withZ)$op %in% c("~",":="),
c("lhs","op","rhs","label","est","se","pvalue")] lhs op rhs label est se pvalue
1 M ~ X a 0.500 0.049 0.0
2 M ~ Z d 0.578 0.049 0.0
3 Y ~ X cprime 0.097 0.059 0.1
4 Y ~ M b 0.594 0.053 0.0
5 Y ~ Z e 0.558 0.060 0.0
15 ind := a*b ind 0.297 0.040 0.0
16 tot := cprime+(a*b) tot 0.394 0.060 0.0
Questions
- Compare \((b)\) and \((c')\) from fit_noZ vs fit_withZ. Which one looks more biased, and why?
- In real research, why is “include covariates” not a magic causal fix?
- How would you argue for including a covariate in a preregistered SEM?
Part C — Parallel mediators (still observed variables)
Now simulate two mediators \((M_1)\) and \((M_2)\) in parallel:
\[ \begin{aligned} M_1 &= a_1X + \varepsilon_1\\ M_2 &= a_2X + \varepsilon_2\\ Y &= c'X + b_1M_1 + b_2M_2 + \varepsilon_Y \end{aligned} \]
Show code
set.seed(1236)
a1 <- 0.50; b1 <- 0.60
a2 <- 0.30; b2 <- 0.40
cprime <- 0.10
X <- rnorm(N)
M1 <- a1*X + rnorm(N)
M2 <- a2*X + rnorm(N)
Y <- cprime*X + b1*M1 + b2*M2 + rnorm(N)
dat3 <- data.frame(X, M1, M2, Y)
Exercise 5 — Fit parallel mediation + compare indirect paths
Show code
mod_par <- '
M1 ~ a1*X
M2 ~ a2*X
Y ~ cprime*X + b1*M1 + b2*M2
M1 ~~ M2 # allow mediator covariance (often realistic)
ind1 := a1*b1
ind2 := a2*b2
indT := ind1 + ind2
tot := cprime + indT
'
fit_par <- sem(mod_par, data = dat3, se = "bootstrap", bootstrap = 2000,
meanstructure = TRUE)
pe3 <- parameterEstimates(fit_par, ci = TRUE, standardized = TRUE)
pe3[pe3$op %in% c("~",":="), c("lhs","op","rhs","label","est","ci.lower","ci.upper","std.all")]
   lhs op rhs label est ci.lower ci.upper std.all
1 M1 ~ X a1 0.517 0.421 0.618 0.447
2 M2 ~ X a2 0.303 0.193 0.406 0.274
3 Y ~ X cprime 0.140 0.024 0.267 0.103
4 Y ~ M1 b1 0.556 0.456 0.653 0.473
5 Y ~ M2 b2 0.343 0.241 0.444 0.279
15 ind1 := a1*b1 ind1 0.288 0.218 0.364 0.212
16 ind2 := a2*b2 ind2 0.104 0.060 0.155 0.076
17 indT := ind1+ind2 indT 0.392 0.307 0.480 0.288
18 tot := cprime+indT tot 0.531 0.416 0.654 0.391
Questions
- Which indirect effect is larger, ind1 or ind2? Does that match the generating values?
- Does allowing M1 ~~ M2 change \((b_1)\) and \((b_2)\) much? Why might it in real data?
- What would you need (design/assumptions) to interpret the two mediators causally?
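A natural follow-up is whether the two specific indirect effects differ. That is just one more defined parameter (the label diff12 is chosen here for illustration, not part of the lab):

```r
# Contrast of the two specific indirect effects, with bootstrap CI
mod_par_diff <- paste(mod_par, "diff12 := ind1 - ind2", sep = "\n")
fit_par_diff <- sem(mod_par_diff, data = dat3, se = "bootstrap", bootstrap = 2000,
                    meanstructure = TRUE)
pe_diff <- parameterEstimates(fit_par_diff, ci = TRUE)
pe_diff[pe_diff$label == "diff12", c("label", "est", "ci.lower", "ci.upper")]
```

If the CI for diff12 excludes 0, the two mediation paths carry reliably different amounts of the effect.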
Optional advanced exercise — Equivalent just-identified models
With only \((X,M,Y)\) and a mediation structure, the model is often just-identified \((df=0)\).
That means global fit cannot distinguish among different directional stories.
Fit an alternative model where Y predicts M (reversing the middle relation), still allowing a direct X → Y path:
\[ \begin{aligned} Y &= i_Y + \alpha X + \varepsilon_Y\\ M &= i_M + \gamma X + \beta Y + \varepsilon_M \end{aligned} \]
Show code
mod_alt <- '
Y ~ alpha*X
M ~ gamma*X + beta*Y
ind_alt := alpha*beta
'
fit_alt <- sem(mod_alt, data = dat, meanstructure = TRUE)
c(
df_med = fitMeasures(fit_med, "df"),
df_alt = fitMeasures(fit_alt, "df"),
chisq_med = fitMeasures(fit_med, "chisq"),
chisq_alt = fitMeasures(fit_alt, "chisq")
)
      df_med.df       df_alt.df chisq_med.chisq chisq_alt.chisq
              0               0               0               0
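One way to make the equivalence concrete: both just-identified models reproduce the sample covariance matrix exactly, so their model-implied covariance matrices coincide (fitted() on a lavaan fit returns the implied moments; the explicit reordering below guards against different variable orders in the two fits):

```r
# Both df = 0 models imply the same (sample) covariance matrix
v <- c("X", "M", "Y")
all.equal(fitted(fit_med)$cov[v, v], fitted(fit_alt)$cov[v, v])
# should be TRUE: global fit cannot tell these stories apart
```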
Questions
- Are both models \((df=0)\)? What does that imply about χ² and other global fit indices?
- If both fit “perfectly”, how would you choose between them scientifically?
- What kinds of data/design would help (time, experiments, instruments, measurement)?
Wrap-up
Take-home (lab)
- Mediation is a system of equations; the indirect effect is a defined parameter \((ab)\)
- For \((ab)\), bootstrap CIs are often more trustworthy than normal-theory tests
- Testing “full mediation” is a constraint test—not a goal by itself
- Covariates help, but causal claims still require assumptions and design