Lab 02 — Path analysis & mediation (observed variables)

Author

Tommaso Feraco

Goals

In this lab you will:

  • Fit a simple mediation model as a system of regressions in lavaan
  • Compute direct, indirect, and total effects via defined parameters
  • Compare delta-method vs bootstrap inference for the indirect effect
  • Test a theoretically motivated constraint (full mediation: \((c' = 0)\))
  • Extend to covariates and parallel mediators
  • (Optional, advanced) See a concrete example of equivalent just-identified models

Setup

library(lavaan)

Part A — Simulate a mediation dataset (so we know the “truth”)

We start with a controlled world:

\[ \begin{aligned} M &= aX + \varepsilon_M \\ Y &= c'X + bM + \varepsilon_Y \end{aligned} \qquad \text{Indirect}=ab,\ \text{Total}=c'+ab \]

We generate data with:

  • \((a = 0.50)\), \((b = 0.60)\), \((c' = 0.10)\)
  • \((\varepsilon_M, \varepsilon_Y)\) independent, mean 0
Show code
set.seed(1234)  # any fixed seed makes the simulation reproducible

N <- 400

a <- 0.50
b <- 0.60
cprime <- 0.10

X <- rnorm(N, 0, 1)
eM <- rnorm(N, 0, 1)
eY <- rnorm(N, 0, 1)

M <- a*X + eM
Y <- cprime*X + b*M + eY

dat <- data.frame(X, M, Y)
summary(dat)
       X                  M                 Y          
 Min.   :-3.39606   Min.   :-3.3910   Min.   :-4.0705  
 1st Qu.:-0.64903   1st Qu.:-0.7711   1st Qu.:-0.8969  
 Median :-0.02071   Median :-0.0589   Median :-0.0280  
 Mean   : 0.00775   Mean   :-0.0580   Mean   :-0.0693  
 3rd Qu.: 0.65036   3rd Qu.: 0.6764   3rd Qu.: 0.7699  
 Max.   : 3.04377   Max.   : 3.5447   Max.   : 3.1250  

Using simulated data is not “making it easier”: it lets you verify whether estimation and inference behave as theory predicts.


Exercise 1 — Fit mediation in lavaan (delta method)

1a) Write the model

  • Label the paths \((a)\), \((b)\), \((c')\)
  • Define ind := a*b and tot := cprime + a*b
Show code
mod_med <- '
  M ~ a*X
  Y ~ cprime*X + b*M

  ind := a*b
  tot := cprime + (a*b)
'

1b) Fit + inspect

Show code
fit_med <- sem(mod_med, data = dat, meanstructure = TRUE)
summary(fit_med, standardized = TRUE)
lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         7

  Number of observations                           400

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  M ~                                                                   
    X          (a)    0.463    0.049    9.521    0.000    0.463    0.430
  Y ~                                                                   
    X       (cprm)   -0.010    0.052   -0.185    0.853   -0.010   -0.008
    M          (b)    0.659    0.048   13.604    0.000    0.659    0.603

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .M                -0.062    0.049   -1.248    0.212   -0.062   -0.056
   .Y                -0.031    0.048   -0.648    0.517   -0.031   -0.026

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .M                 0.973    0.069   14.142    0.000    0.973    0.815
   .Y                 0.914    0.065   14.142    0.000    0.914    0.641

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ind               0.305    0.039    7.800    0.000    0.305    0.259
    tot               0.296    0.057    5.185    0.000    0.296    0.251

1c) Extract only the pieces you need

Show code
pe <- parameterEstimates(fit_med, standardized = TRUE)
pe[pe$op %in% c("~",":="), c("lhs","op","rhs","label","est","se","z","pvalue","std.all")]
   lhs op          rhs  label    est    se      z pvalue std.all
1    M  ~            X      a  0.463 0.049  9.521  0.000   0.430
2    Y  ~            X cprime -0.010 0.052 -0.185  0.853  -0.008
3    Y  ~            M      b  0.659 0.048 13.604  0.000   0.603
10 ind :=          a*b    ind  0.305 0.039  7.800  0.000   0.259
11 tot := cprime+(a*b)    tot  0.296 0.057  5.185  0.000   0.251

Questions (answer in words)

  1. Are the estimates close to the generating values \((a=0.50)\), \((b=0.60)\), \((c'=0.10)\)?
  2. What is your estimate of the indirect effect \((ab)\)? Is it close to \((0.30)\)?
  3. What does lavaan use (by default) to get the SE for ind := a*b?
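For question 3: the default SE for a defined parameter comes from the (multivariate) delta method. As a check, you can recompute it by hand from the labelled estimates and their covariance matrix. A minimal sketch, assuming the `fit_med` object from 1b:

```r
# Delta method for a product: Var(a*b) ≈ b^2*Var(a) + a^2*Var(b) + 2*a*b*Cov(a,b)
est <- coef(fit_med)   # labelled free parameters ("a", "b", "cprime", ...)
V   <- vcov(fit_med)   # estimated covariance matrix of those parameters

a_hat <- est["a"]
b_hat <- est["b"]
se_ind <- sqrt(b_hat^2 * V["a", "a"] + a_hat^2 * V["b", "b"] +
               2 * a_hat * b_hat * V["a", "b"])
se_ind  # should match the Std.Err reported for `ind`
```

If the hand-computed value matches the `Std.Err` of `ind` in the summary, you have confirmed what lavaan does by default.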

Exercise 2 — Bootstrap inference for the indirect effect

Indirect effects are products; their sampling distribution is often skewed. Let’s bootstrap.

Show code
fit_med_boot <- sem(
  mod_med, data = dat,
  se = "bootstrap", bootstrap = 2000,
  meanstructure = TRUE
)

peb <- parameterEstimates(fit_med_boot, ci = TRUE, level = .95,
                          standardized = TRUE)
peb[peb$op %in% c("~",":="), c("lhs","op","rhs","label","est","se","ci.lower","ci.upper","std.all")]
   lhs op          rhs  label    est    se ci.lower ci.upper std.all
1    M  ~            X      a  0.463 0.047    0.376    0.561   0.430
2    Y  ~            X cprime -0.010 0.058   -0.125    0.105  -0.008
3    Y  ~            M      b  0.659 0.048    0.563    0.753   0.603
10 ind :=          a*b    ind  0.305 0.039    0.237    0.387   0.259
11 tot := cprime+(a*b)    tot  0.296 0.060    0.176    0.414   0.251

Questions

  1. Compare the SE for ind under delta method (Exercise 1) vs bootstrap (Exercise 2). Are they similar?
  2. Is the bootstrap CI symmetric around the estimate? Should you expect symmetry?
  3. In real datasets, when would you strongly prefer bootstrap for indirect effects?
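Related to question 2: `parameterEstimates()` can return different bootstrap interval types via its `boot.ci.type` argument (percentile intervals need not be symmetric; normal-approximation intervals are symmetric by construction). A sketch, assuming the `fit_med_boot` object from above:

```r
# Percentile vs normal-approximation bootstrap CIs for the indirect effect
ci_perc <- parameterEstimates(fit_med_boot, boot.ci.type = "perc")
ci_norm <- parameterEstimates(fit_med_boot, boot.ci.type = "norm")

rbind(
  perc = ci_perc[ci_perc$label == "ind", c("ci.lower", "ci.upper")],
  norm = ci_norm[ci_norm$label == "ind", c("ci.lower", "ci.upper")]
)
```

Comparing the two rows makes the (a)symmetry of the indirect-effect interval concrete.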

Exercise 3 — Test “full mediation” \((c' = 0)\)

Here we compare:

  • Partial mediation: \((c')\) free (your current model)
  • Full mediation: constrain \((c' = 0)\)
Show code
mod_full <- '
  M ~ a*X
  Y ~ 0*X + b*M   # cprime fixed to 0

  ind := a*b
  tot := (a*b)
'
fit_full <- sem(mod_full, data = dat, meanstructure = TRUE)

Compare nested models (LRT)

Show code
anova(fit_med, fit_full)

Chi-Squared Difference Test

         Df  AIC  BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_med   0 2237 2265  0.00                                    
fit_full  1 2235 2259  0.03     0.0343     0       1       0.85

Interpretation questions

  1. What does a significant χ² difference imply here?
  2. In this simulated world (true \((c'=0.10)\)), should you expect full mediation to be rejected as \((N)\) grows? Why?
  3. In your substantive area, what theoretical arguments would justify fixing \((c'=0)\) (if any)?
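An alternative to the LRT that requires no refitting is a Wald test of the constraint on the already-fitted partial-mediation model, via lavaan's `lavTestWald()`. A sketch, assuming `fit_med` from Exercise 1:

```r
# Wald test of H0: cprime = 0, using the partial-mediation fit
lavTestWald(fit_med, constraints = "cprime == 0")
```

With a single constraint and a reasonable sample size, the Wald and LR tests typically lead to the same conclusion.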

Part B — Add a covariate (confounding intuition)

Now we simulate a covariate \((Z)\) that affects both \((M)\) and \((Y)\). Think of \((Z)\) as a plausible confounder.

\[ \begin{aligned} M &= aX + dZ + \varepsilon_M\\ Y &= c'X + bM + eZ + \varepsilon_Y \end{aligned} \]

Show code
set.seed(1235)

Z <- rnorm(N, 0, 1)
d <- 0.60
e <- 0.50

M2 <- a*X + d*Z + rnorm(N, 0, 1)
Y2 <- cprime*X + b*M2 + e*Z + rnorm(N, 0, 1)

dat2 <- data.frame(X, Z, M = M2, Y = Y2)

Exercise 4 — Mediation with covariates

4a) Fit without Z (misspecified on purpose)

Show code
fit_noZ <- sem(mod_med, data = dat2, meanstructure = TRUE)
parameterEstimates(fit_noZ)[parameterEstimates(fit_noZ)$op %in% c("~",":="),
                            c("lhs","op","rhs","label","est","se","pvalue")]
   lhs op          rhs  label    est    se pvalue
1    M  ~            X      a  0.513 0.057  0.000
2    Y  ~            X cprime -0.020 0.063  0.758
3    Y  ~            M      b  0.846 0.050  0.000
10 ind :=          a*b    ind  0.434 0.055  0.000
11 tot := cprime+(a*b)    tot  0.415 0.075  0.000

4b) Fit with Z included

Show code
mod_withZ <- '
  M ~ a*X + d*Z
  Y ~ cprime*X + b*M + e*Z

  ind := a*b
  tot := cprime + (a*b)
'
fit_withZ <- sem(mod_withZ, data = dat2, meanstructure = TRUE)
parameterEstimates(fit_withZ)[parameterEstimates(fit_withZ)$op %in% c("~",":="),
                              c("lhs","op","rhs","label","est","se","pvalue")]
   lhs op          rhs  label   est    se pvalue
1    M  ~            X      a 0.500 0.049    0.0
2    M  ~            Z      d 0.578 0.049    0.0
3    Y  ~            X cprime 0.097 0.059    0.1
4    Y  ~            M      b 0.594 0.053    0.0
5    Y  ~            Z      e 0.558 0.060    0.0
15 ind :=          a*b    ind 0.297 0.040    0.0
16 tot := cprime+(a*b)    tot 0.394 0.060    0.0

Questions

  1. Compare \((b)\) and \((c')\) from fit_noZ vs fit_withZ. Which one looks more biased and why?
  2. In real research, why is “include covariates” not a magic causal fix?
  3. How would you argue for including a covariate in a preregistered SEM?
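To answer question 1 at a glance, you can line up the labelled estimates from the two fits in one table. A sketch, assuming `fit_noZ` and `fit_withZ` from above:

```r
# Side-by-side labelled estimates: without vs with the covariate Z
pe_no   <- parameterEstimates(fit_noZ)
pe_with <- parameterEstimates(fit_withZ)
keep <- c("a", "b", "cprime", "ind", "tot")

merge(pe_no[pe_no$label %in% keep, c("label", "est")],
      pe_with[pe_with$label %in% keep, c("label", "est")],
      by = "label", suffixes = c(".noZ", ".withZ"))
```

The `.noZ` column can then be compared directly against the generating values from Part B.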

Part C — Parallel mediators (still observed variables)

Now simulate two mediators \((M_1)\) and \((M_2)\) in parallel:

\[ \begin{aligned} M_1 &= a_1X + \varepsilon_1\\ M_2 &= a_2X + \varepsilon_2\\ Y &= c'X + b_1M_1 + b_2M_2 + \varepsilon_Y \end{aligned} \]

Show code
set.seed(1236)

a1 <- 0.50; b1 <- 0.60
a2 <- 0.30; b2 <- 0.40
cprime <- 0.10

X <- rnorm(N)
M1 <- a1*X + rnorm(N)
M2 <- a2*X + rnorm(N)
Y  <- cprime*X + b1*M1 + b2*M2 + rnorm(N)

dat3 <- data.frame(X, M1, M2, Y)

Exercise 5 — Fit parallel mediation + compare indirect paths

Show code
mod_par <- '
  M1 ~ a1*X
  M2 ~ a2*X
  Y  ~ cprime*X + b1*M1 + b2*M2

  M1 ~~ M2   # allow mediator covariance (often realistic)

  ind1 := a1*b1
  ind2 := a2*b2
  indT := ind1 + ind2
  tot  := cprime + indT
'

fit_par <- sem(mod_par, data = dat3, se = "bootstrap", bootstrap = 2000,
               meanstructure = TRUE)
pe3 <- parameterEstimates(fit_par, ci = TRUE, standardized = TRUE)
pe3[pe3$op %in% c("~",":="), c("lhs","op","rhs","label","est","ci.lower","ci.upper","std.all")]
    lhs op         rhs  label   est ci.lower ci.upper std.all
1    M1  ~           X     a1 0.517    0.421    0.618   0.447
2    M2  ~           X     a2 0.303    0.193    0.406   0.274
3     Y  ~           X cprime 0.140    0.024    0.267   0.103
4     Y  ~          M1     b1 0.556    0.456    0.653   0.473
5     Y  ~          M2     b2 0.343    0.241    0.444   0.279
15 ind1 :=       a1*b1   ind1 0.288    0.218    0.364   0.212
16 ind2 :=       a2*b2   ind2 0.104    0.060    0.155   0.076
17 indT :=   ind1+ind2   indT 0.392    0.307    0.480   0.288
18  tot := cprime+indT    tot 0.531    0.416    0.654   0.391

Questions

  1. Which indirect effect is larger, ind1 or ind2? Does that match the generating values?
  2. Does allowing M1 ~~ M2 change \((b_1)\) and \((b_2)\) much? Why might it in real data?
  3. What would you need (design/assumptions) to interpret the two mediators causally?
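For question 1 you can go beyond eyeballing: define the difference between the two indirect effects as another parameter and bootstrap its CI. A sketch extending `mod_par` (the `diff12` label is ours, not part of the exercise above):

```r
mod_par_diff <- '
  M1 ~ a1*X
  M2 ~ a2*X
  Y  ~ cprime*X + b1*M1 + b2*M2

  M1 ~~ M2

  ind1   := a1*b1
  ind2   := a2*b2
  diff12 := ind1 - ind2   # contrast of the two indirect effects
'
fit_diff <- sem(mod_par_diff, data = dat3, se = "bootstrap", bootstrap = 2000,
                meanstructure = TRUE)
pe_diff <- parameterEstimates(fit_diff)
pe_diff[pe_diff$label == "diff12", c("est", "ci.lower", "ci.upper")]
```

If the bootstrap CI for `diff12` excludes zero, the two indirect paths differ beyond sampling noise.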

Optional advanced exercise — Equivalent just-identified models

With only \((X,M,Y)\) and all mediation paths free, the model is just-identified \((df=0)\): it estimates as many parameters as there are observed means and (co)variances.

That means global fit cannot distinguish among different directional stories.

Fit an alternative model where Y predicts M (reversing the middle relation), still allowing a direct X → Y path:

\[ \begin{aligned} Y &= i_Y + \alpha X + \varepsilon_Y\\ M &= i_M + \gamma X + \beta Y + \varepsilon_M \end{aligned} \]

Show code
mod_alt <- '
  Y ~ alpha*X
  M ~ gamma*X + beta*Y

  ind_alt := alpha*beta
'
fit_alt <- sem(mod_alt, data = dat, meanstructure = TRUE)

c(
  df_med = fitMeasures(fit_med, "df"),
  df_alt = fitMeasures(fit_alt, "df"),
  chisq_med = fitMeasures(fit_med, "chisq"),
  chisq_alt = fitMeasures(fit_alt, "chisq")
)
      df_med.df       df_alt.df chisq_med.chisq chisq_alt.chisq 
              0               0               0               0 

Questions

  1. Are both models \((df=0)\)? What does that imply about χ² and other global fit indices?
  2. If both fit “perfectly”, how would you choose between them scientifically?
  3. What kinds of data/design would help (time, experiments, instruments, measurement)?
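One more way to see the equivalence, assuming `fit_med` (Exercise 1) and `fit_alt` from just above: both just-identified models reproduce the observed means and covariances exactly, so their log-likelihoods coincide, and likelihood-based indices such as AIC and BIC cannot separate them either.

```r
# Equivalent just-identified models: identical log-likelihood and AIC/BIC
c(ll_med = as.numeric(logLik(fit_med)), ll_alt = as.numeric(logLik(fit_alt)))
c(aic_med = AIC(fit_med), aic_alt = AIC(fit_alt))
c(bic_med = BIC(fit_med), bic_alt = BIC(fit_alt))
```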

If reversing a path yields a significant estimate, that does not make the model better or more true: equivalent just-identified models reproduce the data equally well, and only theory or design can adjudicate between them.


Wrap-up

Take-home (lab)

  • Mediation is a system of equations; the indirect effect is a defined parameter \((ab)\)
  • For \((ab)\), bootstrap CIs are often more trustworthy than normal-theory tests
  • Testing “full mediation” is a constraint test—not a goal by itself
  • Covariates help, but causal claims still require assumptions and design

Solutions (instructor version)