Lab 08 — Ordinal CFA/SEM: thresholds, WLSMV, and ordinal invariance

Author

Tommaso Feraco

Goals

In this lab you will:

  • Create an ordinal version of a familiar CFA dataset
  • Fit a CFA treating items as continuous (wrong-on-purpose) vs ordinal (WLSMV)
  • Interpret thresholds and understand why ordinal models replace intercepts with them
  • Use CFA diagnostics (residuals + MI/EPC) in the ordinal setting
  • Test a simple ordinal measurement invariance ladder across groups

Setup

Show code
library(lavaan)
library(semTools)

Data: Holzinger & Swineford (1939), made ordinal

We use the classic HolzingerSwineford1939 dataset and discretize the indicators into 5 ordered categories (a “Likert-ification”).

Show code
dat_full <- HolzingerSwineford1939
vars <- paste0("x", 1:9)

make_ordinal <- function(x, k = 5) {
  # quantile-based cutpoints (balanced categories)
  qs <- quantile(x, probs = seq(0, 1, length.out = k + 1), na.rm = TRUE, type = 8)
  qs <- unique(qs)
  if (length(qs) < 3) {
    qs <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = k + 1)
  }
  cut(x, breaks = qs, include.lowest = TRUE, ordered_result = TRUE)
}

dat_ord <- dat_full
for (v in vars) dat_ord[[v]] <- make_ordinal(dat_full[[v]], k = 5)

# A "wrong-on-purpose" version that treats ordinal codes as continuous numbers
dat_ord_num <- dat_ord
for (v in vars) dat_ord_num[[v]] <- as.numeric(dat_ord[[v]])

# sanity check
lapply(dat_ord[vars], table) |> head(2)
$x1

  [0.667,4]    (4,4.67] (4.67,5.33]  (5.33,5.9]   (5.9,8.5] 
         65          60          72          44          60 

$x2

[2.25,5.25] (5.25,5.75] (5.75,6.25]    (6.25,7]    (7,9.25] 
         90          54          53          53          51 

Measurement model (same as CFA lab)

Show code
m3 <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

Exercise 1 — Fit CFA treating ordinal as continuous vs ordinal as ordinal

1a) Continuous (ML) on ordinal codes (wrong-on-purpose)

Show code
fit_cont <- cfa(m3, data = dat_ord_num, std.lv = TRUE)  # ML default
fitMeasures(fit_cont, c("chisq","df","cfi","tli","rmsea","srmr"))
 chisq     df    cfi    tli  rmsea   srmr 
79.046 24.000  0.928  0.891  0.087  0.060 

1b) Ordinal (WLSMV) with thresholds

Show code
fit_ord <- cfa(
  m3,
  data = dat_ord,
  ordered = vars,              # tells lavaan these are ordinal
  std.lv = TRUE,
  parameterization = "theta"   # common choice for categorical indicators
  # estimator is set automatically to WLSMV when ordered= is used
)

fitMeasures(fit_ord, c("chisq","df","cfi","tli","rmsea","srmr"))
 chisq     df    cfi    tli  rmsea   srmr 
56.708 24.000  0.987  0.980  0.067  0.065 
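Note that with WLSMV, fitMeasures() also returns scaled (mean- and variance-adjusted) versions of these indices, which are generally the ones to report; the exact set of names available can depend on your lavaan version. A quick way to pull them:

```r
# Scaled/robust versions of the fit indices for the WLSMV fit;
# with categorical estimators these are usually the reported values
fitMeasures(fit_ord, c("chisq.scaled", "df.scaled",
                       "cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))
```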

Questions (answer in words)

  1. Which fit indices change the most between the two analyses?
  2. Do standardized loadings change meaningfully? Why might they?
  3. What is the conceptual mistake of treating ordinal codes as if they were continuous?
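To see the point of question 3 directly, compare Pearson correlations computed on the ordinal codes with polychoric correlations (lavaan's lavCor()), which estimate the correlation of the underlying continuous responses. This is an optional check, not part of the required output:

```r
# Pearson correlations treat the 1..5 codes as interval-scaled numbers
round(cor(dat_ord_num[vars])[1:3, 1:3], 2)

# Polychoric correlations estimate the correlation of the latent y*
# underlying each pair of ordinal items
round(lavCor(dat_ord[vars], ordered = vars)[1:3, 1:3], 2)
```

Coarse categorization typically attenuates Pearson correlations, so the polychoric values are usually somewhat larger in absolute value.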

Exercise 2 — Compare key parameters (loadings + factor correlations)

Show code
pe_cont <- parameterEstimates(fit_cont, standardized = TRUE)
pe_ord  <- parameterEstimates(fit_ord,  standardized = TRUE)

load_cont <- subset(pe_cont, op == "=~")[, c("lhs","rhs","est","se","pvalue","std.all")]
load_ord  <- subset(pe_ord,  op == "=~")[, c("lhs","rhs","est","se","pvalue","std.all")]

phi_cont <- subset(pe_cont, op == "~~" & lhs %in% c("visual","textual","speed") & rhs %in% c("visual","textual","speed") & lhs != rhs)[, c("lhs","rhs","est","std.all")]
phi_ord  <- subset(pe_ord,  op == "~~" & lhs %in% c("visual","textual","speed") & rhs %in% c("visual","textual","speed") & lhs != rhs)[, c("lhs","rhs","est","std.all")]

list(load_cont = load_cont, load_ord = load_ord,
     phi_cont = phi_cont, phi_ord = phi_ord)
$load_cont
      lhs rhs   est    se pvalue std.all
1  visual  x1 1.060 0.097      0   0.750
2  visual  x2 0.634 0.097      0   0.432
3  visual  x3 0.815 0.094      0   0.568
4 textual  x4 1.056 0.070      0   0.774
5 textual  x5 1.178 0.070      0   0.844
6 textual  x6 1.228 0.072      0   0.849
7   speed  x7 0.689 0.095      0   0.486
8   speed  x8 0.837 0.096      0   0.596
9   speed  x9 1.032 0.102      0   0.729

$load_ord
      lhs rhs   est    se pvalue std.all
1  visual  x1 1.520 0.394      0   0.835
2  visual  x2 0.495 0.088      0   0.443
3  visual  x3 0.674 0.106      0   0.559
4 textual  x4 1.472 0.160      0   0.827
5 textual  x5 1.819 0.226      0   0.876
6 textual  x6 1.767 0.195      0   0.870
7   speed  x7 0.624 0.096      0   0.529
8   speed  x8 0.820 0.121      0   0.634
9   speed  x9 1.430 0.308      0   0.820

$phi_cont
       lhs     rhs   est std.all
22  visual textual 0.488   0.488
23  visual   speed 0.516   0.516
24 textual   speed 0.235   0.235

$phi_ord
       lhs     rhs   est std.all
58  visual textual 0.490   0.490
59  visual   speed 0.491   0.491
60 textual   speed 0.232   0.232

Questions

  1. Are factor correlations higher/lower under ordinal modeling? Why can that happen?
  2. Which indicators look weakest in standardized terms? Is that stable across estimators?
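To make the estimator comparison easier to scan, you can merge the two loading tables produced above into one data frame (a small convenience step, not required for the deliverables):

```r
# Side-by-side standardized loadings from the ML and WLSMV fits,
# sorted by the size of the discrepancy between estimators
cmp <- merge(load_cont[, c("lhs", "rhs", "std.all")],
             load_ord[,  c("lhs", "rhs", "std.all")],
             by = c("lhs", "rhs"), suffixes = c(".ml", ".wlsmv"))
cmp$diff <- round(cmp$std.all.wlsmv - cmp$std.all.ml, 3)
cmp[order(abs(cmp$diff), decreasing = TRUE), ]
```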

Exercise 3 — Thresholds: what replaces intercepts?

For ordinal indicators, the model is expressed in terms of an underlying continuous latent response \(y^*\). The observed category reflects which thresholds \(y^*\) crosses:

  • category 1 if \(y^* \le \tau_1\)
  • category 2 if \(\tau_1 < y^* \le \tau_2\)
  • …
  • category K if \(y^* > \tau_{K-1}\)

In lavaan, thresholds appear with op == "|".

Show code
thr <- subset(pe_ord, op == "|")[, c("lhs","op","rhs","est","se")]
head(thr, 18)
   lhs op rhs    est    se
10  x1  |  t1 -1.430 0.280
11  x1  |  t2 -0.389 0.147
12  x1  |  t3  0.723 0.189
13  x1  |  t4  1.536 0.315
14  x2  |  t1 -0.588 0.086
15  x2  |  t2 -0.060 0.081
16  x2  |  t3  0.443 0.084
17  x2  |  t4  1.067 0.097
18  x3  |  t1 -0.868 0.100
19  x3  |  t2 -0.289 0.089
20  x3  |  t3  0.383 0.090
21  x3  |  t4  1.047 0.106
22  x4  |  t1 -1.419 0.159
23  x4  |  t2 -0.230 0.128
24  x4  |  t3  0.921 0.144
25  x4  |  t4  1.702 0.173
26  x5  |  t1 -1.703 0.215
27  x5  |  t2 -0.533 0.153

You can also inspect sample thresholds:

Show code
lavInspect(fit_ord, "sampstat")$th |> head()
  x1|t1   x1|t2   x1|t3   x1|t4   x2|t1   x2|t2 
-0.7860 -0.2140  0.3975  0.8440 -0.5273 -0.0542 
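These sample thresholds are simply standard-normal quantiles of the cumulative category proportions, so you can reproduce them by hand. For x1, this should match the x1|t1 … x1|t4 values shown above:

```r
# Cumulative proportions of x1's categories, mapped through qnorm();
# the last cumulative proportion (1.0) is dropped because K categories
# need only K-1 thresholds
p_cum <- cumsum(prop.table(table(dat_ord$x1)))
round(qnorm(p_cum[-length(p_cum)]), 4)
```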

Tasks

  1. Pick one item (say x1) and report its thresholds (how many? why?).
  2. What would “more extreme” thresholds imply about item difficulty/endorsement (in plain language)?

Exercise 4 — Local diagnostics in ordinal CFA

4a) Standardized residuals

Show code
res_std <- residuals(fit_ord, type = "cor")$cov
res_std
       x1     x2     x3     x4     x5     x6     x7     x8     x9
x1  0.000                                                        
x2 -0.052  0.000                                                 
x3 -0.006  0.101  0.000                                          
x4  0.090  0.050 -0.079  0.000                                   
x5 -0.016  0.020 -0.133 -0.004  0.000                            
x6  0.017 -0.001 -0.048 -0.012  0.010  0.000                     
x7 -0.144 -0.165 -0.073  0.056 -0.013  0.001  0.000              
x8 -0.065 -0.043  0.008 -0.058 -0.054 -0.033  0.161  0.000       
x9  0.068  0.032  0.150  0.046  0.061 -0.028 -0.074 -0.072  0.000

Find the largest absolute residuals:

Show code
R <- res_std
diag(R) <- NA
top_res <- as.data.frame(as.table(R))
top_res <- top_res[order(abs(top_res$Freq), decreasing = TRUE), ]
head(top_res, 10)
   Var1 Var2   Freq
16   x7   x2 -0.165
56   x2   x7 -0.165
62   x8   x7  0.161
70   x7   x8  0.161
27   x9   x3  0.150
75   x3   x9  0.150
7    x7   x1 -0.144
55   x1   x7 -0.144
23   x5   x3 -0.133
39   x3   x5 -0.133

4b) Modification indices + EPC/SEPC

Show code
mi <- modificationIndices(fit_ord, sort. = TRUE)
head(mi[, c("lhs","op","rhs","mi","epc","sepc.all")], 15)
        lhs op rhs    mi    epc sepc.all
133      x7 ~~  x8 29.83 -0.651   -0.651
87   visual =~  x9 29.44 -1.156   -0.662
135      x8 ~~  x9 12.48  1.043    1.043
85   visual =~  x7 12.04  0.348    0.295
90  textual =~  x3 11.41  0.338    0.280
120      x3 ~~  x9  9.73 -0.446   -0.446
112      x2 ~~  x7  7.30  0.239    0.239
105      x1 ~~  x7  7.03  0.407    0.407
93  textual =~  x9  6.51 -0.314   -0.180
86   visual =~  x8  5.60  0.309    0.239
88  textual =~  x1  5.60 -0.496   -0.273
134      x7 ~~  x9  5.50  0.522    0.522
116      x3 ~~  x5  5.10  0.413    0.413
96    speed =~  x3  4.72 -0.242   -0.201
92  textual =~  x8  4.30  0.158    0.122

Questions

  1. Do residuals and MI point to the same problematic pairs?
  2. In an ordinal CFA, which modifications are most common?
    • correlated residuals (local dependence / method)
    • cross-loadings (threaten simple structure)
    • freeing thresholds/loadings across groups (invariance context)

Exercise 5 — Ordinal measurement invariance across groups (school)

We test invariance across school (Pasteur vs Grant-White).
We keep the same 3-factor model and treat items as ordinal.

5a) Configural model (same pattern of loadings)

Show code
dat_ord_g <- dat_ord
dat_ord_g$school <- dat_full$school  # keep group variable

fit_g0 <- cfa(
  m3, data = dat_ord_g,
  group = "school",
  ordered = vars,
  std.lv = TRUE,
  parameterization = "theta"
)

fitMeasures(fit_g0, c("cfi","tli","rmsea","srmr"))
  cfi   tli rmsea  srmr 
0.988 0.982 0.062 0.073 

5b) Threshold invariance (equal thresholds)

Show code
fit_g1 <- cfa(
  m3, data = dat_ord_g,
  group = "school",
  group.equal = c("thresholds"),
  ordered = vars,
  std.lv = TRUE,
  parameterization = "theta"
)

fitMeasures(fit_g1, c("cfi","tli","rmsea","srmr"))
  cfi   tli rmsea  srmr 
0.970 0.970 0.082 0.075 

5c) Threshold + loading invariance (often the key step)

Show code
fit_g2 <- cfa(
  m3, data = dat_ord_g,
  group = "school",
  group.equal = c("thresholds", "loadings"),
  ordered = vars,
  std.lv = TRUE,
  parameterization = "theta"
)

fitMeasures(fit_g2, c("cfi","tli","rmsea","srmr"))
  cfi   tli rmsea  srmr 
0.972 0.974 0.076 0.075 

5d) Compare nested models (WLSMV difference tests)

Show code
lavTestLRT(fit_g0, fit_g1, fit_g2)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan->lavTestLRT():  
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the 
   robust test that should be reported per model. A robust difference test is 
   a function of two standard (not robust) statistics.
       Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)    
fit_g0 48          75.7                                  
fit_g1 72         143.7       93.7      24    3.4e-10 ***
fit_g2 78         144.6        1.0       6       0.98    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
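semTools (already loaded in Setup) also offers compareFit() as a compact summary of the same ladder, showing fit indices and their changes side by side; the output layout may differ across semTools versions:

```r
# Fit indices and difference tests for the invariance ladder in one table
summary(compareFit(fit_g0, fit_g1, fit_g2))
```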

Interpretation tasks

  1. Does adding equal thresholds worsen fit meaningfully?
  2. Does adding equal loadings (on top of thresholds) worsen fit meaningfully?
  3. Based on this ladder, what comparisons would you feel comfortable making?
    • factor structure? (configural)
    • relations among factors? (threshold+loading invariance often needed)
    • latent means? (typically needs at least thresholds+loadings; partial invariance is discussed later)

Wrap-up

Deliverables

  1. A short paragraph: continuous-vs-ordinal CFA — what changed and why?
  2. A table (or bullet list) summarizing:
    • top thresholds for one item
    • top residual pairs + what you think they represent
  3. Invariance summary across school: configural vs thresholds vs thresholds+loadings

Solutions (instructor version)