Chi Square Difference Test | Nested Model Fit Decisions

A chi-square difference check compares two nested models by testing whether the added constraints worsen fit beyond what random sampling can explain.

You run two models that use the same variables and the same cases. One model is looser (fewer constraints). The other is tighter (more constraints). The chi-square difference step asks one plain question: did the tighter model lose too much fit?

This is the same logic as a likelihood-ratio test. Many packages print it as an “LRT” or “χ² diff” row.
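For maximum-likelihood fits, the same statistic can be reached from the two log-likelihoods instead of the two chi-square values. The sketch below is a minimal illustration of that idea; the function name, argument names, and numbers are hypothetical and do not come from any package.

```python
def lrt_statistic(loglik_looser, loglik_tighter, n_params_looser, n_params_tighter):
    """Return (statistic, df) for a likelihood-ratio comparison of nested models.

    statistic = -2 * (logL_tighter - logL_looser)
    df        = number of free parameters removed by the added constraints
    """
    if n_params_tighter >= n_params_looser:
        raise ValueError("the tighter model should estimate fewer free parameters")
    statistic = -2.0 * (loglik_tighter - loglik_looser)
    df = n_params_looser - n_params_tighter
    return statistic, df


# Hypothetical log-likelihoods from two nested fits on the same cases.
stat, df = lrt_statistic(
    loglik_looser=-2450.3,
    loglik_tighter=-2456.1,
    n_params_looser=31,
    n_params_tighter=28,
)
print(f"LRT statistic = {stat:.2f} on {df} df")
```

The statistic is then referred to a chi-square distribution with that df, exactly as with a Δχ² computed from model chi-squares.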

Chi-square difference testing for nested models and constraints

Think of a “nested” pair as two models where the tighter one can be created by adding constraints to the looser one. Same dataset. Same indicators or variables. Same estimation target. Only the constraint set changes.

The test uses two numbers from each run: the model chi-square and its degrees of freedom (df). You compute:

  • Δχ² = χ²(tighter) − χ²(looser)
  • Δdf = df(tighter) − df(looser)

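Then you refer Δχ² to a chi-square distribution with Δdf degrees of freedom. If the added constraints hold in the population, Δχ² follows that distribution approximately, so the upper-tail probability is your p-value. A small p-value means the added constraints cost more fit than you’d expect from sampling noise.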
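The arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the fit values are made up, `chi_square_difference` is a hypothetical helper name, and `chi2_sf` is a small standard-library stand-in for a chi-square upper-tail function that handles integer df only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # ... then step up with the recurrence Q(a + 1, h) = Q(a, h) + h^a e^-h / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi_square_difference(chisq_looser, df_looser, chisq_tighter, df_tighter):
    """Return (delta_chisq, delta_df, p_value) for an unscaled nested comparison."""
    delta_chisq = chisq_tighter - chisq_looser
    delta_df = df_tighter - df_looser
    if delta_df <= 0:
        raise ValueError("the tighter model must have more degrees of freedom")
    if delta_chisq < 0:
        raise ValueError("the tighter model should not fit better; check nesting and data")
    return delta_chisq, delta_df, chi2_sf(delta_chisq, delta_df)


# Hypothetical fit results: looser chi-square(40) = 85.2, tighter chi-square(44) = 97.9.
d_chisq, d_df, p = chi_square_difference(85.2, 40, 97.9, 44)
print(f"delta chi-square = {d_chisq:.2f}, delta df = {d_df}, p = {p:.4f}")
```

In real work a library routine for the chi-square distribution would replace `chi2_sf`; the hand-rolled version is here only to keep the example self-contained.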

Chi-square difference test in one sentence

It’s a nested-model check: “Are the extra constraints still plausible, given the data?”

When the test is the right tool

Use it when you are comparing nested models and you truly need a formal pass/fail signal on constraints. Common uses include:

  • Measurement invariance steps (configural → metric → scalar → strict)
  • Testing whether a block of paths can be fixed to zero
  • Checking whether correlations can be set equal
  • Comparing a one-factor model to a two-factor model when one is a constrained version of the other

Two quick “don’ts” save headaches:

  • Don’t use it for models that are not nested. The subtraction step may still output a number, but it does not carry the chi-square reference distribution you’re relying on.
  • Don’t mix samples. If one model used listwise deletion and the other used FIML with missing data, you no longer have a clean nested pair.

What you need before you trust the result

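The classic difference test relies on the standard maximum-likelihood chi-square statistic under its usual conditions: a correctly specified looser model, roughly multivariate-normal data, and a sample large enough for the asymptotic approximation to hold. In practice, the two checks below catch most mistakes.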

Check that the models are truly nested

A nested pair shares the same observed variables and the same cases, and the tighter model can be formed by adding constraints to the looser model. In SEM software, the looser model estimates more free parameters and therefore has fewer degrees of freedom.

Check expected counts or estimation type

In contingency-table work, chi-square approximations rely on adequate expected counts per cell. In SEM, the shape of the reference distribution depends on the estimator. Scaled estimators often need a scaled difference method, not a raw subtraction.

The NIST chi-square goodness-of-fit section gives a clear statement of the chi-square idea and why expected frequencies and reference distributions matter. That same logic carries into model comparison: you’re leaning on an approximation, so you want the conditions to match the statistic you’re using.

How to run the test step by step

Here’s a workflow that works in SEM, CFA, IRT, and many GLM-style settings where nested models share a likelihood-based fit measure.

Step 1: Fit the looser model first

Start with the model with fewer constraints. Save its χ² and df. Also record the estimator you used (ML, MLM, MLR, WLSMV, and so on). This detail decides whether you can use the plain subtraction.

Step 2: Fit the tighter model on the same data

Apply the added constraints. Fit again. Confirm the sample size and the variable set match the first run. If your software silently dropped cases in one run and not the other, stop and fix that before you go on.
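A quick automated check can catch silent case drops before you compute anything. The sketch below assumes you have copied the sample size, variable list, and estimator label from each output into a plain dictionary; the dictionary keys and the function name are hypothetical, not a format any package exports.

```python
def check_same_data(fit_a, fit_b):
    """Return a list of problems; an empty list means the pairing looks clean."""
    problems = []
    if fit_a["n_obs"] != fit_b["n_obs"]:
        problems.append(f"sample sizes differ: {fit_a['n_obs']} vs {fit_b['n_obs']}")
    if set(fit_a["variables"]) != set(fit_b["variables"]):
        only_a = sorted(set(fit_a["variables"]) - set(fit_b["variables"]))
        only_b = sorted(set(fit_b["variables"]) - set(fit_a["variables"]))
        problems.append(f"variable sets differ: {only_a} vs {only_b}")
    if fit_a["estimator"] != fit_b["estimator"]:
        problems.append(f"estimators differ: {fit_a['estimator']} vs {fit_b['estimator']}")
    return problems


# Hypothetical summaries typed in from two model outputs.
fit_looser = {"n_obs": 412, "variables": ["x1", "x2", "x3", "x4"], "estimator": "ML"}
fit_tighter = {"n_obs": 398, "variables": ["x1", "x2", "x3", "x4"], "estimator": "ML"}

problems = check_same_data(fit_looser, fit_tighter)
for message in problems:
    print("STOP:", message)
```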

Step 3: Compute Δχ² and Δdf

Subtract the looser model’s statistics from the tighter model’s statistics. If your software prints an “anova” table for the pair, check it matches your manual subtraction. Small discrepancies can happen if the tool is using a scaled method behind the scenes.

Step 4: Convert the difference to a p-value

With the classic method, you treat Δχ² as chi-square with Δdf degrees of freedom and read the upper-tail probability. Many packages do this automatically. If you want to sanity-check by hand, a chi-square table or a chi-square CDF function will match what the software prints.

Step 5: Decide what to do with the result

A result with a large p-value means the tighter model’s loss of fit looks consistent with sampling noise. Many teams treat that as permission to keep the tighter model for its simplicity and cleaner interpretation.

A result with a small p-value means the tighter model lost too much fit. At that point, the next move depends on your goal. You might keep the looser model, or relax only a small subset of constraints and test again.

What to do with scaled estimators and scaled statistics

Modern SEM workflows often use scaled test statistics to deal with non-normality or other issues. When the printed model χ² is scaled, the raw subtraction can be wrong. In that case, you need a scaled difference method that matches your estimator.

Mplus publishes a plain-language walk-through for Satorra–Bentler scaled difference testing, including the scaling correction pieces you need to compute the adjusted difference statistic: “Chi-Square Difference Testing Using the Satorra-Bentler Scaled Chi-Square”.

If you use lavaan, the package provides functions that handle nested comparisons and, when requested, the relevant scaled methods for many estimators. The lavaan lavTestLRT documentation describes how model sequences are ordered by degrees of freedom and compared.

Rule of thumb: if your output labels the test statistic as scaled, or labels like MLR/MLM/WLSMV appear, do not assume the simple Δχ² subtraction is valid. Use the method your software documents for your estimator.
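For MLM or MLR output, the Satorra–Bentler scaled difference is usually computed from each model’s scaled chi-square, df, and scaling correction factor. The sketch below follows the formula as it is commonly stated for those estimators; the function name and all input values are hypothetical, and you should confirm the procedure against your own software’s documentation before relying on it.

```python
def satorra_bentler_scaled_diff(t0, d0, c0, t1, d1, c1):
    """Scaled chi-square difference for a nested pair.

    t0, d0, c0: scaled chi-square, df, and scaling correction factor of the
                tighter (nested) model.
    t1, d1, c1: the same three values for the looser (comparison) model.
    Returns (scaled_difference, delta_df, difference_scaling_correction).
    """
    if d0 <= d1:
        raise ValueError("the tighter model must have more degrees of freedom")
    # Scaling correction for the difference test.
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    if cd <= 0:
        # This can happen in practice; the statistic is not interpretable then.
        raise ValueError("non-positive difference scaling correction")
    # Multiplying each scaled chi-square by its correction factor recovers the
    # unscaled value; the difference is then rescaled by cd.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, d0 - d1, cd


# Hypothetical output values for a tighter and a looser model.
trd, delta_df, cd = satorra_bentler_scaled_diff(
    t0=120.0, d0=50, c0=1.20,
    t1=100.0, d1=45, c1=1.25,
)
print(f"scaled difference = {trd:.2f} on {delta_df} df (cd = {cd:.2f})")
```

Note that the raw subtraction here would give 20.0, not the scaled value, which is the gap the rule of thumb warns about.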

Common setups and the right difference test to use

The same phrase gets used for several nearby tasks. This table helps you match the setup to the right computation and the right caution.

Setup | What changes between models | Safer test choice
Basic nested SEM (ML) | Extra equality constraints or fixed paths | Raw Δχ² with Δdf
Measurement invariance (continuous, ML) | Loadings, intercepts, residuals fixed equal | Raw Δχ² plus fit index checks
Scaled ML (MLR/MLM) | Same as above, with scaled χ² | Scaled difference method tied to estimator
WLSMV / categorical indicators | Thresholds and loadings constrained | Software’s DIFFTEST-style method
Contingency tables (independence) | Observed counts vs expected under independence | Pearson χ² with df = (r−1)(c−1)
Goodness-of-fit to a fixed distribution | Observed counts vs theoretical probabilities | Chi-square GOF with merged bins if needed
Model comparison via log-likelihood | Nested likelihoods in GLM or SEM | Likelihood-ratio test (often χ²-based)
Small expected counts | Sparse tables or rare categories | Exact methods or Monte Carlo p-value

Interpreting results without getting trapped by p-values

A difference test gives a yes/no signal on constraints, but your decision often needs more than that. Here are practical guardrails that keep the result in context.

Use the result as one input, not the whole decision

With big samples, tiny misfit can yield a tiny p-value. With small samples, large misfit can slip through with a big p-value. That’s not a flaw in the software; it’s how statistical power scales with sample size.
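A small numerical sketch makes the sample-size effect concrete. It assumes the ML test statistic is (N − 1) times the minimized fit function (some programs use N instead), holds a hypothetical fit-function gap of 0.01 between the two models fixed, and uses a standard-library stand-in, `chi2_sf`, for the chi-square upper tail with integer df.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h^a e^-h / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


FIT_GAP = 0.01   # hypothetical difference in minimized fit function values
DELTA_DF = 3     # hypothetical number of added constraints

p_by_n = {}
for n in (200, 1000, 5000):
    delta_chisq = (n - 1) * FIT_GAP   # assumes statistic = (N - 1) * F_min
    p_by_n[n] = chi2_sf(delta_chisq, DELTA_DF)
    print(f"N = {n:>5}: delta chi-square = {delta_chisq:6.2f}, p = {p_by_n[n]:.4g}")
```

The misfit per case is identical in every row; only N changes, and the p-value moves from clearly non-significant to vanishingly small.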

To keep your decision grounded, pair the difference test with effect-size style fit checks. In SEM, teams often review CFI, RMSEA, SRMR, and parameter changes. You’re checking whether the constraint changes the model in a way that changes your story, not only whether a tail probability crossed a threshold.

Watch the direction of change

In nested comparisons, the tighter model should have:

  • Higher df
  • Equal or higher χ²

If you see the opposite, stop. It can happen when the models are not nested, the data differ, or the printed statistic is scaled in a way that needs special handling.
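The direction check is easy to automate. The sketch below is a hypothetical helper, not a function from any SEM package, and the small tolerance for rounding in printed output is an assumption you can adjust.

```python
def direction_problems(chisq_looser, df_looser, chisq_tighter, df_tighter, tol=1e-8):
    """Return a list of warnings if the pair does not behave like a nested pair."""
    problems = []
    if df_tighter <= df_looser:
        problems.append("tighter model does not have higher df; check model order")
    if chisq_tighter < chisq_looser - tol:
        problems.append(
            "tighter model has lower chi-square; check nesting, data, or scaling"
        )
    return problems


# Hypothetical pair where the models were entered in the wrong order.
warnings_found = direction_problems(
    chisq_looser=97.9, df_looser=44, chisq_tighter=85.2, df_tighter=40
)
for message in warnings_found:
    print("STOP:", message)
```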

Report what a reader needs to reproduce the check

A clean report line usually includes: χ² and df for each model, the estimator, Δχ², Δdf, and the p-value. If you used a scaled method, name it and give the corrected difference statistic the software produced.
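If you report many comparisons, a small formatter keeps the lines consistent. This is a sketch with hypothetical names and values; the difference statistic and p-value are passed in rather than computed, so a scaled result from your software can be reported as printed.

```python
def format_p(p):
    """Format a p-value to three decimals, with a floor at .001."""
    return "p < .001" if p < 0.001 else f"p = {p:.3f}"


def report_line(looser, tighter, estimator, delta_chisq, delta_df, p_value,
                method="unscaled chi-square difference"):
    """Build one reproducible report line for a nested comparison."""
    return (
        f"{looser['name']}: χ²({looser['df']}) = {looser['chisq']:.2f}; "
        f"{tighter['name']}: χ²({tighter['df']}) = {tighter['chisq']:.2f}; "
        f"estimator = {estimator}; {method}: "
        f"Δχ²({delta_df}) = {delta_chisq:.2f}, {format_p(p_value)}"
    )


# Hypothetical values for a metric vs. scalar invariance step.
line = report_line(
    looser={"name": "Metric model", "chisq": 85.2, "df": 40},
    tighter={"name": "Scalar model", "chisq": 97.9, "df": 44},
    estimator="ML",
    delta_chisq=12.7,
    delta_df=4,
    p_value=0.013,
)
print(line)
```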

Mistakes that cause wrong chi-square difference results

Most errors come from one of five sources.

Mixing data handling between models

If one model used different missing-data handling or a different sample, you broke the nested pairing. Fix the data step first, then rerun both models.

Comparing models that share variables but are not nested

Two models can look similar and still fail the nested rule. A two-factor model and a one-factor model can be nested in some cases, but only when the tighter model is directly a constrained version of the looser model with the same indicators and structure. If the structure differs in a way that cannot be written as constraints, the difference test is off the table.

Using raw subtraction with a scaled statistic

If your software prints “scaled”, or labels like MLR/MLM/WLSMV appear, treat that as a red flag. Use the package’s documented difference procedure, such as the Satorra–Bentler scaled difference for MLM/MLR or a DIFFTEST-style routine for WLSMV.

Forgetting that df change must match the constraint count

In many setups, Δdf equals the number of constraints you added. If it doesn’t, reread what actually changed between models. A mislabeled run or a copied syntax block can trick you here.

Trusting a p-value when expected counts are thin

In contingency tables, sparse expected counts weaken the chi-square approximation. Some tools offer simulated p-values or exact tests. The R stats::chisq.test help page is a good reference for continuity correction options and simulated p-values in count-data settings.
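Before trusting a table-based chi-square, it helps to list the cells whose expected counts are low. The sketch below computes expected counts under independence and flags cells below a cutoff; the cutoff of 5 is a common rule of thumb used here as an assumption, and the function names and the example table are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def sparse_cells(table, minimum=5.0):
    """Return (row, column, expected) for every cell below the cutoff."""
    expected = expected_counts(table)
    return [
        (i, j, value)
        for i, row in enumerate(expected)
        for j, value in enumerate(row)
        if value < minimum
    ]


# Hypothetical 2 x 2 table of observed counts.
observed = [[12, 3],
            [9, 1]]

flagged = sparse_cells(observed)
for i, j, value in flagged:
    print(f"cell ({i}, {j}) has expected count {value:.1f}")
```

If any cells are flagged, that is the cue to consider an exact test or a simulated p-value instead of the asymptotic one.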

Practical checklist for cleaner model comparison

Use this list right before you write up results. It’s short on purpose, and it catches most “why is this weird?” moments.

Checkpoint | What to verify | What it prevents
Same cases | N matches in both outputs | Invalid pairing
Same variables | Identical indicator set and coding | Non-nested comparison
Nesting is real | Tighter model equals looser model + constraints | Misleading Δχ²
Estimator matches method | Scaled stats use the right diff routine | Wrong p-value
Δdf is sensible | Δdf lines up with constraints you added | Syntax errors
Direction check | df goes up; χ² does not go down | Order mistakes
Decision matches goal | Fit loss weighed against interpretability | Overfitting or oversimplifying

Writing a clear results paragraph

Keep it readable: name the two models, state what constraints changed, then report each model’s χ²(df), Δχ², Δdf, and the p-value. Finish with the choice you made.

References & Sources