A chi-square contingency test checks whether two categorical variables are associated by comparing observed counts to the counts you’d expect if there were no link.
You’ve got a table of counts. Two categorical variables. Maybe brand vs. device type, plan vs. churn, symptom vs. test result, or traffic source vs. sign-up. You want one clean answer: are these two variables linked, or are the differences just noise?
That’s the job of the chi-square contingency test (often called a chi-square test of independence). It’s fast, widely used, and easy to explain to non-stat folks if you frame it the right way.
This article shows a practical way to run it, read it, and avoid the common traps that make results shaky.
What The Chi-square Test Is Doing Under The Hood
A contingency table is just a grid of counts. Rows are categories of one variable. Columns are categories of another variable. Each cell is how many cases landed in that row-column combo.
The test starts with a simple null idea: the row variable and column variable are independent, meaning that knowing one tells you nothing about the other. If that null were true, you could predict each cell’s count from just the row total and column total.
Those predicted numbers are called expected counts. The chi-square statistic sums up how far observed counts stray from expected counts: each squared gap is divided by its expected count, so big cells don’t drown out small ones.
Expected Counts In One Line
For any cell, expected count = (row total × column total) ÷ grand total. That’s the core calculation taught in many stats courses, including Penn State’s lesson on the chi-square test of independence: Chi-square test for independence.
If your observed counts track those expected counts closely, the statistic stays small. If they stray a lot across many cells, the statistic grows.
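Here’s a minimal sketch of that arithmetic in Python with NumPy. The plan-tier table and its labels are made up for illustration; only the formulas come from the text above.

```python
import numpy as np

# Hypothetical 3 x 2 table: plan tier (rows) by churned / stayed (columns).
observed = np.array([
    [55, 145],   # Basic
    [50, 250],   # Pro
    [15,  85],   # Enterprise
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count for each cell = (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

# Each squared gap is divided by its expected count, then everything is summed.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_stat, 3))
```

The expected table keeps the same row and column totals as the observed one, which is a quick sanity check if you ever compute this by hand.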
Degrees Of Freedom Without The Headache
For an r × c table, degrees of freedom = (r − 1) × (c − 1). That comes from how many cells are free to vary after the row and column totals are set. A 3 × 2 table, for example, has (3 − 1) × (2 − 1) = 2 degrees of freedom.
The test then turns the chi-square statistic and degrees of freedom into a p-value using the chi-square distribution, covered in the NIST handbook pages on the distribution and critical values: Chi-square distribution and critical values table.
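As a rough sketch, the conversion from statistic to p-value can be done with SciPy’s `chi2` distribution object. The statistic and table shape below are assumed values for illustration, and the 0.05 level used for the critical value is just a common convention.

```python
from scipy.stats import chi2

# Assumed inputs: a 3 x 2 table with an already-computed statistic.
n_rows, n_cols = 3, 2
chi2_stat = 10.677

dof = (n_rows - 1) * (n_cols - 1)

# Upper-tail probability: chance of a statistic at least this large under "no link".
p_value = chi2.sf(chi2_stat, dof)

# Critical value for a 0.05 significance level, as found in printed tables.
critical_05 = chi2.ppf(0.95, dof)

print(dof, round(p_value, 4), round(critical_05, 3))
```

Comparing the statistic to the critical value and comparing the p-value to 0.05 are two views of the same decision.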
When This Test Fits And When It Doesn’t
This test fits when your data are counts in categories, and each observation lands in one cell. It’s a natural match for surveys, experiments with grouped outcomes, and event logs where each record maps to one row label and one column label.
It’s not a match for averages, medians, or measurements like time-on-page. If the data are numeric measurements, you’re in a different toolset.
Use Counts, Not Percentages
Run the test on raw counts. Percent tables are nice for reading, but the math needs counts. If all you have are percentages, you’ll need the sample size behind them to rebuild counts.
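A small sketch of rebuilding counts from a percentage table, assuming you know the sample size behind each row. The device labels, percentages, and sample sizes are hypothetical. Published percentages are often rounded, so treat rebuilt counts as approximations and check that they add back up.

```python
# Hypothetical row-percentage table plus the sample size behind each row.
row_percentages = {
    "Mobile":  {"Converted": 12.5, "Not converted": 87.5},
    "Desktop": {"Converted": 20.0, "Not converted": 80.0},
}
row_sample_sizes = {"Mobile": 400, "Desktop": 250}

# Rebuild each cell count as percentage * row sample size, rounded to a whole case.
counts = {
    row: {
        col: round(pct / 100 * row_sample_sizes[row])
        for col, pct in cols.items()
    }
    for row, cols in row_percentages.items()
}

# Rounding can make a row drift from its known total; flag it rather than ignore it.
drifted_rows = [
    row for row, cols in counts.items()
    if sum(cols.values()) != row_sample_sizes[row]
]

print(counts)
print(drifted_rows)
```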
Independence Needs To Be Real
Each row in your dataset should represent a separate, unrelated observation. If one person appears twice, or if repeated measures show up in the same table, the observations are no longer independent and the p-value can’t be taken at face value.
Expected Counts Can’t Be Too Tiny
The chi-square approximation is known to get shaky when expected counts are small. SciPy’s documentation calls out the often-cited “at least 5” guideline for expected frequencies: scipy.stats.chi2_contingency.
So what do you do if you see small expected counts? You’ve got a few options: combine rare categories, collect more data, or switch to an exact or simulation-based method, depending on your tooling and table size.
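One way to spot the problem before running the test is to compute expected counts on their own and flag the small ones. This sketch uses `scipy.stats.contingency.expected_freq`; the table is invented, and the threshold of 5 is the guideline mentioned above, not a hard rule.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 3 x 2 table with one low-volume row.
observed = np.array([
    [120, 30],
    [ 80, 20],
    [  6,  4],
])

expected = expected_freq(observed)

# Row/column indices of every cell whose expected count falls below the guideline.
small_cells = np.argwhere(expected < 5)

print(np.round(expected, 2))
print(small_cells)
```

If `small_cells` comes back non-empty, that’s the cue to merge categories, gather more data, or move to an exact or simulation-based method.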
How To Set Up Your Table So The Result Means Something
Most messy results come from messy tables. A little prep saves you from chasing ghosts.
Pick Categories That Match The Question
Start with a plain-language question. “Is device type linked to conversion?” is clear. “Is device linked to everything?” isn’t. Build the table around the one relationship you want to check.
Keep Categories Mutually Exclusive
Each observation should land in exactly one row category and one column category. If categories overlap, counts become hard to trust.
Watch Sparse Tail Categories
Long tails (dozens of rare categories) inflate empty or tiny cells. If those rare levels aren’t central to your decision, merge them into an “Other” bucket that you can explain.
If you do merge, do it with rules you can defend, like “combine levels with fewer than 10 observations” or “merge tiers that mean the same thing operationally.”
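A merge rule like that can be written down as code so it’s applied the same way every time. This is a sketch under assumptions: the region labels and counts are hypothetical, and the cutoff of 10 simply mirrors the example rule above.

```python
# Hypothetical counts: region (rows) by [signed up, did not sign up].
table = {
    "North":   [120, 380],
    "South":   [ 95, 305],
    "East":    [  4,   3],
    "West":    [  2,   6],
    "Islands": [  1,   2],
}

MIN_ROW_TOTAL = 10  # assumed rule: merge levels with fewer than 10 observations

merged = {}
other = [0, 0]
for label, counts in table.items():
    if sum(counts) < MIN_ROW_TOTAL:
        # Fold the rare level into the running "Other" totals, column by column.
        other = [a + b for a, b in zip(other, counts)]
    else:
        merged[label] = counts

if sum(other) > 0:
    merged["Other"] = other

print(merged)
```

Deciding the rule before looking at the outcome column keeps the merge from turning into a way to chase a result.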
| Common Use Case | Typical Table Shape | Notes That Keep You Out Of Trouble |
|---|---|---|
| Plan Tier × Churn (Yes/No) | 3 × 2 | Check expected counts in each tier; low-volume tiers may need merging. |
| Device Type × Conversion (Convert/No) | 4 × 2 | Make sure each session or user is counted once; repeated sessions per user can distort results. |
| Region × Product Category Purchased | 5 × 6 | Sparse categories appear fast; “Other” buckets can clean up tail levels. |
| Ad Creative × Click Outcome | k × 2 | With many creatives, run a screen first to remove near-zero spend variants. |
| Support Channel × Issue Type | 4 × 7 | Ambiguous labeling is a silent killer; audit your tagging rules before testing. |
| Store Format × Refund Reason | 3 × 8 | Small expected counts are common in rare reasons; merge similar reasons when it matches policy. |
| Segment × Survey Response (Likert) | m × 5 | Likert bins can be merged (1–2, 3, 4–5) when you need sturdier expected counts. |
| Test Group × Outcome Category (Multi-class) | 2 × c | If you planned the bins after seeing outcomes, note the bias risk in your write-up. |
Chi-square Contingency Test Results With Fewer Misreads
The output from most software includes the chi-square statistic, degrees of freedom, and a p-value. Some tools also give expected counts, residuals, or an effect size.
Here’s the clean mental model: the p-value answers “Would a table this far from expectation be rare if there were no link?” A small p-value points toward a link. A large p-value points toward “no clear evidence of a link” in this sample.
What A P-value Can’t Tell You
A p-value won’t tell you how big the relationship is. With huge sample sizes, tiny differences can yield tiny p-values. With small sample sizes, large differences can go undetected.
So you should pair the p-value with a plain-language check: which cells differ most from what you’d expect? Then add an effect size that matches your table, like Cramér’s V, when you need a single number summary.
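Cramér’s V can be computed from the chi-square statistic as the square root of χ² ÷ (N × (k − 1)), where k is the smaller of the row and column counts. Here’s a minimal sketch with a made-up table; the uncorrected statistic is used so the continuity correction doesn’t leak into the effect size.

```python
import math
from scipy.stats import chi2_contingency

# Hypothetical 3 x 2 table of counts.
observed = [
    [55, 145],
    [50, 250],
    [15,  85],
]

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

n = sum(sum(row) for row in observed)
k = min(len(observed), len(observed[0]))  # smaller table dimension

cramers_v = math.sqrt(chi2_stat / (n * (k - 1)))

print(round(cramers_v, 4))
```

V runs from 0 to 1, so a small p-value paired with a V near zero is the “tiny difference, large N” case described above.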
Residuals Point To The Cells Driving The Signal
Once the test says “there’s a link,” the next practical question is “where?” The gap between observed and expected counts, cell by cell, gives you that answer.
Many stats packages provide standardized residuals. Even when they don’t, you can still read the biggest gaps by scanning observed vs. expected counts.
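If your tool doesn’t report residuals, they’re quick to compute. The sketch below shows two common versions on an invented table: Pearson residuals, (observed − expected) ÷ √expected, and adjusted standardized residuals, which also account for the row and column proportions. Which one a given package calls “standardized” varies, so check its documentation.

```python
import numpy as np

# Hypothetical 3 x 2 table of counts.
observed = np.array([
    [55, 145],
    [50, 250],
    [15,  85],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Pearson residuals: their squares add up to the chi-square statistic.
pearson = (observed - expected) / np.sqrt(expected)

# Adjusted standardized residuals: roughly comparable to z-scores under "no link".
adjusted = (observed - expected) / np.sqrt(
    expected * (1 - row_totals / n) * (1 - col_totals / n)
)

print(np.round(pearson, 2))
print(np.round(adjusted, 2))
```

Cells with the largest absolute residuals are the ones to name when you report where the link shows up.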
Yates Correction And Small 2 × 2 Tables
For a 2 × 2 table, some tools apply a continuity correction (often called Yates correction) by default. It tends to be conservative, meaning it pushes p-values upward. R’s documentation notes that its chisq.test function has a correct argument that controls this behavior: R: chisq.test.
If you’re working with small counts in a 2 × 2 setup, an exact test may be a better fit than leaning on corrections.
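To see the difference on one table, here’s a sketch in Python using SciPy’s `chi2_contingency` with its `correction` parameter switched on and off, plus `fisher_exact` as the exact alternative. The 2 × 2 counts are hypothetical.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical small 2 x 2 table: treatment group (rows) by outcome (columns).
observed = [
    [12,  5],
    [ 6, 14],
]

# With the Yates continuity correction.
stat_yates, p_yates, dof, _ = chi2_contingency(observed, correction=True)

# Without the correction.
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

# Fisher's exact test, which does not rely on the chi-square approximation.
odds_ratio, p_exact = fisher_exact(observed)

print(f"Yates:    chi2 = {stat_yates:.3f}, p = {p_yates:.4f}")
print(f"No Yates: chi2 = {stat_plain:.3f}, p = {p_plain:.4f}")
print(f"Fisher:   p = {p_exact:.4f}")
```

When the three p-values land on different sides of your threshold, that’s a sign the table is too small to lean on the approximation.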
| Output Piece | What It Tells You | Gotcha To Watch |
|---|---|---|
| Chi-square statistic (χ²) | Overall gap between observed and expected counts across the table | It grows with sample size; don’t treat it as a “strength” score. |
| Degrees of freedom | Number of cells free to vary once row and column totals are fixed: (r − 1) × (c − 1) | Double-check row/column counts after merges; df changes. |
| P-value | How surprising the gaps would be under “no link” | A small p-value can come from tiny differences with large N. |
| Expected counts | Predicted cell counts under “no link” | Small expected counts can make the approximation shaky. |
| Cell gaps (Observed − Expected) | Which categories sit above or below expectation | Raw gaps favor big-margin categories; scan proportions too. |
| Effect size (Cramér’s V) | Single-number sense of relationship strength | Still needs context; “big” depends on domain and stakes. |
How To Run It In Common Tools Without Guesswork
You can compute this test by hand for a small table, but software keeps it clean and repeatable. The main win is access to expected counts and diagnostic outputs without transcription errors.
Python With SciPy
SciPy’s chi2_contingency takes a table of observed counts and returns the statistic, p-value, degrees of freedom, and expected counts. The official reference spells out parameters, including the continuity correction setting and methods that can use resampling: SciPy chi2_contingency docs.
Two practical reminders:
- Feed it counts, not rates.
- Keep an eye on expected counts if you’ve got a wide table with a long tail.
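A minimal run looks something like this. The traffic-source table is invented for illustration; the four returned values are the ones described above.

```python
from scipy.stats import chi2_contingency

# Hypothetical 3 x 2 table: traffic source (rows) by signed up / did not (columns).
observed = [
    [90, 410],   # Search
    [60, 240],   # Social
    [50, 150],   # Email
]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2_stat:.3f}, df = {dof}, p = {p_value:.4f}")
print(expected)

# Quick diagnostic: smallest expected count in the table.
print(f"smallest expected count: {expected.min():.1f}")
```

In this made-up example the p-value sits above 0.05, so the write-up would say “no clear evidence of a link” rather than “no link.”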
R With chisq.test
In R, chisq.test runs the test and can also return expected counts. The manual page shows the full argument list, including the continuity correction and simulation options for p-values: R stats package reference.
If your expected counts are low, simulation-based p-values can be a sensible fallback when an exact approach is heavy for your table size.
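To show the idea behind a simulation-based p-value, here’s a Python sketch of a permutation approach: shuffle the column labels across observations, rebuild the table, and count how often the shuffled statistic is at least as large as the observed one. This is not R’s implementation; the function names, the 2,000 resamples, and the small table are all assumptions made for illustration.

```python
import numpy as np

def chi2_statistic(table):
    """Pearson chi-square statistic for a table of counts."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def permutation_p_value(observed, n_resamples=2000, seed=0):
    """Estimate the p-value by shuffling column labels (margins stay fixed)."""
    observed = np.asarray(observed)
    n_rows, n_cols = observed.shape

    # Expand the table into one (row label, column label) pair per observation.
    row_labels = np.repeat(np.arange(n_rows), observed.sum(axis=1))
    col_labels = np.concatenate(
        [np.repeat(np.arange(n_cols), row) for row in observed]
    )

    rng = np.random.default_rng(seed)
    observed_stat = chi2_statistic(observed)
    hits = 0
    for _ in range(n_resamples):
        shuffled = rng.permutation(col_labels)
        table = np.zeros((n_rows, n_cols))
        np.add.at(table, (row_labels, shuffled), 1)
        # Small tolerance so ties are not lost to floating-point rounding.
        if chi2_statistic(table) >= observed_stat - 1e-12:
            hits += 1

    # Add-one adjustment keeps the estimate away from exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical small 3 x 2 table.
observed = [[8, 2], [3, 7], [4, 6]]
p_sim = permutation_p_value(observed)
print(round(p_sim, 3))
```

On small tables like this one, the simulated p-value can differ noticeably from the chi-square approximation, which is the reason to reach for it.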
NIST Dataplot Reference If You Want A Formal Definition
If you need a standards-leaning description for documentation, NIST’s Dataplot reference describes a chi-square test of independence for two-way tables: NIST Dataplot chi-square independence test.
What To Write In Your Report So Readers Trust The Call
A good write-up is short and concrete. It tells readers what table was tested, what the null claim was, and what the test output said. Then it points to the parts of the table that matter for the decision.
Include The Table Definition
Name the row variable and column variable, list the category levels, and state the sample size. If you merged categories, note the rule used to merge. That keeps readers from thinking categories were shuffled to chase a desired outcome.
State The Null In Plain Language
Use wording like: “The distribution across columns is the same for all rows” or “The row and column variables aren’t linked.” That’s clearer than jargon-heavy phrasing.
Report The Core Numbers And One Plain Sentence
Most readers want one sentence they can repeat in a meeting. Something like: “The table shows a detectable link between X and Y in this sample (χ², df, p).” Then add the next sentence that matters even more: “The largest gaps show up in cells A and B.”
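If you produce these write-ups often, a tiny helper keeps the wording consistent. This is only a sketch: the function name, the 0.05 threshold, and the exact phrasing are assumptions, and the numbers passed in are placeholders.

```python
def report_sentence(x_name, y_name, chi2_stat, dof, p_value, n, alpha=0.05):
    """Build a one-line summary of a chi-square contingency test."""
    verdict = (
        "a detectable link" if p_value < alpha else "no clear evidence of a link"
    )
    # Very small p-values are reported as a bound rather than a string of zeros.
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (
        f"The table shows {verdict} between {x_name} and {y_name} in this sample "
        f"(χ²({dof}, N = {n}) = {chi2_stat:.2f}, {p_text})."
    )

# Placeholder values for illustration.
sentence = report_sentence("plan tier", "churn", 10.677, 2, 0.0048, 600)
print(sentence)
```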
Show Practical Size With Percentages
After the test, add row-wise or column-wise percentages so the difference is readable. Keep the test on counts, but let humans read the pattern with percentages.
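A row-wise percentage view takes a couple of lines with NumPy. The table and labels here are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 2 table: plan tier (rows) by churned / stayed (columns).
observed = np.array([
    [55, 145],
    [50, 250],
    [15,  85],
])
row_labels = ["Basic", "Pro", "Enterprise"]
col_labels = ["Churned", "Stayed"]

# Divide each row by its own total so every row sums to 100%.
row_pct = observed / observed.sum(axis=1, keepdims=True) * 100

for label, row in zip(row_labels, row_pct):
    cells = ", ".join(f"{col}: {pct:.1f}%" for col, pct in zip(col_labels, row))
    print(f"{label:<11}{cells}")
```

The test itself still runs on `observed`; `row_pct` is only for the reader.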
Common Mistakes That Make A Clean Table Lie
Most “bad chi-square” stories come from a few repeat offenders. If you watch these, you’ll dodge a lot of grief.
Mixing People And Events In One Table
If one user can contribute multiple rows, counts inflate and the independence assumption gets shaky. Fix it by picking a unit: per user, per session, per order. Then build the table from that unit only.
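Here’s a sketch of collapsing an event log to one row per user before counting cells. The log, the field layout, and the collapse rule (first device seen, converted if any session converted) are all assumptions; your own rule should match the question being asked.

```python
from collections import Counter

# Hypothetical event log: (user_id, device, converted), with repeat sessions.
events = [
    ("u1", "mobile", False),
    ("u1", "mobile", True),
    ("u2", "desktop", False),
    ("u3", "mobile", False),
    ("u3", "mobile", False),
    ("u4", "desktop", True),
]

# Assumed unit: one row per user.
# Device = first device seen; converted = True if any session converted.
per_user = {}
for user_id, device, converted in events:
    first_device, ever_converted = per_user.get(user_id, (device, False))
    per_user[user_id] = (first_device, ever_converted or converted)

# Each user now contributes exactly one (device, converted) cell.
cell_counts = Counter(per_user.values())

print(len(events), "events ->", sum(cell_counts.values()), "users")
print(dict(cell_counts))
```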
Testing After Heavy Filtering
If you filter down to a tiny slice (like “only users who clicked twice”), your table may turn sparse. Sparse tables lead to small expected counts, which weakens the approximation.
Letting “Unknown” Swallow Real Categories
“Unknown” is a category like any other. If it’s large, it can dominate the pattern and mask what you care about. Treat missingness as its own story: log it, track it, and decide if it belongs in the table for this decision.
Running Many Tests And Celebrating One Small P-value
If you test dozens of tables, one small p-value can pop up by chance. If you’re screening many combinations, write down how many tests you ran and use a correction method or a holdout check before you act on a single result.
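As one possible correction, here’s a sketch of the Bonferroni and Holm adjustments applied to a batch of p-values. These two methods are my choice of example, not the only options, and the table names and p-values are made up.

```python
# Hypothetical p-values from screening several tables.
p_values = {
    "device x conversion": 0.004,
    "region x conversion": 0.030,
    "plan x churn": 0.012,
    "channel x issue": 0.200,
    "segment x survey": 0.049,
}
alpha = 0.05
m = len(p_values)

# Bonferroni: every test must clear alpha / m.
bonferroni_pass = [name for name, p in p_values.items() if p < alpha / m]

# Holm: sort ascending, compare the k-th smallest p-value (k from 0)
# to alpha / (m - k), and stop at the first one that fails.
holm_pass = []
ranked = sorted(p_values.items(), key=lambda item: item[1])
for k, (name, p) in enumerate(ranked):
    if p >= alpha / (m - k):
        break
    holm_pass.append(name)

print("Uncorrected:", [name for name, p in p_values.items() if p < alpha])
print("Bonferroni: ", bonferroni_pass)
print("Holm:       ", holm_pass)
```

Four of the five tables clear an uncorrected 0.05 threshold here, but fewer survive either correction, which is the point of counting your tests.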
Quick Checklist Before You Hit Run
- Counts only, no percentages.
- Each observation lands in one cell.
- No duplicated subjects unless you’ve built a per-subject table.
- No cell has a tiny expected count (the common guideline is at least 5).
- Percent view ready for humans after the test runs.
- Plan to report the cells that drive the gaps, not just the p-value.
Run the test, read expected counts, then read the pattern. If the output and the pattern agree, you’re in a strong spot. If they clash, your table design or assumptions need a second pass.
References & Sources
- Penn State Eberly College of Science (STAT 500). “Chi-Square Test for Independence – Lesson 8.” Walks through expected counts, test setup, and interpretation for independence in contingency tables.
- National Institute of Standards and Technology (NIST/SEMATECH). “Chi-Square Distribution.” Background on the chi-square distribution used to obtain p-values for chi-square tests.
- National Institute of Standards and Technology (NIST/SEMATECH). “Critical Values of the Chi-Square Distribution.” Reference table and guidance for critical values tied to chi-square degrees of freedom.
- SciPy Documentation. “scipy.stats.chi2_contingency.” Defines the chi-square contingency test in Python, including outputs, assumptions, and options.
- The R Project (R Manual). “chisq.test: Pearson’s Chi-squared Test for Count Data.” Official function reference for chi-square tests in R, including continuity correction and simulation settings.
- NIST Dataplot Reference Manual. “CHI-SQUARE INDEPENDENCE TEST (LET).” Formal definition of a chi-square independence test for two-way contingency tables.