Chi-Square Calculator Explained: The Formula, the p-value, and the Common Mistakes

Pearson chi-square is the standard test for asking whether a set of category counts matches a hypothesised distribution. This guide unpacks the formula, walks through Mendel’s pea data as a worked example, lays out when the chi-square approximation is reliable, and is direct about the mistakes that tank chi-square results in practice.


What the chi-square test actually does

A Pearson chi-square test asks a single, narrow question: do my observed category counts deviate from a hypothesised distribution by more than chance would predict? Nothing more and nothing less. The chi-square calculator on this page runs the goodness-of-fit variant of that test on a single set of counts, returns the chi-square statistic, the degrees of freedom, and the right-tailed p-value, and shows you exactly how much each cell contributed to the result.

The test was published by Karl Pearson in 1900 and is one of the oldest formal statistical tests still in everyday use. That longevity is not an accident. The maths is simple, the assumptions are easy to check, and the answer it gives — a single number you can compare against a critical value — fits neatly inside almost any reporting pipeline. It is the default first move for any researcher staring at a frequency table.

Two flavours show up in textbooks. The goodness-of-fit test, which this calculator runs, compares one categorical distribution against a hypothesised one. The test of independence compares two categorical variables in a contingency table to see whether they are associated. The underlying chi-square statistic is identical; only the degrees-of-freedom rule and the source of the expected counts differ.

The formula in plain English

For each of k categories you have an observed count Oᵢ and an expected count Eᵢ. The expected count is what you would see in that category if the null hypothesis were exactly true. The chi-square statistic is then:

χ² = Σᵢ (Oᵢ − Eᵢ)² ÷ Eᵢ

Read it cell by cell. Take the gap between observed and expected, square it so that overshoots and undershoots contribute equally, then scale by the expected count. The scaling matters: a deviation of five counts is enormous in a cell where you expected ten, and trivial in a cell where you expected ten thousand. Dividing by the expected count puts every cell on the same footing.
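To make the scaling point concrete, here is a minimal Python sketch of a single cell's contribution. The function name `cell_contribution` is an illustrative choice, not something the calculator exposes; the two calls reproduce the five-count gap described above, once in a cell expecting ten and once in a cell expecting ten thousand.

```python
def cell_contribution(observed: float, expected: float) -> float:
    """Contribution of one cell to Pearson's chi-square: (O - E)^2 / E."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) ** 2 / expected


# The same five-count gap, first in a small cell and then in a large one.
small_cell = cell_contribution(15, 10)          # 25 / 10
large_cell = cell_contribution(10_005, 10_000)  # 25 / 10000

print(small_cell, large_cell)
```

The squared gap is 25 in both cases; only the divisor changes, which is why the small cell contributes a thousand times more.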

Sum across all cells and you have a single number that summarises how far the observations stray from the hypothesised distribution. Larger χ² means a worse fit; χ² of exactly zero would mean every observed count landed on its expected value to the decimal place. Under the null hypothesis, with the expected vector fully specified ahead of time, χ² follows a chi-square distribution with df = k − 1 degrees of freedom. The p-value is the right tail of that distribution at the observed χ²:

p = P(χ²_{df} ≥ observed χ²)

A small p-value means the data would be surprising under the null, so the null is rejected. A large p-value means the data are consistent with the null — note that this is not positive evidence for the null, only an absence of evidence against it. The chi-square calculator does the integral for you using a regularised incomplete gamma function, which is accurate to machine precision across the full tail.

Worked example: Mendel’s peas

Gregor Mendel’s paper on hybrid peas, presented in 1865 and published in 1866, contains the single most-quoted goodness-of-fit dataset in statistics. Mendel crossed plants differing in two pairs of traits (round versus wrinkled seeds, yellow versus green cotyledons) and counted 556 seeds in the F2 generation. His hypothesised 9 : 3 : 3 : 1 ratio for the four phenotype classes (round-yellow, round-green, wrinkled-yellow, wrinkled-green) predicted the following expected counts under the null:

  • Round-yellow: 556 × 9/16 = 312.75
  • Round-green: 556 × 3/16 = 104.25
  • Wrinkled-yellow: 556 × 3/16 = 104.25
  • Wrinkled-green: 556 × 1/16 = 34.75

Mendel observed 315, 108, 101, and 32 respectively. Plugging those into the chi-square calculator with the ratio “9 3 3 1” in the expected box — the calculator rescales ratios automatically — produces:

  • (315 − 312.75)² ÷ 312.75 = 0.0162
  • (108 − 104.25)² ÷ 104.25 = 0.1349
  • (101 − 104.25)² ÷ 104.25 = 0.1013
  • (32 − 34.75)² ÷ 34.75 = 0.2176

Total: χ² ≈ 0.470 on df = 3, giving p ≈ 0.925. That p-value is almost embarrassingly large. Mendel’s observations land so close to the 9 : 3 : 3 : 1 prediction that R. A. Fisher famously argued, in a 1936 paper, that Mendel’s results taken as a whole fit suspiciously well: more likely the work of an assistant who knew what outcome was expected than deliberate fraud, but too clean to be raw chance. Either way, the test does its job: it cannot rule out the hypothesis.

Contrast that with a different dataset. Suppose you rolled a four-sided die 100 times and observed counts of (30, 10, 20, 40). Against a uniform expected vector of (25, 25, 25, 25), with every face equally likely, the per-cell contributions are 1 + 9 + 1 + 9 = 20. So χ² = 20 on df = 3, with p ≈ 0.00017. The null hypothesis of a fair die is rejected at any conventional significance level. Run both examples in the chi-square calculator to see how the per-cell breakdown lines up.
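Both worked examples can be checked with a few lines of Python. This is a sketch rather than the calculator's own code: the helper names are made up for illustration, and `p_value_df3` uses a closed form for the right tail that holds only when df = 3, which happens to be the case for both datasets.

```python
import math


def expected_from_ratio(ratio, total):
    """Rescale a ratio such as 9:3:3:1 so that it sums to the observed total."""
    scale = total / sum(ratio)
    return [r * scale for r in ratio]


def chi_square(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df3(x):
    """Right-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


mendel_obs = [315, 108, 101, 32]
mendel_exp = expected_from_ratio([9, 3, 3, 1], sum(mendel_obs))
mendel_chi2 = chi_square(mendel_obs, mendel_exp)

die_obs = [30, 10, 20, 40]
die_exp = expected_from_ratio([1, 1, 1, 1], sum(die_obs))
die_chi2 = chi_square(die_obs, die_exp)

print(f"Mendel: chi2 = {mendel_chi2:.3f}, p = {p_value_df3(mendel_chi2):.3f}")
print(f"Die:    chi2 = {die_chi2:.1f}, p = {p_value_df3(die_chi2):.5f}")
```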

Factors that affect the chi-square statistic

Sample size

The chi-square statistic scales linearly with sample size when the observed and expected proportions are held fixed: if you double every count in the observed and expected vectors, χ² doubles. A small deviation from expected becomes statistically significant once the sample is large enough, even when the practical effect is tiny. This is one of the most common misreadings of chi-square results: a significant p-value with N = 1,000,000 may reflect a deviation of half a percentage point that nobody cares about. Always compare the observed and expected shares, not just the p-value, to judge whether the rejection is meaningful.
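A quick way to see the scaling is to double the die example from earlier and recompute. The sketch below assumes nothing beyond the formula itself.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [30, 10, 20, 40]   # four-sided die example from the text
expected = [25, 25, 25, 25]

base = chi_square(observed, expected)
doubled = chi_square([2 * o for o in observed], [2 * e for e in expected])

# The observed shares are identical in both runs; only N has changed.
print(base, doubled)
```

The proportions 30 : 10 : 20 : 40 are untouched, yet the statistic is twice as large, so the p-value shrinks purely because N grew.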

Number of categories

More categories spread the same total over more cells, increasing degrees of freedom and shifting the chi-square null distribution to the right. A χ² of 10 is highly significant on df = 1 (p ≈ 0.0016) but unremarkable on df = 20 (p ≈ 0.97). The shape of the null distribution adjusts because more categories give the data more ways to deviate from expected without truly differing in pattern.
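The two p-values quoted above can be reproduced without any statistics library, because the right tail of a chi-square distribution has a finite closed form whenever df is a whole number. The function name `chi2_sf` is an illustrative choice, and this is a sketch for moderate df, not a claim about how the calculator works internally.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """P(chi-square_df >= x) for integer df >= 1, using finite closed-form sums."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
        term, total = 1.0, 0.0
        for j in range(df // 2):
            total += term
            term *= half / (j + 1)
        return math.exp(-half) * total
    # Odd df = 2m + 1:
    # erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{m} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += half ** (j - 0.5) / math.gamma(j + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


for df in (1, 3, 20):
    print(f"chi2 = 10, df = {df:2d}: p = {chi2_sf(10, df):.4f}")
```

The same χ² of 10 moves from clearly significant to unremarkable as df rises, which is the shift in the null distribution described above.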

Cell sparseness

The chi-square distribution is the asymptotic null distribution as expected counts grow large. With sparse cells the approximation degrades. The standard rule of thumb is that every expected count should be at least 5; a relaxed version, usually attributed to William Cochran (1954), allows up to 20 percent of expected counts below 5, provided none are below 1. With sparser data the asymptotic p-value can be biased, sometimes too liberal and sometimes too conservative depending on the configuration.
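Both versions of the rule are easy to check before running the test. The sketch below is one way to encode them; the function name, the returned keys, and the second set of expected counts are all hypothetical.

```python
def check_expected_counts(expected):
    """Apply the 'all >= 5' rule and the relaxed '20 percent below 5, none below 1' rule."""
    n_cells = len(expected)
    below_5 = sum(1 for e in expected if e < 5)
    below_1 = sum(1 for e in expected if e < 1)
    return {
        "all_at_least_5": below_5 == 0,
        "relaxed_rule_ok": below_5 / n_cells <= 0.20 and below_1 == 0,
        "share_below_5": below_5 / n_cells,
    }


mendel = check_expected_counts([312.75, 104.25, 104.25, 34.75])
sparse = check_expected_counts([40, 30, 20, 6, 4])  # hypothetical expected counts

print(mendel)
print(sparse)
```

The hypothetical second vector fails the strict rule but passes the relaxed one, since exactly one cell in five sits below 5 and none is below 1.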

Whether expected counts are estimated from the data

If you estimate m parameters of the expected distribution from the same data you are testing — fitting a Poisson mean to the counts to test Poisson goodness-of-fit, for example — the degrees of freedom drop to k − 1 − m. The calculator assumes a fully specified expected vector and reports df = k − 1; if you estimated parameters first you will need to subtract them when looking up the p-value yourself.
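As an illustration of the adjustment, the sketch below fits a Poisson mean to some made-up counts and then tests the fit. Everything about the data is hypothetical, and treating the pooled "3 or more" cell as exactly 3 when estimating the mean is a simplification made only to keep the example short.

```python
import math

# Hypothetical data: how many intervals contained 0, 1, 2 events,
# with the last cell pooling "3 or more".
observed = [35, 40, 17, 8]
n = sum(observed)

# One parameter (the Poisson mean) is estimated from the same data, so m = 1.
# Simplifying assumption: the pooled last cell is counted as exactly 3 events.
mean = sum(value * count for value, count in enumerate(observed)) / n

# Poisson probabilities for 0, 1, 2, then the remaining tail mass for "3 or more".
probs = [math.exp(-mean) * mean ** v / math.factorial(v) for v in range(len(observed) - 1)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

k, m = len(observed), 1
df = k - 1 - m   # 4 cells, minus 1, minus 1 estimated parameter

print(f"mean = {mean:.2f}, chi2 = {chi2:.3f}, df = {df}")
```

A calculator that assumes a fully specified expected vector would report df = 3 for these four cells; the p-value should instead be looked up on df = 2.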

Multiple comparisons

Running many chi-square tests across many subgroups will produce false positives at the rate of the chosen alpha. A single test at α = 0.05 has a 5 percent false-positive rate; twenty independent tests have a 64 percent chance of at least one false positive. Use a Bonferroni or Benjamini-Hochberg correction when running families of tests, and report the number of tests run honestly.
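The 64 percent figure and the two corrections can be sketched in a few lines of Python. The function names are illustrative, and the list of p-values at the end is invented purely to show the Benjamini-Hochberg step-up rule in action.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error at or below alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected at false-discovery rate q (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank   # remember the largest rank that passes
    return sorted(order[:cutoff])


print(f"{familywise_error(0.05, 20):.3f}")
print(bonferroni_threshold(0.05, 20))

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]   # hypothetical results of five tests
print(benjamini_hochberg(p_values))
```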

How to interpret the result

  • Look at the p-value, but never alone. A p-value answers a narrow question — how surprising the data are under the null. It does not measure effect size, practical importance, or the probability that the null is true. Always pair it with the per-cell contributions.
  • Inspect the cell breakdown. The chi-square calculator returns (Oᵢ − Eᵢ)² ÷ Eᵢ for every cell. Cells with the largest contributions are driving the result — they tell you which categories diverge from expected, not just that the overall fit is poor.
  • Check the expected counts before celebrating. If any expected count is below 5 the asymptotic p-value should be treated with caution. Pool adjacent rare categories, run an exact multinomial test, or use a G-test (likelihood-ratio variant) instead.
  • Report the statistic, df, and p-value together. A chi-square result is typically written χ²(df) = statistic, p = value — for the Mendel example, χ²(3) = 0.47, p = 0.93. Reporting only the p-value strips the context a reader needs.
  • Consider effect size. For a 2 × 2 contingency table, the φ coefficient, √(χ²/N), is the absolute value of the Pearson correlation between the two binary variables and lies in the [0, 1] range; for larger tables, Cramér’s V is the standard. Effect sizes do not grow with sample size the way χ² does.
  • Pair with a confidence interval. Running the confidence interval calculator on each category’s proportion gives you the plausible range for that share, which is often more informative for decisions than the global p-value.
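The reporting format and the two effect sizes mentioned in the list can be sketched as follows. The function names are illustrative, and the χ² of 4.0 on N = 100 is a hypothetical 2 × 2 result used only to show that Cramér’s V reduces to φ in that case.

```python
import math


def phi_coefficient(chi2, n):
    """Effect size for a 2 x 2 table: sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)


def cramers_v(chi2, n, n_rows, n_cols):
    """Effect size for an r x c table: sqrt(chi2 / (N * min(r - 1, c - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


def format_result(chi2, df, p):
    """Conventional one-line report: statistic, df, and p-value together."""
    return f"χ²({df}) = {chi2:.2f}, p = {p:.2f}"


print(format_result(0.470, 3, 0.9254))       # the Mendel example
print(phi_coefficient(4.0, 100))             # hypothetical 2 x 2 result
print(cramers_v(4.0, 100, n_rows=2, n_cols=2))
```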

Common mistakes

Treating a non-significant result as evidence the null is true

Failing to reject the null does not prove the null. A high p-value can come from a true null, a true alternative with small effect, or a small sample lacking the power to detect a real difference. If you need to argue that observations and expectations are equivalent, you need an equivalence test, not a non-rejection from chi-square.

Using chi-square on proportions or percentages

The chi-square statistic is defined on counts, not percentages. If you have proportions, multiply them by the sample size first to recover the underlying counts. Running chi-square on percentages directly will produce a meaningless statistic — the test is sensitive to the absolute number of observations, which percentages have thrown away.
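The sketch below shows what goes wrong. It takes the shares from the die example, assumes a hypothetical sample of N = 400, and computes the statistic three ways: on recovered counts, on raw proportions, and on percentages.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def counts_from_proportions(proportions, n):
    """Recover integer counts from shares; rounding absorbs floating-point noise."""
    return [round(p * n) for p in proportions]


observed_shares = [0.30, 0.10, 0.20, 0.40]
null_shares = [0.25, 0.25, 0.25, 0.25]
n = 400   # hypothetical sample size

on_counts = chi_square(counts_from_proportions(observed_shares, n),
                       [s * n for s in null_shares])
on_proportions = chi_square(observed_shares, null_shares)
on_percentages = chi_square([100 * s for s in observed_shares],
                            [100 * s for s in null_shares])

print(on_counts, on_proportions, on_percentages)
```

Only the first number is a chi-square statistic. The other two are what the formula returns when the sample size has been silently replaced by 1 or by 100.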

Forgetting the independence assumption

Pearson chi-square assumes that each observation is independent. Repeated measures on the same subjects, paired before-and-after data, or clustered survey responses all break that assumption. For paired binary data use McNemar’s test; for clustered data use a Rao-Scott-corrected chi-square or a generalised estimating equation. The vanilla goodness-of-fit test will overstate significance with non-independent observations.

Ignoring zero or near-zero expected cells

An expected count of exactly zero makes the statistic undefined, because the formula divides by it. The calculator will refuse to compute χ² in that case. An expected count of 0.5 will not raise an error but will produce a wildly inflated contribution as soon as the observed count reaches two or three: an observed 3 against an expected 0.5 contributes 12.5 on its own. Pool sparse categories before testing, or switch to an exact multinomial test (Fisher’s exact test for contingency tables).
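Pooling can be sketched as below. The function name, the threshold default, and the example counts are all hypothetical, and the sketch merges every sparse cell into one bucket without checking whether those categories belong together, which is a judgement the analyst still has to make.

```python
def pool_sparse_cells(observed, expected, min_expected=5.0):
    """Merge all cells whose expected count is below min_expected into one pooled cell."""
    kept_obs, kept_exp = [], []
    pooled_obs, pooled_exp = 0.0, 0.0
    for o, e in zip(observed, expected):
        if e < min_expected:
            pooled_obs += o
            pooled_exp += e
        else:
            kept_obs.append(o)
            kept_exp.append(e)
    if pooled_obs > 0 or pooled_exp > 0:
        # The pooled cell can itself still be sparse; check it before testing.
        kept_obs.append(pooled_obs)
        kept_exp.append(pooled_exp)
    return kept_obs, kept_exp


observed = [48, 35, 12, 3, 2]         # hypothetical counts
expected = [50, 30, 14, 4.5, 1.5]     # hypothetical expected counts

pooled_observed, pooled_expected = pool_sparse_cells(observed, expected)
print(pooled_observed, pooled_expected)
```

Pooling reduces the number of cells, so the degrees of freedom fall with it: five cells become four here, and df drops from 4 to 3.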

When to seek professional advice

A chi-square test is a standard, well-understood tool and most users can apply it without a statistician’s help. There are a few situations where talking to one is worth the time. Anything with clustered or repeated-measures data needs a corrected test. Anything where the consequences of a wrong call are serious — regulatory submissions, clinical trials, public-policy decisions — deserves a second opinion on the choice of test and on the multiple-comparisons handling. And if you are estimating parameters of the expected distribution from the data you are testing on, the degrees-of-freedom adjustment is easy to get wrong; that is worth a sanity check.

Frequently asked questions

Is chi-square the same as a t-test?

No. A t-test compares means of continuous data; a chi-square test compares frequencies in categorical data. The two questions are unrelated. If you have two groups and a continuous outcome, use a t-test; if you have one or two categorical variables and want to compare counts to a distribution or to each other, use chi-square. There is no chi-square test for a single mean.

Why is the chi-square test always right-tailed?

Because the statistic squares every deviation, both overshoots and undershoots contribute positively. Under the null, χ² should sit near its expected value of df; the further the observations stray in either direction, the larger χ² becomes. There is no “negative chi-square” to indicate the opposite direction of effect, so the test is one-sided by construction. A χ² value close to zero is not a separate kind of evidence; if anything it is suspicious, suggesting the data fit too cleanly.

How do I run a chi-square test of independence in this calculator?

The chi-square calculator runs the goodness-of-fit variant on a single vector of counts. To run a test of independence on a 2 × 2 or larger contingency table, flatten the table into a single vector of observed counts, compute the expected counts by hand using Eᵢⱼ = (row total × column total) / N for each cell, paste the flattened observed and expected vectors into the calculator, and ignore the reported degrees of freedom — the correct df for an r × c contingency table is (r − 1)(c − 1), not rc − 1. The χ² statistic the calculator returns is still correct; only df and the resulting p-value need recalculating against the smaller df.
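The flattening step is mechanical enough to script. The sketch below uses a hypothetical 2 × 2 table and an illustrative helper name; the final p-value line relies on a closed form that is valid only for df = 1.

```python
import math


def independence_inputs(table):
    """Flatten an r x c table and build expected counts from its margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    observed = [cell for row in table for cell in row]
    # Same row-major order as `observed`: E_ij = row total * column total / N
    expected = [r * c / n for r in row_totals for c in col_totals]
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return observed, expected, df


table = [[30, 20],
         [20, 30]]   # hypothetical 2 x 2 contingency table

observed, expected, df = independence_inputs(table)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right tail for df = 1 only: P(chi-square_1 >= x) = erfc(sqrt(x / 2)).
p = math.erfc(math.sqrt(chi2 / 2))

print(observed, expected)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")
```

Pasting the flattened vectors into a goodness-of-fit calculator would give the same χ² of 4 but evaluate it on df = 3, which is why the p-value has to be recomputed on df = 1.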

What is the difference between chi-square and a G-test?

Both are tests of goodness-of-fit on categorical data. The chi-square statistic is Σ(O − E)² / E; the G-test statistic is 2 × Σ O × ln(O / E). Both converge to the same chi-square sampling distribution as expected counts grow, and both give almost identical p-values in well-behaved data. The G-test is often recommended for small samples, although which statistic behaves better there depends on the configuration, and it is preferred in some literatures (notably ecology and population genetics). For most applied work the difference is academic.
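The two statistics are easy to compare side by side. In this sketch the function names are illustrative, and cells with an observed count of zero are given a contribution of zero in the G statistic, which is the usual convention because O × ln(O) tends to zero as O does.

```python
import math


def pearson_chi2(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_statistic(observed, expected):
    """Likelihood-ratio statistic: 2 * sum of O * ln(O / E), skipping empty cells."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


mendel_obs = [315, 108, 101, 32]
mendel_exp = [312.75, 104.25, 104.25, 34.75]
die_obs = [30, 10, 20, 40]
die_exp = [25, 25, 25, 25]

for name, obs, exp in (("Mendel", mendel_obs, mendel_exp), ("Die", die_obs, die_exp)):
    print(f"{name}: chi2 = {pearson_chi2(obs, exp):.3f}, G = {g_statistic(obs, exp):.3f}")
```

On the close-fitting Mendel data the two agree to the second decimal place; on the badly fitting die data they drift apart somewhat, though both lead to the same conclusion.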

Does the chi-square calculator handle weighted data?

It treats inputs as raw counts. If your data come from a survey with sampling weights, the unweighted chi-square test is inappropriate — you need a Rao-Scott-corrected chi-square or a Wald test on weighted estimates. Software like R’s survey package or Stata’s svy: prefix is the standard way to handle that. The calculator on this page is for unweighted, independent observations.

How does the calculator compute the p-value internally?

The p-value is P(χ²_{df} ≥ observed), where χ²_{df} is the chi-square distribution with df degrees of freedom. That probability is computed as the upper regularised incomplete gamma function Q(df/2, observed/2). The implementation uses a continued-fraction expansion for moderate-to-large arguments and a series expansion for small arguments, switching at the crossover where each is most accurate. The result is correct to roughly fifteen significant figures across the full range of inputs the calculator accepts.
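For readers who want to see the shape of such a routine, here is a textbook-style sketch in Python. It is not the calculator's source code: the function names, the tolerances, and the choice to switch methods at x = a + 1 (a common convention) are all assumptions made for illustration.

```python
import math


def _p_series(a, x, eps=1e-15, max_iter=10_000):
    """Lower regularised gamma P(a, x) by power series; suited to x < a + 1."""
    term = 1.0 / a
    total = term
    denom = a
    for _ in range(max_iter):
        denom += 1
        term *= x / denom
        total += term
        if abs(term) < abs(total) * eps:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _q_continued_fraction(a, x, eps=1e-15, max_iter=10_000):
    """Upper regularised gamma Q(a, x) by a modified Lentz continued fraction."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_p_value(chi2, df):
    """Right-tail p-value Q(df/2, chi2/2) for df > 0."""
    if chi2 <= 0:
        return 1.0
    a, x = df / 2, chi2 / 2
    if x < a + 1:                       # assumed crossover point
        return 1.0 - _p_series(a, x)
    return _q_continued_fraction(a, x)


print(f"{chi2_p_value(0.470, 3):.3f}")   # Mendel example
print(f"{chi2_p_value(20, 3):.5f}")      # die example
```

Unlike closed forms that only work for whole-number df, this approach also handles fractional degrees of freedom, at the cost of an iterative loop.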

Can I use chi-square with very large samples?

Yes — the test was designed for it. The asymptotic approximation only improves as N grows. The catch is that very large samples make the test sensitive to deviations that are statistically real but practically meaningless. Always inspect the per-cell contributions and the effect size (φ or Cramér’s V) alongside the p-value when N is in the millions.

How is this different from the p-value calculator?

The p-value calculator converts a test statistic — z, t, chi-square, or F — that you already have into a p-value. The chi-square calculator on this page goes one step further back: you give it the raw observed and expected counts, and it computes χ², the degrees of freedom, and the p-value for you. If you have already run a chi-square test elsewhere and just need the p-value from a χ² and df, use the p-value calculator instead.

Informational only. Not personalised financial, legal, or tax advice.