ANOVA Explained: F-test, p-value, eta-squared, assumptions

What ANOVA is for

Analysis of variance (ANOVA) is the statistical test that asks whether the means of several groups are all the same, or whether at least one of them is genuinely different from the rest. You hand it three sets of exam scores, or yields from five factories, or reaction times from four drug arms, and it answers a single question: is the spread between the group averages bigger than you would expect from random sampling noise alone? The ANOVA calculator on Calc Dragon runs the one-way version of that test (one categorical factor with between two and five levels) and returns the F statistic, the degrees of freedom, the p-value, and eta-squared as an effect size, all from a paste of raw numbers.

The intuition is older than the formula. If you take three groups whose populations actually share a common mean, the three sample means will still differ — sampling noise alone guarantees that. The question is whether they differ more than that noise accounts for. ANOVA answers by comparing two estimates of the underlying variance. One estimate is the spread of observations around their own group mean; that estimate is honest regardless of whether the groups truly differ. The other estimate is the spread of the group means themselves, scaled up; that one is honest only when the groups are equal, and is inflated otherwise. If the second estimate is much larger than the first, the groups must really differ.

The formula: F = MS between ÷ MS within

The arithmetic is a bookkeeping exercise on a single quantity, the total sum of squared deviations from the grand mean, split into two pieces. Call the groups 1, 2, …, k, with sizes n_i and means x̄_i; let N = Σ n_i be the total number of observations and x̄ the grand mean of all N values.

Total sum of squares. SS_T = Σ Σ (x_ij − x̄)². Every observation in every group, squared deviation from the grand mean, added up. This is the "raw" variation you started with.
Between-group sum of squares. SS_B = Σ n_i (x̄_i − x̄)². Each group mean's deviation from the grand mean, squared, weighted by how many observations contributed to that group mean. This is the variation explained by group membership.
Within-group sum of squares. SS_W = Σ Σ (x_ij − x̄_i)² = SS_T − SS_B. The leftover spread of each observation around its own group's mean. This is the noise floor — the variability that group membership can't explain.

Each sum of squares has a degrees-of-freedom partner: df_B = k − 1, df_W = N − k, df_T = N − 1, and the two partials add to the total. Dividing each SS by its df gives a mean square — an estimate of variance — and the ratio of the two is the test statistic:

F = (SS_B / (k − 1)) ÷ (SS_W / (N − k)) = MS_B / MS_W

Under the null hypothesis that all population means are equal, F follows an F-distribution with (k − 1, N − k) degrees of freedom. The p-value is the right tail of that distribution: the probability of seeing an F at least this large by chance if every group really comes from the same population. Small p-values are evidence that the groups do not share a mean. The ANOVA calculator handles all of this arithmetic — the F statistic, the right-tailed p, and the sum-of-squares table — from one paste of numbers.
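As a rough illustration of the partition above, the arithmetic can be sketched in a few lines of standard-library Python. The function name one_way_anova and the toy data are assumptions made for this example; this is not the calculator's actual code, and it stops at F without computing a p-value.

```python
from statistics import fmean


def one_way_anova(groups):
    """Partition the total sum of squares and return the pieces of the F ratio."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = fmean(all_values)

    # SS_B: each group mean's squared distance from the grand mean, weighted by n_i.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # SS_W: each observation's squared distance from its own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    # SS_T computed directly, so the identity SS_T = SS_B + SS_W can be checked.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


# Hypothetical toy data: three groups of three observations.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result["ss_between"], result["ss_within"], result["f"])
```

For the toy data the group means are 2, 3 and 6, so SS_B = 26, SS_W = 6 and F = (26 / 2) ÷ (6 / 6) = 13.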

Worked example: three batches of widgets

Take three production lines turning out widgets, and pull a sample of five widgets off each line. Measure the strength of each in arbitrary units:

Group 1 (Line A): 5, 6, 7, 8, 7. Mean x̄₁ = 6.6.
Group 2 (Line B): 9, 10, 11, 10, 12. Mean x̄₂ = 10.4.
Group 3 (Line C): 6, 7, 6, 8, 7. Mean x̄₃ = 6.8.

All up, N = 15, k = 3, and the grand mean is x̄ = (33 + 52 + 34) / 15 = 7.933. Line B's average looks higher than the other two; the ANOVA tells you whether that difference survives a hypothesis test.

SS between. 5 · (6.6 − 7.933)² + 5 · (10.4 − 7.933)² + 5 · (6.8 − 7.933)² = 5 · 1.778 + 5 · 6.084 + 5 · 1.284 ≈ 8.89 + 30.42 + 6.42 ≈ 45.73.

SS within. Within Line A, the squared deviations from 6.6 are (1.6² + 0.6² + 0.4² + 1.4² + 0.4²) = 2.56 + 0.36 + 0.16 + 1.96 + 0.16 = 5.2. Within Line B, deviations from 10.4 give (1.96 + 0.16 + 0.36 + 0.16 + 2.56) = 5.2. Within Line C, deviations from 6.8 give (0.64 + 0.04 + 0.64 + 1.44 + 0.04) = 2.8. Total SS_W = 5.2 + 5.2 + 2.8 = 13.2.

Mean squares. MS_B = 45.73 / (3 − 1) = 22.87. MS_W = 13.2 / (15 − 3) = 1.10.

F. 22.87 / 1.10 ≈ 20.79 on (2, 12) degrees of freedom. Look that up in an F-table, or feed the data into the ANOVA calculator, and the right-tailed p-value is about 0.00013. That is well below the 0.01 threshold, so reject the null and conclude that at least one of the three production lines produces widgets with a different mean strength than the others. Eta-squared is 45.73 / (45.73 + 13.2) = 45.73 / 58.93 ≈ 0.776: more than three-quarters of the total variation is explained by which line a widget came from. That is a very large effect, so it is small wonder the F statistic blew through the threshold.
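The hand calculation above can be double-checked with a short standard-library Python sketch. One assumption is worth flagging: the p-value line uses a closed-form shortcut that holds only when the between-groups degrees of freedom equal exactly 2, as they do here; a general F tail probability needs the incomplete beta function or a statistics library.

```python
from statistics import fmean

# Widget strengths from the worked example.
lines = {
    "A": [5, 6, 7, 8, 7],
    "B": [9, 10, 11, 10, 12],
    "C": [6, 7, 6, 8, 7],
}
groups = list(lines.values())
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = fmean([x for g in groups for x in g])

ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

df_between, df_within = k - 1, n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Shortcut valid ONLY for df_between == 2:
# P(F > f) = (1 + 2 f / df_within) ** (-df_within / 2)
assert df_between == 2
p_value = (1 + df_between * f_stat / df_within) ** (-df_within / 2)

eta_squared = ss_between / (ss_between + ss_within)

print(f"SS_B = {ss_between:.2f}, SS_W = {ss_within:.2f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
print(f"p = {p_value:.2e}, eta squared = {eta_squared:.3f}")
```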

ANOVA does not tell you which line is the odd one out. It only tells you that the three are not all the same. Follow-up pairwise comparisons — Tukey HSD or a Bonferroni-adjusted t-test — are the standard next step. In this example, eyeballing the means makes it obvious that Line B is the high-mean outlier, but in general the follow-up tests are how you carve up the difference.

The assumptions behind the F-test

ANOVA is built on three assumptions, and you should at least skim them before trusting the p-value.

Independence of observations

Every observation must be an independent random draw from its group's population. If observations are clustered — repeated measurements on the same person, time-series within a single subject, students nested inside classrooms — one-way ANOVA is the wrong test. Repeated-measures ANOVA, mixed-effects models, or hierarchical models are designed for that case. Violating independence is the single most damaging assumption breach; non-independent data can produce wildly significant p-values without any real effect.

Approximate normality within groups

Each group's population is assumed to be normally distributed (Gaussian). In practice the F-test is reasonably robust to moderate non-normality, especially when sample sizes are roughly equal, because of the central limit theorem. Heavy-tailed distributions and strong skew are the problem cases. If your data is plainly non-normal — counts, ranks, lifetimes with a fat tail — consider the Kruskal–Wallis test, which is the non-parametric analogue of one-way ANOVA and operates on the ranks of the data instead of the raw values.
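For readers curious what "operates on the ranks" means in practice, here is a minimal sketch of the Kruskal–Wallis H statistic in standard-library Python. The helper names average_ranks and kruskal_wallis_h are made up for this illustration, and the data are hypothetical. The sketch uses the textbook formula with the usual tie correction and stops at H; the p-value is normally taken from a chi-squared approximation with k − 1 degrees of freedom, which is not computed here.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), divided by the tie correction."""
    values = [x for g in groups for x in g]
    n_total = len(values)
    ranks, tie_sizes = average_ranks(values)

    rank_term = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        start += len(g)

    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    # Tie correction; equals 1 when there are no ties (and 0 if all values are equal).
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    return h / correction


# Hypothetical, fully separated groups.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 4))
```

With these nine values the rank sums are 6, 15 and 24, giving H = 7.2.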

Equal variances across groups (homoscedasticity)

The populations are assumed to share a common variance σ². ANOVA is robust to mild violations when sample sizes are equal, but sensitivity rises sharply when both the variances and the sample sizes vary. Levene's test or Bartlett's test will check this formally; in practice, plotting the within-group spreads is enough to spot a problem. If variances are clearly unequal, use Welch's ANOVA — the heteroscedastic version — which adjusts the degrees of freedom to compensate. The NIST/SEMATECH e-Handbook covers the adjustment in detail.
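To make the degrees-of-freedom adjustment concrete, the following standard-library Python sketch computes Welch's F statistic and its two degrees of freedom from the commonly quoted textbook formula. The function name welch_anova and the data are assumptions for illustration; nothing here is claimed to match how the calculator or any particular package implements it, and no p-value is computed because the denominator df is generally fractional.

```python
from statistics import fmean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's heteroscedastic one-way ANOVA."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [fmean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisy groups count for less.
    # (A group with zero sample variance would divide by zero here.)
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum((1 - w / w_total) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # usually not an integer
    return numerator / denominator, df1, df2


# Hypothetical data: unequal sizes and visibly unequal spreads.
f_welch, df1, df2 = welch_anova([
    [10.0, 11.0, 9.0, 10.0],
    [14.0, 20.0, 8.0, 17.0, 11.0, 23.0],
    [12.0, 12.5, 11.5],
])
print(f"Welch F = {f_welch:.3f} on ({df1}, {df2:.2f}) df")
```

With two groups, this reduces to the square of Welch's t statistic with the Welch–Satterthwaite degrees of freedom, which is a convenient sanity check on the formula.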

Why not just run several t-tests?

The most common question new statistics students ask, and a fair one. The answer is multiplicity. A single two-sample t-test at α = 0.05 has a 5% chance of producing a false positive when the null is true. Run three pairwise comparisons across three groups and the chance that at least one is a false positive is roughly 1 − 0.95³ ≈ 14% (treating the tests as independent, which is only approximately true when comparisons share groups). Run ten pairwise comparisons across five groups and the chance climbs to about 40%. The "family-wise" error rate balloons with the number of tests.
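The growth of the family-wise error rate is easy to tabulate. This small Python sketch assumes, as the back-of-envelope figures above do, that the pairwise tests are independent; the function name is made up for the example.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, which is an approximation for pairwise
    comparisons that share groups.
    """
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests


for n_groups in (2, 3, 4, 5):
    n_tests, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {n_tests} tests -> FWER about {fwer:.1%}")
```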

ANOVA bundles the whole question into a single test that controls the overall α at the level you chose. That is its main reason for existing. If ANOVA does reject, you then use a procedure designed for multiple comparisons — Tukey HSD, Bonferroni, Scheffé — to identify which pairs actually differ, with the appropriate adjustment. The flow is "global test first, follow-ups second", and skipping the global test is a recipe for false positives.

Special case: with exactly two groups, a one-way ANOVA is mathematically equivalent to a pooled-variance two-sample t-test. F equals t² and the p-values match exactly. Either test works fine. The ANOVA framing only adds value once you have three or more groups.
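The F = t² identity can be checked numerically. The sketch below, in standard-library Python, reuses Lines A and B from the worked example as a two-group dataset and computes the pooled-variance t statistic and the one-way F statistic separately.

```python
from math import sqrt
from statistics import fmean, variance

group_a = [5, 6, 7, 8, 7]
group_b = [9, 10, 11, 10, 12]
n_a, n_b = len(group_a), len(group_b)

# Pooled-variance two-sample t statistic.
pooled_var = (
    (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
) / (n_a + n_b - 2)
t_stat = (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# One-way ANOVA F statistic on the same two groups (df_B = 1, df_W = N - 2).
grand_mean = fmean(group_a + group_b)
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in (group_a, group_b))
ss_within = sum((x - fmean(g)) ** 2 for g in (group_a, group_b) for x in g)
f_stat = (ss_between / 1) / (ss_within / (n_a + n_b - 2))

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

The two statistics agree up to floating-point rounding because MS_W is exactly the pooled variance when there are only two groups.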

Reading the output

The output of any ANOVA — printout, calculator, spreadsheet — is usually a table. The ANOVA calculator renders it the standard way, with one row per source of variation.

Source. Between groups (also called "treatment" or "model"), within groups ("error" or "residual"), and total.
SS. Sum of squares, the un-normalised variation.
df. Degrees of freedom: k − 1, N − k, N − 1 respectively.
MS. Mean square = SS / df.
F. Ratio MS_B / MS_W, on the "between" row only.
p-value. Right tail of the F-distribution.

Alongside the table, two summary numbers are worth knowing. Eta-squared (η² = SS_B / SS_T) is the proportion of total variance explained by group membership; it is the descriptive analogue of R² in regression. Cohen's rough benchmarks are η² = 0.01 (small), 0.06 (medium), 0.14 (large), but the right benchmark depends on the field. The "significance at 0.05" and "significance at 0.01" flags are convenience markers — the underlying p-value is the real answer.
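As an illustration of how the table and the effect size fit together, here is a standard-library Python sketch that rebuilds the rows from SS_B, SS_W, k and N, then attaches Cohen's rough label to η². The function names are hypothetical, the column layout is only one plausible rendering, and calling anything below 0.01 "negligible" is an assumption of this example rather than part of Cohen's benchmarks.

```python
def anova_table(ss_between, ss_within, k, n_total):
    """Return rows of (source, SS, df, MS, F); MS and F are None where not defined."""
    df_b, df_w = k - 1, n_total - k
    ms_b, ms_w = ss_between / df_b, ss_within / df_w
    return [
        ("Between", ss_between, df_b, ms_b, ms_b / ms_w),
        ("Within", ss_within, df_w, ms_w, None),
        ("Total", ss_between + ss_within, df_b + df_w, None, None),
    ]


def effect_size_label(eta_squared):
    """Cohen's rough benchmarks: 0.01 small, 0.06 medium, 0.14 large."""
    if eta_squared >= 0.14:
        return "large"
    if eta_squared >= 0.06:
        return "medium"
    if eta_squared >= 0.01:
        return "small"
    return "negligible"  # label chosen for this sketch


def fmt(value):
    return f"{value:10.2f}" if value is not None else " " * 10


# Sums of squares from the widget example: k = 3 groups, N = 15 observations.
rows = anova_table(45.73, 13.2, k=3, n_total=15)
print(f"{'Source':<8}{'SS':>10}{'df':>5}{'MS':>10}{'F':>10}")
for source, ss, df, ms, f_stat in rows:
    print(f"{source:<8}{ss:10.2f}{df:5d}{fmt(ms)}{fmt(f_stat)}")

eta_squared = rows[0][1] / rows[2][1]
print(f"eta squared = {eta_squared:.3f} ({effect_size_label(eta_squared)})")
```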

Common mistakes

Treating ANOVA as a post-hoc test

ANOVA tells you that some difference exists. It does not tell you which groups differ. People sometimes glance at the means, pick the largest one, and declare "Group 3 is significantly different" on the strength of the ANOVA alone. That is not what the test says. Always pair a significant ANOVA with a multiple-comparisons procedure if you want pairwise conclusions.

Ignoring effect size

With enough data, even a meaningless difference can be statistically significant. A p-value of 0.001 across groups whose means differ by a millimetre is technically real but practically nothing. Always report eta-squared (or another effect-size measure) alongside the p-value. The ANOVA calculator shows both; treat them as a pair, not a hierarchy.

Forgetting about independence

Repeated measurements on the same subject, students nested in classrooms, repeat batches from the same line in the same week — these are not independent and one-way ANOVA is the wrong tool. Reach for repeated-measures ANOVA, mixed-effects models, or a properly specified linear model with random effects.

Running ANOVA on highly unequal sample sizes with unequal variances

This is the worst case for the F-test's robustness. If your group sizes vary by more than about a factor of two and the variances also vary, the nominal p-value can be quite far from the true Type-I error rate. Welch's ANOVA or a non-parametric alternative is the safer call.

How ANOVA fits into the wider statistics toolkit

One-way ANOVA is the simplest case of a much larger family. Two-way ANOVA crosses two categorical factors and tests for two main effects plus their interaction. Repeated-measures ANOVA handles within-subject designs. ANCOVA layers in continuous covariates. MANOVA handles multiple correlated outcomes. Every one of these is a special case of the general linear model, and modern practice often skips the "ANOVA" terminology entirely in favour of fitting a linear model and reading off the same partitioning of variance. R's aov() and lm() produce identical F-tests for the same design; Python's statsmodels does the same.

For most data-analysis tasks, the recipe is: visualise the groups first (box plot, dot plot, mean-with-error-bar), then run the global ANOVA, then run pairwise comparisons if it rejects, then report effect size. Variance and standard deviation are the building blocks underneath — see the standard deviation calculator for the per-group spread, and the p-value calculator for single-statistic p-value conversions across z, t and χ².

When to seek expert help

ANOVA is a workhorse, but its assumptions are real and a few designs need more than the one-way version delivers. Reach for a statistician — or at least a dedicated package — if you have repeated measurements on the same subjects, a nested or hierarchical design (students within schools, leaves within plants), strongly unequal variances combined with unequal sample sizes, missing data that is not missing completely at random, or multiple correlated outcomes. The ANOVA calculator handles the clean one-way case for between two and five groups; anything beyond that benefits from a tool that can fit mixed-effects models.

Frequently asked questions

What is the difference between one-way and two-way ANOVA?

One-way ANOVA has one categorical factor with several levels (which production line, which drug arm, which teaching method). Two-way ANOVA has two categorical factors crossed together — for example, drug (A, B, C) crossed with sex (male, female) — and tests three things at once: the main effect of the first factor, the main effect of the second, and the interaction between them. The interaction asks whether the effect of one factor depends on the level of the other. The Calc Dragon ANOVA calculator handles the one-way case only.

How do I report ANOVA results in a paper?

Conventional reporting style is "F(df_B, df_W) = X, p = Y, η² = Z." For the worked example above that becomes "F(2, 12) = 20.79, p < 0.001, η² = 0.78." Round F to two decimals, round p to three (or report "< 0.001" for very small values), and round η² to two. Some journals also ask for the full ANOVA table; the SS / df / MS layout above is the standard format.
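If you report many tests, a small formatter keeps the rounding consistent. The Python sketch below follows the rounding rules just described; the function name format_anova_report is hypothetical, and individual journals may want a different style (for example, no leading zero on p).

```python
def format_anova_report(f_stat, df_between, df_within, p_value, eta_squared):
    """Format a one-way ANOVA result as 'F(df_B, df_W) = X, p = Y, η² = Z'."""
    if p_value < 0.001:
        p_text = "p < 0.001"
    else:
        p_text = f"p = {p_value:.3f}"
    return (
        f"F({df_between}, {df_within}) = {f_stat:.2f}, "
        f"{p_text}, η² = {eta_squared:.2f}"
    )


# Values from the worked example.
report = format_anova_report(20.79, 2, 12, 0.00013, 0.776)
print(report)  # F(2, 12) = 20.79, p < 0.001, η² = 0.78
```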

What if my F statistic is exactly 1?

F ≈ 1 means MS between ≈ MS within: the spread of the group means is no bigger than the spread within each group. That is exactly what you would expect under the null hypothesis. The p-value will be unremarkable, typically somewhere between about 0.3 and 0.5 depending on the df (with the (2, 12) degrees of freedom of the worked example, F = 1 gives p ≈ 0.40), and you have no evidence that the groups differ. Report the test anyway: non-significant results are valid findings, especially if the design was pre-specified.

Can I run ANOVA on percentages or proportions?

Cautiously. Percentages bounded between 0 and 100 are not normal; proportions near 0 or 1 have variance that depends on the mean, violating equal-variance. For mild cases (percentages mostly in the 30–70 range) ANOVA is fine. For percentages near the boundaries, transform with logit or arcsine-square-root first, or switch to a logistic regression / generalised linear model with a binomial response. The same caveat applies to count data near zero, which is better handled by a Poisson or negative-binomial GLM.
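The two transforms mentioned above are one-liners. This Python sketch assumes the inputs are proportions on the 0 to 1 scale (divide percentages by 100 first); the function names and the sample values are chosen for illustration. Note that the logit is undefined at exactly 0 or 1, so boundary values need separate handling that is not shown here.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at 0 and 1")
    return math.log(p / (1.0 - p))


def arcsine_sqrt(p):
    """Arcsine-square-root transform of a proportion in [0, 1], in radians."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(p))


# Hypothetical proportions, including some near the boundaries.
proportions = [0.02, 0.25, 0.50, 0.75, 0.98]
for p in proportions:
    print(f"p = {p:.2f}  logit = {logit(p):7.3f}  arcsine-sqrt = {arcsine_sqrt(p):.3f}")
```

Both transforms stretch the scale near 0 and 1, which is where raw proportions are most compressed.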

What is the relationship between ANOVA and regression?

They are the same model. One-way ANOVA with k groups is identical to a linear regression with k − 1 dummy variables (one per non-reference group) plus an intercept. The F-test from ANOVA is the overall F-test for the regression model versus an intercept-only model. Eta-squared from ANOVA equals R² from the regression. Modern practice usually fits the linear model and reads off both interpretations from the same output.
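The equivalence can be demonstrated on the widget data by fitting the dummy-variable regression and comparing R² with η². The sketch below is standard-library Python; it solves the normal equations with a small hand-written elimination routine, which is fine for a 3 × 3 system but is an illustrative shortcut rather than how a statistics package would fit the model.

```python
from statistics import fmean


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


groups = [[5, 6, 7, 8, 7], [9, 10, 11, 10, 12], [6, 7, 6, 8, 7]]
k = len(groups)

# Design matrix: intercept plus k - 1 dummies, with the first group as reference.
X, y = [], []
for i, g in enumerate(groups):
    for value in g:
        X.append([1.0] + [1.0 if i == j else 0.0 for j in range(1, k)])
        y.append(float(value))

# Normal equations: (X'X) beta = X'y.
xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
beta = solve_linear_system(xtx, xty)

fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
grand_mean = fmean(y)
ss_total = sum((yi - grand_mean) ** 2 for yi in y)
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - ss_resid / ss_total

ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
eta_squared = ss_between / ss_total

print("coefficients:", [round(b, 3) for b in beta])
print(f"R squared = {r_squared:.4f}, eta squared = {eta_squared:.4f}")
```

The intercept comes out as the reference group's mean and each dummy coefficient as that group's difference from the reference, so the fitted values are simply the group means and the residual sum of squares is SS_W.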

How large should each group be?

Larger is always better for power. With only two groups, ANOVA can be run with as few as two observations per group (technically a t-test); with three groups, three observations per group is the practical floor. For meaningful inference, statistics texts often suggest at least 10–30 per group; for small-effect detection you might need hundreds. The sample size calculator can do the power-based planning; aim for equal group sizes whenever you can because that maximises power and minimises sensitivity to unequal variances.

Where does the F-distribution come from?

The F-distribution is the ratio of two independent chi-squared random variables, each divided by its degrees of freedom. Under the null hypothesis, MS_B and MS_W are independent estimators of the same population variance σ², each proportional to a chi-squared. Their ratio cancels σ² and lands on an F-distribution with (k − 1, N − k) degrees of freedom. Tables of F values were the standard reference in the pre-computer era; today, the calculator evaluates the regularised incomplete beta function — the same special function that underlies the F-distribution's CDF — and returns the exact tail probability.
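For the curious, the link between the F tail probability and the regularised incomplete beta function can be sketched in standard-library Python. The identity used is P(F > f) = I_x(df2 / 2, df1 / 2) with x = df2 / (df2 + df1 · f). The continued-fraction evaluation below is a common textbook approach (modified Lentz iteration); it is an illustrative implementation, not a claim about the calculator's internals, and the function names are made up for this example.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for a, b > 0 and 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_(1-x)(b, a) where it converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_survival(f_stat, df1, df2):
    """Right-tail probability P(F > f_stat) for an F(df1, df2) distribution."""
    if f_stat <= 0:
        return 1.0
    x = df2 / (df2 + df1 * f_stat)
    return regularized_incomplete_beta(df2 / 2.0, df1 / 2.0, x)


p_example = f_survival(20.79, 2, 12)
print(f"P(F(2, 12) > 20.79) = {p_example:.2e}")
```

A handy check: when df1 = 2 the tail has the closed form (1 + 2f / df2)^(−df2 / 2), so the two calculations should agree for the worked example's (2, 12) degrees of freedom.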

Related calculators

ANOVA Calculator — paste up to five groups and read the F, p-value, η² and the full ANOVA table.
Standard Deviation Calculator — sample and population spread for any single set of numbers.
P-Value Calculator — convert a z, t, F or chi-squared statistic to a tail probability.
Confidence Interval Calculator — normal and t intervals for the mean of a single sample.
Sample Size Calculator — power-based sample-size planning before you run the experiment.
Average Calculator — mean, median, mode and range for a single list of numbers.
Five-Number Summary Calculator — min, Q1, median, Q3 and max in one pass.