P-Values: How They Are Calculated, What They Mean, and How Not to Misread Them

A p-value is the probability of seeing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true. This guide explains when to use the z, t and chi-squared distributions, walks through the cumulative-distribution maths the calculator runs, works through a two-sample t-test sitting right on the 0.05 boundary, and lists the traps — one-tailed flipping, multiple comparisons, small-n z-tests — that turn an honest analysis into an irreproducible one.


What a p-value actually is

A p-value is the probability of observing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true. That phrasing is precise and almost nobody remembers it correctly. It is not the probability that the null hypothesis is true, it is not the probability your result is due to chance, and it is not one minus the probability your alternative hypothesis is true. The p-value calculator on this site takes the test statistic your software already gave you (a z, t or chi-squared) and returns the matching p-value plus a verdict at α = 0.05 and α = 0.01 — but understanding what the number actually represents is the harder half of the job.

The clearest way to think about it: pretend the null hypothesis is exactly true. Now ask, "If I repeated this experiment many times, how often would I see a statistic as far from the centre of the null distribution as the one I observed, or further?" A p-value of 0.03 means that under the null, the observed result (or anything more extreme) would happen roughly three times in a hundred. That is rare enough that most researchers will reject the null at the conventional 5 % threshold — but rare is not impossible, and a low p-value with a small effect size is the standard recipe for an unreplicable finding.
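The "repeat the experiment many times" framing can be checked directly with a small simulation. The sketch below assumes the test statistic is a z-score, so each repeat under a true null is a single standard-normal draw; the function name, the repeat count and the seed are illustrative choices, not part of the calculator.

```python
import random

def simulated_two_tailed_p(z_observed, n_repeats=200_000, seed=0):
    """Fraction of null-world experiments at least as extreme as z_observed."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_repeats):
        z = rng.gauss(0.0, 1.0)  # one test statistic drawn under a true null
        if abs(z) >= abs(z_observed):
            extreme += 1
    return extreme / n_repeats

# z = 2.17 corresponds to a two-tailed p of roughly 0.03
p_sim = simulated_two_tailed_p(2.17)
print(round(p_sim, 3))  # lands near 0.03, up to simulation noise
```

The simulated fraction is the p-value by definition: it counts how often the null alone produces something as far from the centre as what was observed.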

The three distributions the calculator covers

The p-value calculator supports the three null distributions behind most everyday hypothesis tests: the standard normal (z), Student's t, and chi-squared (χ²). Each has its own conditions for use, and t and χ² additionally need a degrees-of-freedom value.

z — standard normal

Use z when the population standard deviation σ is known, or when the sample size is large enough (n ≥ 30 is a common rule of thumb) that the sample standard deviation is a good proxy for σ. One-proportion z-tests and large-sample mean tests use this distribution. The standard normal has no degrees-of-freedom parameter — its shape is fixed, with mean 0 and variance 1. The calculator ignores the df field when you pick z.

t — Student's t

Use t when σ is unknown and you have to estimate it from the sample. One-sample t-tests, two-sample t-tests, and the coefficient tests inside an OLS regression all live here. The t-distribution looks like a normal distribution with fatter tails — the fatness comes from the extra uncertainty in estimating σ — and it converges to the normal as the degrees of freedom rise. For a one-sample t-test, df is n − 1. For Welch's two-sample t-test, df is the Welch–Satterthwaite approximation that your software prints. For an OLS regression coefficient, df is the residual degrees of freedom, usually n − k − 1.
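The degrees-of-freedom rules above can be written down in a few lines. This is a minimal sketch using the standard Welch–Satterthwaite formula; the function names and the sample variances are hypothetical, and in practice your software prints this value for you.

```python
def one_sample_df(n):
    """df for a one-sample t-test."""
    return n - 1

def welch_df(var_a, n_a, var_b, n_b):
    """Welch-Satterthwaite approximate df from sample variances and sizes."""
    se2_a = var_a / n_a  # squared standard error contributed by group A
    se2_b = var_b / n_b  # squared standard error contributed by group B
    return (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )

def residual_df(n_obs, n_predictors):
    """Residual df for an OLS coefficient test (intercept included)."""
    return n_obs - n_predictors - 1

# Hypothetical inputs: two groups of 6 observations each
df_equal = welch_df(4.0, 6, 4.0, 6)    # equal variances
df_unequal = welch_df(1.0, 6, 9.0, 6)  # very unequal variances
print(round(df_equal, 1), round(df_unequal, 1))  # 10.0 6.1
```

With equal variances and equal group sizes the Welch df matches the pooled value n₁ + n₂ − 2; it falls as the variances diverge, which is why it is usually not a whole number.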

χ² — chi-squared

Use χ² for goodness-of-fit tests, tests of independence in contingency tables, and one-sample variance tests. The chi-squared distribution is strictly non-negative and right-skewed, with df controlling both the mean (which equals df) and the skew (the distribution becomes more symmetric as df rises). For a goodness-of-fit test with k categories, df is k − 1. For an r × c contingency table, df is (r − 1)(c − 1). The p-value calculator locks the tail to right-tailed for χ² because goodness-of-fit and independence tests are right-tailed by construction: the statistic is a sum of squared standardised differences, so small values mean "observed counts match expected" rather than "evidence against the null in the other direction". A variance test with a lower-tail or two-sided alternative needs the left tail as well, which the calculator does not expose.
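To make the "sum of squared standardised differences" concrete, here is a small sketch that computes a goodness-of-fit statistic and its df. The die counts are hypothetical and chosen only for illustration; the helper names are not from the calculator.

```python
def goodness_of_fit(observed, expected):
    """Return the chi-squared statistic and df for a goodness-of-fit test."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # k categories give k - 1 degrees of freedom
    return stat, df

def contingency_df(n_rows, n_cols):
    """df for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)

# Hypothetical counts from 60 rolls of a six-sided die
counts = [5, 8, 9, 8, 10, 20]
expected = [sum(counts) / len(counts)] * len(counts)  # 10 per face if fair

stat, df = goodness_of_fit(counts, expected)
print(round(stat, 2), df)    # 13.4 5
print(contingency_df(3, 4))  # 6
```

Every term in the sum is non-negative, so any departure from the expected counts, in either direction, pushes the statistic to the right.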

How the calculator computes the p-value

For each test, the calculator evaluates the cumulative distribution function (CDF) F of the relevant null distribution at your statistic, then takes the appropriate tail.

For z, F is the standard normal CDF Φ, implemented via the Abramowitz & Stegun 7.1.26 erf approximation. The maximum absolute error of that approximation is about 1.5 × 10⁻⁷, well below the four decimal places the calculator reports.
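For readers who want to see the approximation, the following is a sketch of Abramowitz & Stegun formula 7.1.26 with its published coefficients. The function names are illustrative, and the code is a reconstruction from the formula rather than the calculator's own source.

```python
import math

# Coefficients of Abramowitz & Stegun formula 7.1.26
P = 0.3275911
A1, A2, A3, A4, A5 = (0.254829592, -0.284496736, 1.421413741,
                      -1.453152027, 1.061405429)

def erf_as(x):
    """erf(x) via A&S 7.1.26; the formula covers x >= 0, erf is odd."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + P * x)
    # Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5
    poly = ((((A5 * t + A4) * t + A3) * t + A2) * t + A1) * t
    return sign * (1.0 - poly * math.exp(-x * x))

def norm_cdf(z):
    """Standard normal CDF built on the erf approximation."""
    return 0.5 * (1.0 + erf_as(z / math.sqrt(2.0)))

# Largest deviation from the library erf on a grid over [0, 5]
max_err = max(abs(erf_as(i / 100) - math.erf(i / 100)) for i in range(501))
print(max_err < 2e-7)                       # True
print(round(2 * (1 - norm_cdf(1.96)), 4))   # 0.05
```

Comparing against `math.erf` on a grid is a quick way to confirm the error bound quoted above.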

For t, F uses the regularised incomplete beta function I_x(a, b): F_T(t; ν) = 1 − ½ · I_x(ν/2, ½) with x = ν/(ν + t²) for t ≥ 0, and the symmetric reflection F_T(−t; ν) = 1 − F_T(t; ν) for t < 0. The incomplete beta is evaluated via Lentz's continued fraction with a relative tolerance of 10⁻¹⁴ and up to 400 iterations.
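The t CDF described above can be sketched as follows. This is one textbook way to write the modified Lentz continued fraction for the incomplete beta function, using the tolerance and iteration cap quoted in the text; the function names are illustrative and the calculator's actual code may be organised differently.

```python
import math

def _beta_cf(a, b, x, tol=1e-14, max_iter=400):
    """Continued fraction for the incomplete beta (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # The even and odd steps of the fraction use different coefficients
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h

def betainc_reg(a, b, x):
    """Regularised incomplete beta I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    # Use the symmetry relation where the fraction converges faster
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def t_cdf(t, df):
    """Student's t CDF via the incomplete beta function."""
    upper_half = 0.5 * betainc_reg(df / 2.0, 0.5, df / (df + t * t))
    return 1.0 - upper_half if t >= 0 else upper_half

p_two_tailed = 2 * (1 - t_cdf(2.228, 10))
print(round(p_two_tailed, 4))  # 0.05
```

Note that for t ≥ 0 the two-tailed p-value is simply I_x(ν/2, ½) itself, which is why the incomplete beta is the natural building block here.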

For χ², F is the regularised lower incomplete gamma function P(ν/2, x/2), where ν is the degrees of freedom and x is the statistic. A series expansion is used below the crossover point, and the same continued-fraction machinery evaluates the incomplete gamma function for values past it. Across the usual statistical range, the outputs match R's pnorm, pt and pchisq, Python's scipy.stats.norm.cdf, t.cdf and chi2.cdf, and a TI-84 to at least four decimal places.
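A sketch of the chi-squared CDF along the lines described: a series expansion for small arguments and a Lentz continued fraction past the crossover. The crossover used here (x < a + 1) is the common textbook choice and is an assumption, as are the function names; the source does not say where the calculator switches.

```python
import math

def _gamma_series(a, x, tol=1e-14, max_iter=400):
    """Lower regularised incomplete gamma P(a, x) by series expansion."""
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(max_iter):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * tol:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def _gamma_cf(a, x, tol=1e-14, max_iter=400):
    """Upper regularised incomplete gamma Q(a, x) by modified Lentz."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = b + an / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gammainc_lower_reg(a, x):
    """Regularised lower incomplete gamma P(a, x)."""
    if x <= 0.0:
        return 0.0
    if x < a + 1.0:  # assumed crossover between series and fraction
        return _gamma_series(a, x)
    return 1.0 - _gamma_cf(a, x)

def chi2_cdf(x, df):
    """Chi-squared CDF: P(df/2, x/2)."""
    return gammainc_lower_reg(df / 2.0, x / 2.0)

p_right = 1 - chi2_cdf(11.07, 5)
print(round(p_right, 4))  # 0.05
```

With df = 2 the CDF reduces to 1 − e^(−x/2), which gives a convenient closed-form check on the series branch.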

Tail conversion is straightforward once you have F. Right-tailed p is 1 − F(statistic). Left-tailed p is F(statistic). Two-tailed p is 2 · min(F, 1 − F) when the null distribution is symmetric (z and t) — which doubles whichever of the two one-tailed p-values is smaller.
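The tail rules translate almost word for word into code. In this sketch the CDF value comes from the standard library's `statistics.NormalDist` purely for illustration; the helper name `p_from_cdf` is hypothetical.

```python
from statistics import NormalDist

def p_from_cdf(cdf_value, tail):
    """Convert F(statistic) into a p-value for the requested tail."""
    if tail == "right":
        return 1.0 - cdf_value
    if tail == "left":
        return cdf_value
    if tail == "two":
        # Valid for symmetric null distributions (z and t)
        return 2.0 * min(cdf_value, 1.0 - cdf_value)
    raise ValueError("tail must be 'right', 'left' or 'two'")

F = NormalDist().cdf(1.96)
p_right = p_from_cdf(F, "right")
p_left = p_from_cdf(F, "left")
p_two = p_from_cdf(F, "two")
print(round(p_right, 4), round(p_left, 4), round(p_two, 4))  # 0.025 0.975 0.05
```

The left- and right-tailed values always sum to 1, so a one-tailed p-value above 0.5 is a sign that the statistic points in the opposite direction from the one being tested.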

Worked example: a two-sample t-test on the edge

You have run a two-sample t-test in your spreadsheet — group A has 6 observations, group B has 6 observations — and the formula returns t = 2.228 with df = 10. You want a two-tailed p-value. Open the p-value calculator, select t, enter 2.228, df = 10, tail = two. The result is p ≈ 0.0500 — the textbook 5 % critical value for t(10), which sits right on the boundary of significance at α = 0.05 and is not significant at α = 0.01.

Now suppose you had ignored the small sample and used z instead. Switch the calculator to z, statistic 2.228, two-tailed. You get p ≈ 0.0259 — about half the t-test p-value. The lesson is concrete: when df is small, the t-distribution has noticeably heavier tails than the standard normal, and reading off a z table when you should have used a t table can roughly halve the p-value. That is one of the most common ways an analyst overstates significance.
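The z-versus-t comparison can be reproduced without the calculator. The sketch below integrates the t density numerically with Simpson's rule, which stands in for the incomplete-beta route the calculator uses; the function names and step count are illustrative.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_tailed_p(t_stat, df, n_steps=2000):
    """Two-tailed p via Simpson's rule on [0, |t|]; n_steps must be even."""
    upper = abs(t_stat)
    h = upper / n_steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 1.0 - 2.0 * central_half

p_t = t_two_tailed_p(2.228, 10)
p_z = 2 * (1 - NormalDist().cdf(2.228))
print(round(p_t, 4), round(p_z, 4))  # 0.05 0.0259
```

Re-running `t_two_tailed_p` with a larger df shows the gap closing as the t-distribution converges to the normal.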

Try a chi-squared example as well. A goodness-of-fit test on a six-sided die rolled 60 times produced χ² = 11.07 with df = 5. Switch to χ², enter 11.07, df = 5. The calculator returns p ≈ 0.0500, which is again the textbook critical value. Rolling the die a hundred more times would not move that critical value, because df depends on the number of categories (6 − 1 = 5), not on the number of rolls. Test a die with more faces, though, and the χ² statistic would have to be further from zero to keep the p-value the same, because the distribution's mean equals df.

Factors that affect the p-value

Sample size

The single most important driver. Standard errors shrink in proportion to 1/√n, so test statistics grow with √n for any fixed underlying effect. Double the sample and a borderline two-tailed p-value of 0.10 will, all else equal, drop to roughly 0.02 in a z-test (a one-tailed 0.10 drops to about 0.035). This is why a tiny effect can still become "statistically significant" with a large enough sample: the familiar problem with p-values in the era of n = 100,000 datasets.
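The √n scaling is easy to work through numerically. This sketch assumes a z-test with the observed effect and standard deviation held fixed, so multiplying n by some factor multiplies z by the square root of that factor; the exact figure depends on the test and on the tail, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_after_scaling_n(p_before, n_multiplier, two_tailed=True):
    """p-value after multiplying n, holding effect and variability fixed."""
    tail_area = p_before / 2 if two_tailed else p_before
    z_before = STD_NORMAL.inv_cdf(1 - tail_area)   # statistic behind p_before
    z_after = z_before * math.sqrt(n_multiplier)   # z grows with sqrt(n)
    tail_after = 1 - STD_NORMAL.cdf(z_after)
    return 2 * tail_after if two_tailed else tail_after

p_two = p_after_scaling_n(0.10, 2, two_tailed=True)
p_one = p_after_scaling_n(0.10, 2, two_tailed=False)
print(round(p_two, 3), round(p_one, 3))  # 0.02 0.035
```

Trying larger multipliers shows how quickly any fixed non-zero effect is driven below any threshold, which is the large-sample problem described above.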

Effect size

The bigger the true effect, the further the test statistic sits from the null and the smaller the p-value. Effect size is the part of the picture the p-value does not tell you on its own. Always report a measure of effect size — Cohen's d, an odds ratio, a regression coefficient with a confidence interval — alongside the p-value. A p-value answers "is it real?", not "is it big?".

Variability in the data

Larger sample standard deviation means larger standard error means a smaller test statistic for the same effect. Heavy tails, outliers and skewed distributions can all inflate the denominator. The fix is rarely "switch to a one-tailed test"; more often it is "use a robust estimator", "transform the variable" or "use a non-parametric test like Mann–Whitney or the sign test".

Choice of tail

A one-tailed test gives you a p-value half the size of the two-tailed equivalent when the observed effect falls in the hypothesised direction, but that is only legitimate if your alternative hypothesis was directional before you collected the data. Choosing one-tailed after seeing the result so you can halve the p-value is data dredging: it doubles the false-positive rate at any given threshold, and any competent reviewer will spot it. Default to two-tailed unless you have a pre-registered directional hypothesis.
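The doubling of the false-positive rate can be seen in a short simulation. The sketch below draws z statistics under a true null and compares a pre-specified right-tailed test with a test whose direction is picked after seeing the sign of the result; the function name, simulation size and seed are illustrative assumptions.

```python
import random
from statistics import NormalDist

def false_positive_rate(pick_tail_after_seeing_data, alpha=0.05,
                        n_sims=100_000, seed=1):
    """Share of true-null experiments declared significant."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)  # statistic under a true null
        if pick_tail_after_seeing_data:
            # One-tailed p in whichever direction the data happened to go
            p = 1 - std_normal.cdf(abs(z))
        else:
            # Right-tailed p, direction fixed in advance
            p = 1 - std_normal.cdf(z)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

honest = false_positive_rate(False)
post_hoc = false_positive_rate(True)
print(round(honest, 2), round(post_hoc, 2))  # close to 0.05 and 0.10
```

The post-hoc rate matches what a two-tailed test at α = 0.10 would give, which is exactly what the analyst has run without saying so.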

Degrees of freedom (for t and χ²)

The shape of the t and χ² distributions depends on df. With only 2 or 3 df, the t-distribution's tails are dramatically heavier than the normal's, so the same statistic yields a much larger p-value. Get df wrong by passing n instead of n − 1 to a one-sample t-test and your p-value will be too small. The calculator does not check your df — it trusts what you enter — so make sure the value matches whatever your test setup expects.

How to interpret a p-value without falling into the usual traps

  • Treat 0.05 as a convention, not a law. The 5 % threshold dates to Fisher in the 1920s and is not grounded in any deep statistical theory. Pre-register the threshold you will use, justify it for the problem at hand, and report the actual p-value rather than just "p < 0.05".
  • Always pair with effect size and a confidence interval. The American Statistical Association's 2016 statement on p-values is explicit on this. The confidence interval calculator gives you a plausible range for the parameter you are estimating; the p-value alone does not.
  • Distinguish "not significant" from "no effect". p = 0.20 with a sample of 10 might easily hide a real effect — you simply did not have the power to detect it. Use a sample size calculator ahead of time to plan for the smallest effect you care about.
  • Correct for multiple comparisons. Run 20 independent tests at α = 0.05 and on average you will get one significant result by chance even if every null is true; the chance of at least one false positive is 1 − 0.95²⁰ ≈ 64 %. Bonferroni and Holm corrections control the family-wise error rate, while Benjamini–Hochberg controls the false-discovery rate.
  • Pre-register the analysis. Decide before looking at the data which test you will run, which variables go in, and what counts as a positive finding. Changing the analysis after seeing the data is the single biggest source of irreproducible results in the published literature.
  • Replicate. A single p < 0.05 is weak evidence on its own, no matter how the marketing materials phrase it. Independent replication is what turns a statistical curiosity into a finding worth acting on.
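The multiple-comparisons point in the list above lends itself to a short sketch. The three corrections below follow their standard definitions; the p-values are hypothetical and the function names are illustrative.

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) across independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down procedure (also controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first p-value that fails its threshold
        reject[i] = True
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false-discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

fwer_20 = familywise_error(0.05, 20)
p_values = [0.001, 0.012, 0.014, 0.035, 0.20]  # hypothetical results
print(round(fwer_20, 2))                     # 0.64
print(sum(bonferroni(p_values)))             # 1 rejection
print(sum(holm(p_values)))                   # 3 rejections
print(sum(benjamini_hochberg(p_values)))     # 4 rejections
```

On the same inputs Bonferroni is the most conservative and Benjamini–Hochberg the least, reflecting the different error rates they control.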

Common mistakes

"p-value = probability the null is true"

This is the textbook misinterpretation, and it is wrong. The p-value is a probability about the data, conditional on the null being true, not a probability about the null, conditional on the data. The latter is a Bayesian posterior and requires a prior. If you want a probability that the hypothesis is true, you are asking a different question that needs different machinery (Bayes factors, posterior probabilities).

Using z when n is small

For small samples, the t-distribution's heavier tails produce noticeably larger p-values than the normal does for the same statistic. Reaching for a z table with n = 8 is a reliable way to halve your p-value — and a reliable way to get caught by a reviewer who actually reads the methods section.

Switching from two-tailed to one-tailed after the fact

This is one of the most common forms of p-hacking. If the alternative hypothesis was directional from the start, say so in the pre-registration. If it was not, two-tailed is the only honest choice.

Treating p = 0.049 and p = 0.051 as fundamentally different

They are not. The two values differ by about 4 %, the two-tailed z statistics behind them (roughly 1.97 and 1.95) are almost identical, and the conventional cliff at 0.05 has no physical meaning. Two adjacent p-values straddling the threshold contain virtually the same evidence. A more useful framing is the full likelihood ratio or a Bayes factor; failing that, report the exact p-value and let the reader weigh it.

When to seek professional advice

Calculator-level p-values are appropriate for one-off analyses where the test type and degrees of freedom are obvious. For high-stakes work (clinical trials, regulatory submissions, A/B tests that drive product decisions at scale), work with a statistician on study design, power calculations, multiple-testing corrections and pre-registration. The maths on this page is straightforward; the harder calls are about what to test, what to control for, and what to do when the assumptions of the test do not hold.

Frequently asked questions

What does a p-value actually tell me?

It is the probability of observing a test statistic at least as extreme as yours, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true, and it is not the probability that your result is due to chance. A small p-value means your data are unlikely under H₀ — it does not by itself say anything about effect size, practical importance, or the probability that any specific alternative hypothesis is correct.

When should I use a one-tailed vs a two-tailed test?

Use two-tailed whenever your alternative hypothesis is "different from" (H₁: μ ≠ μ₀). Use one-tailed only when your alternative is directional and you specified the direction before looking at the data. Switching to one-tailed after seeing the result so you can halve the p-value is data dredging: it inflates the false-positive rate, and reviewers who notice it are likely to object. When in doubt, go two-tailed.

Why does my p-value not match what R or scipy gives?

For typical statistics the calculator agrees with R's pnorm/pt/pchisq, Python's scipy.stats.norm.cdf/t.cdf/chi2.cdf and a TI-84 to at least four decimal places. If you see a mismatch, check three things. First, the tail — some software returns one-tailed p by default. Second, the sign of your statistic — for one-tailed tests this matters. Third, degrees of freedom — make sure you are passing n − 1 (or the right residual df), not n.

Why is the χ² option locked to one-tailed (right)?

Standard chi-squared tests are right-tailed by construction. The test statistic is a sum of squared standardised differences, so larger values always mean stronger evidence against H₀, and small values just mean the observed counts are close to expected — never a reason to reject H₀. Software that exposes a left-tailed χ² is testing something else entirely (e.g. a variance test against a lower bound), rare enough that this calculator does not surface it.

Is p < 0.05 the same as "the result matters"?

No. The 0.05 threshold is a convention, not a law of nature, and with a large enough sample any trivial effect will eventually reach significance. A p-value below 0.05 with a large sample can correspond to an effect too small to act on; a p-value above 0.05 with a small sample can hide an important effect. Report the effect size and a confidence interval alongside the p-value — that combination tells you both whether the effect is real and whether it is big enough to care about.

How precise are these p-values?

The incomplete gamma and incomplete beta routines use Lentz's continued fraction with a 10⁻¹⁴ relative tolerance and up to 400 iterations, enough to converge to machine precision across the usual statistical range. The Abramowitz & Stegun erf used for the normal CDF has |error| ≤ 1.5 × 10⁻⁷. For p-values reported to four decimal places, none of these approximation errors are visible.

What is the difference between α and the p-value?

α (the significance level) is a threshold you set before looking at the data — typically 0.05 — and represents your tolerated false-positive rate. The p-value is what you compute after looking at the data. You reject H₀ if the p-value is less than α. They live on the same scale but play different roles.

Where does the 0.05 threshold come from?

Fisher proposed 0.05 as a convenient cut-off in his 1925 book Statistical Methods for Research Workers, partly because the pre-computed normal and t tables he used were laid out at that level. There is nothing fundamental about it; particle physics uses a 5σ standard (a one-tailed p of about 3 × 10⁻⁷), and other fields pick stricter or looser thresholds depending on the cost of a false positive.
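The 5σ figure is a one-line conversion from a number of standard deviations to a one-tailed normal tail area. A minimal sketch, with a hypothetical function name, using the complementary error function so that very small tails keep their precision:

```python
import math

def one_tailed_p_from_sigma(n_sigma):
    """Upper-tail area of the standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

p_5_sigma = one_tailed_p_from_sigma(5)
p_2_sigma = one_tailed_p_from_sigma(1.96)
print(f"{p_5_sigma:.2e}")   # 2.87e-07
print(round(p_2_sigma, 3))  # 0.025
```

For comparison, the conventional two-tailed 0.05 corresponds to only about 1.96σ.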
