Tests for Means
Tests for means compare observed sample means with null hypotheses about population means. They are among the most common procedures in introductory statistics because many variables are quantitative: reaction times, exam scores, blood pressure, monthly revenue, machine fill volume, and so on. The Lane text discusses tests of a single mean, independent-group differences, and correlated-pair differences as separate cases because the standard error changes with the design.
The central question is not simply how many means are being compared. It is whether observations are independent, paired, or grouped. A before-after study on the same people is not an independent two-sample study. A treatment-control comparison with different people in each group is not a paired study. Choosing the wrong test means choosing the wrong standard error, which changes the p-value and interval.
Definitions
A one-sample test compares a sample mean $\bar{x}$ with a hypothesized population mean $\mu_0$:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

with $df = n - 1$ under the usual assumptions.
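As a minimal sketch, the one-sample statistic can be computed by hand and checked against scipy; the fill volumes and the 500 ml target below are made-up values for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical fill volumes (ml); the target mean of 500 is an illustrative assumption
x = np.array([498.2, 501.5, 499.8, 497.9, 500.4, 498.8, 499.1, 500.9])
mu0 = 500.0

# By hand: t = (xbar - mu0) / (s / sqrt(n)), with df = n - 1
n = len(x)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Same test via scipy
t_scipy, p = stats.ttest_1samp(x, popmean=mu0)

print(f"t by hand = {t_manual:.3f}, t from scipy = {t_scipy:.3f}, df = {n - 1}, p = {p:.3f}")
```

The two computations should agree exactly; the manual version makes the standard error visible, which is the quantity that changes across the designs below.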
An independent two-sample test compares means from two unrelated groups. The Welch statistic is

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\Delta_0$ is the null difference, usually 0. Welch's test uses an approximate degrees of freedom formula and does not require equal variances.
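For reference, the approximation in question is the standard Welch–Satterthwaite formula:

$$\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

The result is generally not an integer; software uses it directly rather than rounding.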
A pooled two-sample test assumes equal population variances. It estimates a common variance by

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Because equal variances are often uncertain, Welch's test is a safer default in many applied settings.
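A quick sketch of how much the choice can matter in practice, using scipy with made-up summary statistics in which one group has a much larger spread and a smaller sample:

```python
from scipy import stats

# Illustrative summaries (assumed values): group 2 has a larger SD and smaller n
m1, s1, n1 = 50.0, 10.0, 40
m2, s2, n2 = 55.0, 25.0, 12

# Pooled test assumes equal population variances
t_pooled, p_pooled = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
# Welch test does not
t_welch, p_welch = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")
```

When the variances and sample sizes are both unequal, the two tests can give noticeably different p-values; when they are similar, the tests largely agree.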
A paired test compares two measurements taken on matched units or the same unit twice. Define differences

$$d_i = x_{i,\text{after}} - x_{i,\text{before}}$$

Then test the mean difference with a one-sample test on the $d_i$ values:

$$t = \frac{\bar{d} - \mu_{d,0}}{s_d/\sqrt{n}}$$

The usual null is $\mu_d = 0$.
Key results
The design determines the denominator. For a one-sample mean, the standard error is $s/\sqrt{n}$. For independent groups, the standard error combines two independent sources of variation:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

For paired data, the standard error uses the variability of differences:

$$SE = \frac{s_d}{\sqrt{n}}$$
Pairing is useful when units differ substantially at baseline but within-unit changes are measured precisely. If each patient serves as their own control, the paired analysis removes some between-person variation.
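A minimal simulated sketch of this point, with made-up before/after data in which people differ a lot at baseline but each person changes by a similar small amount:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
baseline = rng.normal(100, 20, n)        # large between-person spread
after = baseline + rng.normal(3, 2, n)   # small, consistent within-person change

diff = after - baseline

# Independent-style SE combines the full spread of both measurements
se_indep = np.sqrt(baseline.var(ddof=1) / n + after.var(ddof=1) / n)
# Paired SE uses only the spread of the differences
se_paired = diff.std(ddof=1) / np.sqrt(n)

print(f"independent-style SE = {se_indep:.2f}, paired SE = {se_paired:.2f}")
```

Because the between-person variation cancels in the differences, the paired standard error is far smaller here, which is exactly the gain from letting each unit serve as its own control.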
Assumptions for procedures include independent sampling or independent pairs, quantitative response data, and a sampling distribution of the mean or mean difference that is approximately normal. For small samples, the raw data or differences should not show severe skewness or extreme outliers. For large samples, the central limit theorem makes methods more robust, though dependence and biased sampling remain serious problems.
A test for means should usually be accompanied by a confidence interval and an effect size. A statistically significant difference of 0.2 points on a 100-point scale may be unimportant; a non-significant difference of 8 points in a small pilot study may still be practically promising.
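As a sketch of what that report might include, the snippet below computes a 95% confidence interval and a standardized effect size (Cohen's d for paired differences, defined as the mean difference over the SD of the differences) from illustrative, made-up difference data:

```python
import numpy as np
from scipy import stats

# Hypothetical paired differences (illustrative values)
d = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 2.0, 3.5, 2.5])
n = len(d)
mean_d = d.mean()
se = d.std(ddof=1) / np.sqrt(n)

# 95% confidence interval for the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean_d - t_crit * se, mean_d + t_crit * se)

# Cohen's d for paired data: mean difference / SD of differences
cohens_d = mean_d / d.std(ddof=1)

print(f"mean difference = {mean_d:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), d = {cohens_d:.2f}")
```

Reporting the interval alongside the p-value shows both the direction and the plausible size of the effect, not just whether it cleared a threshold.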
The same sample summaries can lead to different conclusions when the research question changes. If the goal is quality control, a one-sample test may compare a process mean with a fixed target. If the goal is a treatment comparison, an independent or paired design determines the analysis. If the goal is equivalence or noninferiority, the null and alternative are reversed from the usual "no difference" test, and a standard two-sided significance test is not enough. For introductory work, the main discipline is to write the parameter and hypotheses in words before calculating. That step exposes whether the mean, mean difference, or paired mean difference is really the quantity of interest.
Graphical checks should use the unit of analysis. For paired data, graph the differences, not just the two marginal distributions before and after. For independent groups, compare group histograms or box plots and look for severe imbalance in spread, skewness, or outliers. These checks do not replace the test, but they explain whether the test is summarizing the data in a defensible way.
Visual
| Design | Data structure | Statistic | Degrees of freedom |
|---|---|---|---|
| One-sample | one quantitative variable | $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | $n - 1$ |
| Independent groups | separate groups | Welch $t$ | approximate (Welch–Satterthwaite) |
| Equal-variance groups | separate groups | pooled $t$ | $n_1 + n_2 - 2$ |
| Paired | differences within pairs | $t = \dfrac{\bar{d}}{s_d/\sqrt{n}}$ | $n - 1$ pairs |
Worked example 1: Welch two-sample test
Problem: A study compares weekly exercise minutes for two independent groups. Group A has $n_1 = 18$, $\bar{x}_1 = 142$, and $s_1 = 35$. Group B has $n_2 = 20$, $\bar{x}_2 = 118$, and $s_2 = 30$. Test at $\alpha = 0.05$ whether the population means differ.
Method:
- State hypotheses: $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \neq \mu_2$
- Difference in sample means: $\bar{x}_1 - \bar{x}_2 = 142 - 118 = 24$
- Welch standard error: $SE = \sqrt{\dfrac{35^2}{18} + \dfrac{30^2}{20}}$
- Compute components: $\dfrac{1225}{18} \approx 68.06$ and $\dfrac{900}{20} = 45.00$
- Continue: $SE = \sqrt{113.06} \approx 10.63$
- Test statistic: $t = \dfrac{24}{10.63} \approx 2.26$
- Welch degrees of freedom are approximate. Software gives about $df \approx 33.7$ for these values.
- The two-sided p-value for $t \approx 2.26$ with about 34 degrees of freedom is about 0.030.
Answer: Reject at the 0.05 level. The data provide evidence that the population mean weekly exercise minutes differ, with Group A higher by about 24 minutes in the sample.
Checked answer: The standard error is about 10.6, so the observed difference is a little more than two standard errors from zero, consistent with a p-value below 0.05.
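The approximate degrees of freedom quoted above can be reproduced by plugging the problem's summary statistics into the Welch–Satterthwaite formula:

```python
# Welch-Satterthwaite degrees of freedom from the summary statistics in the problem
s1, n1 = 35.0, 18   # Group A
s2, n2 = 30.0, 20   # Group B

v1 = s1**2 / n1
v2 = s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"Welch df = {df:.1f}")  # about 33.7
```

Software applies this formula internally, which is why the reported degrees of freedom are usually not a whole number.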
Worked example 2: Paired test on before-after data
Problem: Eight participants complete a memory task before and after a training session. Scores are:
| Participant | Before | After |
|---|---|---|
| 1 | 18 | 22 |
| 2 | 20 | 23 |
| 3 | 17 | 19 |
| 4 | 21 | 24 |
| 5 | 19 | 21 |
| 6 | 16 | 20 |
| 7 | 22 | 25 |
| 8 | 18 | 20 |
Test whether mean score improved.
Method:
- Compute differences after minus before: $d = 4, 3, 2, 3, 2, 4, 3, 2$
- Mean difference: $\bar{d} = \dfrac{23}{8} = 2.875$
- Deviations from 2.875 are $1.125, 0.125, -0.875, 0.125, -0.875, 1.125, 0.125, -0.875$.
- Squared deviations sum to $4.875$
- Sample variance of differences: $s_d^2 = \dfrac{4.875}{7} \approx 0.696$
- Standard deviation: $s_d \approx 0.834$
- Standard error: $SE = \dfrac{0.834}{\sqrt{8}} \approx 0.295$
- Test statistic for $H_0: \mu_d = 0$ versus $H_a: \mu_d > 0$: $t = \dfrac{2.875}{0.295} \approx 9.74$
- Degrees of freedom: $df = 8 - 1 = 7$
Answer: The improvement is statistically significant by any common significance level. The paired structure shows a consistent positive gain for every participant, with mean improvement 2.875 points.
Checked answer: Because all eight differences are positive and tightly clustered between 2 and 4, a very large statistic is plausible.
Code
```python
import numpy as np
from scipy import stats

# Welch two-sample test from summaries (Worked example 1)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=142, std1=35, nobs1=18,
    mean2=118, std2=30, nobs2=20,
    equal_var=False,
)
print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")

# Paired test from raw before-after scores (Worked example 2)
before = np.array([18, 20, 17, 21, 19, 16, 22, 18])
after = np.array([22, 23, 19, 24, 21, 20, 25, 20])
diff = after - before
print("mean difference:", diff.mean())
print(stats.ttest_rel(after, before, alternative="greater"))
```
The independent-group call uses only summary statistics; the paired call uses raw paired scores. That distinction mirrors the design distinction in the formulas.
Common pitfalls
- Using an independent two-sample test for paired before-after data.
- Using a paired test for two unrelated groups because the sample sizes happen to match.
- Assuming non-significance proves equal means.
- Choosing the pooled test without checking whether equal variances are plausible.
- Ignoring outliers in small samples, where a single value can dominate the mean and standard deviation.
- Reporting the p-value without the observed mean difference and confidence interval.
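The first two pitfalls can be seen directly in code. This sketch reuses the before/after scores from Worked example 2 and compares the correct paired test with a (wrong for this design) independent test on the same numbers:

```python
import numpy as np
from scipy import stats

before = np.array([18, 20, 17, 21, 19, 16, 22, 18])
after = np.array([22, 23, 19, 24, 21, 20, 25, 20])

# Correct: paired test on within-person differences
t_paired, p_paired = stats.ttest_rel(after, before)
# Wrong for this design: independent test ignores the pairing
t_indep, p_indep = stats.ttest_ind(after, before, equal_var=False)

print(f"paired:      t = {t_paired:.2f}, p = {p_paired:.5f}")
print(f"independent: t = {t_indep:.2f}, p = {p_indep:.5f}")
```

The independent test treats the large person-to-person spread as noise against the 2.875-point gain, so its statistic is much smaller and its p-value much larger than the paired analysis that the design actually calls for.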