
Summarizing Distributions

Numerical summaries compress a distribution into a few interpretable quantities. The Lane text treats central tendency, variability, percentiles, and shapes of distributions as foundational because later inference depends on them. A confidence interval for a mean uses $\bar{x}$ and $s$; a z-score uses a mean and a standard deviation; an ANOVA partitions variability; regression measures how much variability is explained by a line.

Compression always loses information. A mean does not show skewness, a standard deviation does not show outliers, and a percentile does not show whether nearby values are common or rare. Good summary work therefore pairs numbers with a graph and chooses statistics whose meanings fit the distribution shape and measurement level.

Definitions

The mean of $n$ observations $x_1,\dots,x_n$ is

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i.$$

It is the balance point of the data and uses every value. The median is the middle value after sorting, or the average of the two middle values when $n$ is even. It is resistant to extreme values. The mode is the most frequent value or category. A distribution can be unimodal, bimodal, multimodal, or have no repeated value.
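These three measures of center can be checked directly with Python's standard library. The small sample below is purely illustrative; note that 10 appears twice, so it is the mode.

```python
from statistics import mean, median, mode

# Illustrative sample of seven observations.
data = [9, 10, 10, 11, 12, 13, 19]

print(mean(data))    # balance point: 84 / 7 = 12
print(median(data))  # 4th of 7 sorted values: 11
print(mode(data))    # most frequent value: 10
```

Because the median looks only at the middle of the sorted list, changing the largest value from 19 to 190 would leave it at 11 while the mean would jump.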

The range is $\max(x)-\min(x)$. The interquartile range is $\mathrm{IQR}=Q_3-Q_1$, the width of the middle half of the data. The sample variance is

$$s^2=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1},$$

and the sample standard deviation is $s=\sqrt{s^2}$. The denominator $n-1$ gives the usual unbiased estimator of the population variance under random sampling.
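A minimal sketch of these spread measures with NumPy, on a made-up sample. One caveat: textbooks use several different quartile conventions, and `np.percentile` defaults to linear interpolation, so a hand-computed IQR may differ slightly.

```python
import numpy as np

# Illustrative sample; the spread definitions above apply directly.
x = np.array([9, 10, 10, 11, 12, 13, 19])

data_range = x.max() - x.min()        # range: max - min
q1, q3 = np.percentile(x, [25, 75])   # quartiles (interpolated convention)
iqr = q3 - q1                         # width of the middle half
s2 = x.var(ddof=1)                    # sample variance, denominator n - 1
s = x.std(ddof=1)                     # sample standard deviation

print(data_range, iqr, round(s2, 4), round(s, 2))
```

The `ddof=1` argument matters: without it, NumPy divides by $n$ rather than $n-1$.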

A percentile indicates relative standing. The 80th percentile is a value at or below which about 80% of observations fall. A z-score standardizes an observation:

$$z=\frac{x-\mu}{\sigma}$$

for a population, or approximately

$$z=\frac{x-\bar{x}}{s}$$

when using sample summaries descriptively. Positive z-scores are above the mean; negative z-scores are below.
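Both ideas of relative standing can be illustrated on a small made-up sample. `scipy.stats.percentileofscore` with `kind="weak"` reports the percentage of observations at or below a given value, matching the percentile definition above.

```python
import numpy as np
from scipy import stats

x = np.array([9, 10, 10, 11, 12, 13, 19])

# Relative standing of the value 13: percent of observations at or below it.
pct = stats.percentileofscore(x, 13, kind="weak")

# Descriptive z-score of the same value, using sample mean and sample sd.
z = (13 - x.mean()) / x.std(ddof=1)

print(round(pct, 1), round(z, 2))
```

Here 6 of the 7 observations are at or below 13, so its percentile rank is about 85.7, while its z-score is a modest 0.3: the two summaries answer related but distinct questions.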

Skewness describes asymmetry. A right-skewed distribution has a long right tail and often has mean greater than median. A left-skewed distribution has a long left tail and often has mean less than median. Kurtosis describes tail heaviness or peakedness relative to a reference distribution, though in introductory work it is more important to recognize outliers and tail behavior visually than to memorize a single kurtosis rule.

Key results

The mean is sensitive to linear transformations. If every observation is transformed by $y_i=a+bx_i$, then

$$\bar{y}=a+b\bar{x}.$$

The standard deviation is affected by scale but not by shifts:

$$s_y=|b|\,s_x.$$

Adding 10 points to every exam score raises the mean and median by 10 but leaves the standard deviation and IQR unchanged. Multiplying every measurement by 2 doubles the mean, median, standard deviation, and IQR.

Deviations from the mean always sum to zero:

$$\sum_{i=1}^{n}(x_i-\bar{x})=0.$$

This identity explains why squared deviations are used: simple deviations always cancel around the mean. Squaring makes distances positive and gives larger penalties to observations far from the center.
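A quick numerical check of the identity, using a small made-up sample:

```python
import numpy as np

x = np.array([9, 10, 10, 11, 12, 13, 19])
dev = x - x.mean()

print(dev.sum())         # 0.0: raw deviations cancel around the mean
print((dev ** 2).sum())  # 68.0: squaring removes the cancellation
```

The second sum is the numerator of the sample variance, so the squared deviations carry all the information about spread that the raw deviations throw away.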

For symmetric unimodal data without strong outliers, the mean and standard deviation are usually informative. For skewed data or data with outliers, the median and IQR often describe typical values better. For nominal categorical data, the mode and proportions are meaningful, while means and standard deviations of arbitrary category codes are not.

The empirical rule applies approximately to bell-shaped distributions:

$$\begin{aligned} 68\% &\text{ of values lie within } 1 \text{ standard deviation of the mean} \\ 95\% &\text{ of values lie within } 2 \text{ standard deviations of the mean} \\ 99.7\% &\text{ of values lie within } 3 \text{ standard deviations of the mean}. \end{aligned}$$

This rule is descriptive and approximate. It should not be applied blindly to strongly skewed, bounded, discrete, or multimodal distributions.
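A simulation sketch of the rule on synthetic bell-shaped data (the sample is generated, not real measurements); on skewed data the same loop would produce noticeably different fractions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)  # bell-shaped data, where the rule should hold

for k in (1, 2, 3):
    within = np.mean(np.abs(x - x.mean()) <= k * x.std(ddof=1))
    print(f"within {k} sd: {within:.3f}")
```

The printed fractions land very close to 0.68, 0.95, and 0.997, matching the rule to within sampling error.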

Summary choice should also follow the decision being made. If a city reports household income to describe the "typical" resident, the median is often preferred because a few very high incomes can pull the mean upward. If a factory monitors fill weights from a stable machine, the mean and standard deviation are often more useful because symmetric random variation around a target is expected. If a teacher reports class performance, the median, quartiles, and score distribution may be more informative than the mean alone. The right summary is the one that preserves the feature of the data most relevant to the question while making its limitations visible.

For any numerical summary, attach units and sample context. A standard deviation of 4 means very different things if the unit is seconds, dollars, kilograms, or points on a 5-point scale. Likewise, a median computed from 12 observations should be presented more cautiously than a median computed from 12,000 observations collected by a careful sampling design.

Visual

| Summary | Formula or rule | Resistant to outliers? | Best use |
| --- | --- | --- | --- |
| Mean | $\bar{x}=\sum x_i/n$ | No | Symmetric quantitative data |
| Median | middle sorted value | Yes | Skewed quantitative or ordinal data |
| Mode | most frequent value | Often | Categorical data, repeated values |
| Range | $\max-\min$ | No | Quick total spread |
| IQR | $Q_3-Q_1$ | Yes | Middle spread, box plots |
| Standard deviation | $s=\sqrt{\sum(x_i-\bar{x})^2/(n-1)}$ | No | Typical distance for roughly symmetric data |
| z-score | $(x-\bar{x})/s$ | No | Relative standing on a common scale |
Mean vs median in a right-skewed distribution

frequency
^
| #####
| #########
| ############
| ##########
| ######
| ###
| #
+--------------------------------> value
        median        mean

Worked example 1: Mean, median, variance, and standard deviation

Problem: A small lab records the number of minutes needed to process seven samples:

9, 10, 10, 11, 12, 13, 19.

Find the mean, median, sample variance, and sample standard deviation. Comment on the high value 19.

Method:

  1. Add the observations:
     $$9+10+10+11+12+13+19=84.$$
  2. Divide by $n=7$:
     $$\bar{x}=84/7=12.$$
  3. The data are already sorted. With seven values, the median is the 4th value:
     $$\mathrm{median}=11.$$
  4. Compute deviations from the mean and square them:

     | $x_i$ | $x_i-\bar{x}$ | $(x_i-\bar{x})^2$ |
     | --- | --- | --- |
     | 9 | -3 | 9 |
     | 10 | -2 | 4 |
     | 10 | -2 | 4 |
     | 11 | -1 | 1 |
     | 12 | 0 | 0 |
     | 13 | 1 | 1 |
     | 19 | 7 | 49 |

  5. Sum the squared deviations:
     $$9+4+4+1+0+1+49=68.$$
  6. Divide by $n-1=6$:
     $$s^2=68/6\approx 11.3333.$$
  7. Take the square root:
     $$s=\sqrt{11.3333}\approx 3.37.$$

Answer: The mean is 12 minutes, the median is 11 minutes, the sample variance is about 11.33 square minutes, and the sample standard deviation is about 3.37 minutes. The value 19 pulls the mean above the median and contributes 49 of the 68 squared-deviation total, so it strongly affects the standard deviation.

Checked answer: The deviations add to $-3-2-2-1+0+1+7=0$, which confirms the mean arithmetic.

Worked example 2: Percentiles and z-scores

Problem: A student scored 86 on an exam. The class mean was 74 and the sample standard deviation was 8. Another exam in a different course had mean 62 and standard deviation 12, and the same student scored 80. On which exam was the student farther above the class average?

Method:

  1. Standardize the first score:
     $$z_1=\frac{86-74}{8}=\frac{12}{8}=1.50.$$
  2. Standardize the second score:
     $$z_2=\frac{80-62}{12}=\frac{18}{12}=1.50.$$
  3. Compare the z-scores, not the raw score differences.

Answer: The student was equally far above average on both exams: 1.5 standard deviations above the mean. The first raw difference was 12 points and the second was 18 points, but the second course had more spread, so the standardized standing is the same.

Checked answer: Both standardized differences reduce to $1.50$. If the distributions are roughly bell-shaped, a score 1.5 standard deviations above the mean is around the upper tail but not extremely rare.
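The comparison is short enough to script. The helper function name below is ours, not from the text; the numbers are the ones in the example.

```python
# Standardize each exam score with its own class mean and sd.
def z_score(x, mean, sd):
    return (x - mean) / sd

z1 = z_score(86, 74, 8)   # first exam: 12 points above, sd 8
z2 = z_score(80, 62, 12)  # second exam: 18 points above, sd 12
print(z1, z2)             # 1.5 1.5: equal standardized standing
```

The raw differences (12 vs. 18 points) are incomparable across courses; dividing by each course's spread puts both scores on the same scale.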

Code

import numpy as np
from scipy import stats

x = np.array([9, 10, 10, 11, 12, 13, 19])

mean = x.mean()
median = np.median(x)
sample_var = x.var(ddof=1)
sample_sd = x.std(ddof=1)
z_scores = (x - mean) / sample_sd

print({"mean": mean, "median": median, "variance": sample_var, "sd": sample_sd})
print("z-scores:", np.round(z_scores, 2))
print("skewness:", stats.skew(x, bias=False))
print("kurtosis excess:", stats.kurtosis(x, bias=False))

The argument ddof=1 requests the sample variance denominator $n-1$. Without it, NumPy uses the population denominator $n$, which is appropriate only when the data are the whole population being summarized.

Common pitfalls

  • Reporting only the mean for a skewed distribution where the median better represents a typical case.
  • Forgetting that variance is in squared units while standard deviation is in the original units.
  • Mixing population and sample formulas without thinking about whether the data are a full population or a sample.
  • Treating percentiles as percentages correct. A percentile is a location in a distribution, not a test score scale.
  • Applying the empirical rule to a strongly skewed or multimodal distribution.
  • Deleting high or low observations because they affect the mean, instead of investigating whether they are valid.

Connections