Linear Regression Inference
Linear regression models the mean of a quantitative response as a linear function of one or more predictors. In simple linear regression, one predictor is used to predict a response through a fitted line. The Lane text introduces regression after correlation because regression adds direction: it distinguishes the predictor from the response, estimates a slope, and supports prediction, diagnostics, and inference about the linear relationship.
A regression line is not merely a line drawn through a scatterplot by eye. It is the line that minimizes the sum of squared residuals. That least-squares criterion gives precise formulas, but the model must still be checked. Outliers, curvature, nonconstant variance, dependence, and extrapolation can make a technically correct line misleading.
Definitions
The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$

where $\beta_0$ is the intercept, $\beta_1$ is the population slope, and $\varepsilon_i$ is the random error for observation $i$. The fitted line is

$$\hat{y} = b_0 + b_1 x.$$

The slope $b_1$ estimates the expected change in $y$ for a one-unit increase in $x$. The intercept $b_0$ estimates the mean response when $x = 0$, if $x = 0$ is meaningful and within the range of the data.
The residual for observation $i$ is

$$e_i = y_i - \hat{y}_i.$$

Least squares chooses $b_0$ and $b_1$ to minimize

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

For simple linear regression,

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r \, \frac{s_y}{s_x},$$

and

$$b_0 = \bar{y} - b_1 \bar{x}.$$

The coefficient of determination is

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}},$$

where

$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

In simple regression with an intercept, $R^2 = r^2$.
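As a quick check of these definitions, the formulas can be computed directly with NumPy. The sketch below uses small made-up data (the values are illustrative, not from the text) to confirm that the deviation-sum slope matches $r \, s_y / s_x$ and that $R^2 = r^2$.

```python
import numpy as np

# Toy data, hypothetical values chosen only to illustrate the formulas.
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 6.8])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares slope and intercept from the definition formulas.
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Equivalent slope via the correlation identity b1 = r * s_y / s_x.
r = np.corrcoef(x, y)[0, 1]
b1_via_r = r * y.std(ddof=1) / x.std(ddof=1)

# R^2 from SSE and SST, and the identity R^2 = r^2.
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y_bar) ** 2)
r_squared = 1 - sse / sst

print(b1, b1_via_r)       # the two slope formulas agree
print(r_squared, r ** 2)  # R^2 equals r^2 in simple regression
```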
Key results
Inference for the slope commonly tests

$$H_0: \beta_1 = 0$$

against a one- or two-sided alternative. The test statistic is

$$t = \frac{b_1}{\text{SE}(b_1)},$$

with $df = n - 2$ in simple regression. A confidence interval for the slope is

$$b_1 \pm t^{*}_{n-2} \, \text{SE}(b_1).$$

The standard error of the estimate, also called the residual standard error, is

$$s_e = \sqrt{\frac{\text{SSE}}{n - 2}}.$$
It measures typical vertical scatter around the fitted line in response units.
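These formulas can be applied from summary statistics alone. The sketch below, using hypothetical values for $n$, $b_1$, and $\text{SE}(b_1)$, computes the t statistic, two-sided p-value, and 95% confidence interval with scipy.stats.

```python
from scipy import stats

# Hypothetical fitted results, for illustration only.
n = 25       # sample size
b1 = 1.9     # estimated slope
se_b1 = 0.6  # standard error of the slope
df = n - 2   # residual degrees of freedom in simple regression

# t statistic and two-sided p-value for H0: beta1 = 0.
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df)

# 95% confidence interval for the slope.
t_crit = stats.t.ppf(0.975, df)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(t_stat, p_value, ci)
```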
Regression assumptions for classical inference include linear mean structure, independent errors, constant error variance, and approximately normal errors for small-sample p-values and intervals. The predictor values need not be normally distributed. Residual plots are central: plot residuals against fitted values and predictors to check curvature and changing spread; use Q-Q plots to assess normality; inspect leverage and influence.
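A minimal diagnostic sketch, using simulated data for illustration: it fits an OLS line with statsmodels, then draws a residuals-versus-fitted plot alongside a Q-Q plot of the residuals.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
y = 5 + 2 * x + rng.normal(0, 1.5, size=x.size)

model = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: look for curvature or changing spread.
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].axhline(0, linestyle="--")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

# Q-Q plot of residuals: points near the line suggest approximate normality.
sm.qqplot(model.resid, line="45", fit=True, ax=axes[1])

plt.tight_layout()
plt.show()
```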
A confidence interval for the mean response at a given $x$ estimates the average among cases with that predictor value. A prediction interval estimates a new individual response at that same $x$. Prediction intervals are wider because they include both uncertainty in the mean and individual-level noise.
Extrapolation occurs when using the line outside the observed range of $x$. It can be very misleading because linear patterns often hold only locally.
Regression interpretation should separate prediction, explanation, and causation. A model can predict well without revealing a causal mechanism, especially when predictors are proxies for other variables. A model can estimate an association after adjustment for measured covariates, but unmeasured confounding may remain. A randomized experiment with a regression analysis can support stronger causal language because treatment assignment is controlled by design. In observational regression, use language such as "is associated with" or "predicts" unless the design and assumptions justify a causal claim.
Multiple regression extends the same idea to several predictors. A coefficient then estimates the expected change in the response for a one-unit increase in that predictor, holding the other predictors in the model constant. That phrase is powerful but easy to overstate: it means statistically adjusted within the fitted model, not physically held constant by an experiment.
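A minimal sketch of this extension, using the study-hours data alongside a hypothetical second predictor (`sleep`, invented for illustration): each reported coefficient is adjusted for the other predictor in the model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Study-hours data plus a hypothetical "sleep" predictor.
df = pd.DataFrame({
    "score": [68, 70, 78, 82, 88, 91],
    "hours": [2, 3, 5, 6, 8, 9],
    "sleep": [7, 6, 8, 7, 8, 9],
})

# Each coefficient is adjusted for the other predictor in the model.
model = smf.ols("score ~ hours + sleep", data=df).fit()
print(model.params)  # intercept and one coefficient per predictor
```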
Influence diagnostics ask whether one or a few observations are driving the fitted relationship. A point with an unusual $x$ value has high leverage; a point with a large residual has poor fit; a point with both can strongly change the slope. Removing such a point without justification is not acceptable, but fitting the model with and without it can reveal whether the conclusion is stable. If the story changes completely, the final analysis should say so and investigate the observation's source.
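statsmodels exposes these diagnostics through a fitted model's `get_influence()` method. The sketch below uses made-up data containing one deliberately unusual point to show leverage, standardized residuals, and Cook's distance.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data with one deliberately unusual last point.
x = np.array([1, 2, 3, 4, 5, 15], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 40.0])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

print(influence.hat_matrix_diag)             # leverage per observation
print(influence.resid_studentized_internal)  # standardized residuals
print(influence.cooks_distance[0])           # Cook's distance per observation
```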
Good regression reporting includes the fitted equation, units, uncertainty for key coefficients, residual diagnostics, and the observed predictor range.
Visual
| Quantity | Formula | Interpretation |
|---|---|---|
| Slope | $b_1 = r \, s_y / s_x$ | predicted change in $y$ per one-unit increase in $x$ |
| Intercept | $b_0 = \bar{y} - b_1 \bar{x}$ | predicted $y$ at $x = 0$ |
| Residual | $e_i = y_i - \hat{y}_i$ | vertical prediction error |
| SSE | $\sum (y_i - \hat{y}_i)^2$ | unexplained variation |
| $R^2$ | $1 - \text{SSE}/\text{SST}$ | proportion of sample variation explained |
| Slope test | $t = b_1 / \text{SE}(b_1)$ | evidence of nonzero linear slope |
Worked example 1: Fitting a least-squares line
Problem: A small data set records study hours $x$ and exam scores $y$:
| Student | $x$ (hours) | $y$ (score) |
|---|---|---|
| A | 2 | 68 |
| B | 3 | 70 |
| C | 5 | 78 |
| D | 6 | 82 |
| E | 8 | 88 |
| F | 9 | 91 |
Find the least-squares line and predict the score for 7 hours.
Method:
- From the correlation page, the means are $\bar{x} = 5.5$ and $\bar{y} = 79.5$.
- Compute $\sum (x_i - \bar{x})(y_i - \bar{y}) = 127.5$.
- Compute $\sum (x_i - \bar{x})^2 = 37.5$.
- Slope: $b_1 = 127.5 / 37.5 = 3.4$.
- Intercept: $b_0 = \bar{y} - b_1 \bar{x}$.
- Calculate: $b_0 = 79.5 - 3.4 \times 5.5 = 79.5 - 18.7 = 60.8$.
- Fitted line: $\hat{y} = 60.8 + 3.4x$.
- Predict at $x = 7$: $\hat{y} = 60.8 + 3.4 \times 7 = 84.6$.
Answer: The fitted line is $\hat{y} = 60.8 + 3.4x$. For 7 study hours, the predicted exam score is about 84.6.
Checked answer: The slope is positive, matching the scatterplot. The prediction at 7 hours lies between the observed scores for 6 and 8 hours, which is reasonable.
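A short NumPy check of the arithmetic above, using the data from the table:

```python
import numpy as np

# Data from worked example 1.
x = np.array([2, 3, 5, 6, 8, 9], dtype=float)
y = np.array([68, 70, 78, 82, 88, 91], dtype=float)

# Deviation sums, slope, and intercept exactly as in the method steps.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)       # 3.4 and 60.8
print(b0 + b1 * 7)  # predicted score at 7 hours: 84.6
```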
Worked example 2: Interpreting slope inference and prediction
Problem: A regression of monthly electricity cost on average daily temperature uses $n = 40$ months. The fitted slope is $b_1 = 2.80$ dollars per degree, with $\text{SE}(b_1) = 0.90$. Test whether the population slope differs from 0 at $\alpha = 0.05$, and construct a 95% confidence interval.
Method:
- State hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.
- Degrees of freedom: $df = n - 2 = 38$.
- Test statistic: $t = 2.80 / 0.90 \approx 3.11$.
- A two-sided p-value for $t = 3.11$ with 38 degrees of freedom is about 0.0035.
- Since $0.0035 < 0.05$, reject $H_0$.
- For a 95% interval with $df = 38$, $t^{*} \approx 2.024$.
- Margin of error: $2.024 \times 0.90 \approx 1.82$.
- Interval: $2.80 \pm 1.82$, or about $(0.98, 4.62)$.
Answer: There is statistically significant evidence of a nonzero linear slope. The estimated cost increase is $2.80 per degree, with a 95% confidence interval from about $0.98 to $4.62 per degree.
Checked answer: The interval does not include 0, agreeing with the two-sided test at $\alpha = 0.05$. The conclusion is about association and prediction unless the data came from a design supporting causation.
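A short scipy check of these steps, using the summary values from the problem:

```python
from scipy import stats

# Summary values from worked example 2.
n, b1, se_b1 = 40, 2.80, 0.90
df = n - 2

# Two-sided test of H0: beta1 = 0.
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df)

# 95% confidence interval for the slope.
t_crit = stats.t.ppf(0.975, df)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(round(t_stat, 2), round(p_value, 4))  # about 3.11 and 0.0035
print(tuple(round(v, 2) for v in ci))       # about (0.98, 4.62)
```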
Code
```python
import pandas as pd
import statsmodels.api as sm

# Data from worked example 1.
df = pd.DataFrame({
    "hours": [2, 3, 5, 6, 8, 9],
    "score": [68, 70, 78, 82, 88, 91],
})

# Add an intercept column and fit the least-squares line.
X = sm.add_constant(df["hours"])
model = sm.OLS(df["score"], X).fit()
print(model.summary())

# Predict at 7 hours; has_constant="add" keeps the intercept column.
new_X = sm.add_constant(pd.DataFrame({"hours": [7]}), has_constant="add")
prediction = model.get_prediction(new_X)
print(prediction.summary_frame(alpha=0.05))
```
The summary includes slope, intercept, standard errors, p-values, and $R^2$. The prediction frame includes both a confidence interval for the mean score at 7 hours and a prediction interval for an individual student.
Common pitfalls
- Interpreting the intercept when $x = 0$ is outside the observed range or meaningless.
- Treating correlation and regression slope as the same quantity.
- Extrapolating beyond the data range because the fitted line has an equation.
- Ignoring residual plots and relying only on $R^2$.
- Reading a statistically significant slope as proof of causation.
- Confusing confidence intervals for the mean response with prediction intervals for individuals.