Software Metrics

Software metrics try to make important software attributes measurable. Gustafson's metrics chapter starts with measurement theory, then applies it to product metrics, process metrics, and the Goal-Question-Metric approach. The chapter is careful about validation: a number is not useful merely because it can be computed. A metric must represent an attribute in a meaningful way, and the use of the metric must be justified.

This topic sits between project management and quality assurance. Managers need measures to estimate, track, and improve work; engineers need measures that signal complexity, maintainability, and test risk. The chapter covers common size and complexity measures such as LOC, McCabe's cyclomatic number, Halstead's software science measures, Henry-Kafura information flow, productivity, and GQM.

Definitions

Software measurement is the mapping of symbols or numbers to software objects or processes so that some attribute can be quantified. Examples include mapping a module to its lines of code, mapping a control flow graph to a cyclomatic complexity number, or mapping a process to defects found per reviewer-hour.

A metric is a defined measurement used for a purpose. The purpose matters. LOC may be useful for estimating effort in a calibrated environment, but it is a weak measure of user value. A metric should be evaluated against the decision it supports.

Measurement validation asks whether the metric meaningfully represents the attribute of interest. A correlated number is not automatically valid. Shoe size may correlate with height in a population, but it is not a valid measure of height because it does not directly represent the length attribute.

Monotonicity means that increasing the empirical attribute should not decrease the measured value. For a size metric, adding more relevant code should not make the measured size smaller. Nonmonotonic metrics are hard to interpret and easy to manipulate.

Measurement scales describe what operations are meaningful:

| Scale | Meaningful statements | Software example |
| --- | --- | --- |
| Nominal | equality or category | defect type |
| Ordinal | ranking | severity level |
| Interval | differences | cyclomatic complexity differences |
| Ratio | ratios and zero point | LOC, elapsed time |
| Absolute | direct count | number of modules |

McCabe's cyclomatic number measures control-flow complexity. For a control flow graph:

C = e - n + 2p

where e is the number of edges, n is the number of nodes, and p is the number of connected components, normally 1 for a single routine. For many structured programs, C is also one more than the number of decisions.
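The graph formula can be sketched directly from an edge list. This is a minimal illustration, not code from the chapter; the diamond-shaped graph below is a hypothetical single if/else routine.

```python
def cyclomatic(edges, num_components=1):
    """McCabe's C = e - n + 2p for a control flow graph given as (src, dst) pairs."""
    nodes = {v for edge in edges for v in edge}
    return len(edges) - len(nodes) + 2 * num_components

# One if/else: entry branches to then/else, both rejoin at exit.
edges = [("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit")]
print(cyclomatic(edges))  # 4 edges - 4 nodes + 2 = 2, i.e. one decision plus 1
```

The result, 2, agrees with the decision-count shortcut: one decision plus one.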

Halstead's measures treat a program as operators and operands. Let n_1 be the number of distinct operators, n_2 the number of distinct operands, N_1 the total operator occurrences, and N_2 the total operand occurrences. Vocabulary is n = n_1 + n_2, length is N = N_1 + N_2, and volume is:

V = N log2(n)

Henry-Kafura information flow measures intermodule complexity using information flowing into and out of a module:

HK_i = weight_i × (out_i × in_i)^2

where the weight may be a size or complexity measure.
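As a minimal sketch of the formula, the helper below takes a weight (here, LOC) and flow counts; the module names and numbers are illustrative, not from the chapter.

```python
def henry_kafura(weight, fan_in, fan_out):
    """Henry-Kafura: weight times the squared product of in- and out-flows."""
    return weight * (fan_in * fan_out) ** 2

# (LOC, fan-in, fan-out) per module; values are hypothetical.
modules = {"parser": (120, 2, 3), "dispatcher": (40, 6, 5), "logger": (25, 8, 1)}
for name, (loc, fan_in, fan_out) in modules.items():
    print(name, henry_kafura(loc, fan_in, fan_out))
```

Note that the small 40-line dispatcher scores 36000, far above the larger parser's 4320: heavy information flow dominates raw size.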

GQM (Goal-Question-Metric) starts with goals, derives questions about those goals, and only then chooses metrics.

Key results

Metrics should be selected from goals, not collected because they are easy. LOC, defects, effort, and coverage can all be useful, but each can also distort behavior. If a team is rewarded for high LOC, it may write more code than necessary. If it is rewarded for low defect counts, it may underreport defects. A measurement program should include interpretation rules and quality checks.

McCabe's cyclomatic number is useful because it connects control-flow structure to test and maintenance difficulty. The common threshold of 10 is not a mathematical law, but it is a practical warning point: modules above that value deserve review, refactoring, or stronger tests. The decision-count method is valuable because building a full control flow graph for large code is expensive.

Halstead's basic counts remain historically important because they focus on tokens and algorithm expression. Gustafson notes that some of Halstead's more elaborate prediction formulas are questionable, especially when they are not monotonic. The lesson is broader than Halstead: a metric can be easy to calculate and still be invalid for the decision being made.

Henry-Kafura emphasizes coupling through information flow. A small module with many incoming and outgoing flows can be harder to understand than its length suggests. Multiplying input and output flow and then squaring the product penalizes modules that sit at busy communication intersections.

Process metrics measure how work is being performed. Examples include defects per KLOC, review defects found per reviewer-hour, mean time to repair, build failure rate, and requirements volatility. These are not product attributes alone; they describe process behavior and can guide process improvement.
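Two of the example process metrics can be computed from raw counts as below; the figures are made up for illustration.

```python
def defects_per_kloc(defects, loc):
    """Defect density normalized to thousands of lines of code."""
    return defects / (loc / 1000)

def review_yield(defects_found, reviewer_hours):
    """Review defects found per reviewer-hour."""
    return defects_found / reviewer_hours

print(defects_per_kloc(18, 12_000))  # 1.5 defects per KLOC
print(review_yield(9, 6.0))          # 1.5 defects per reviewer-hour
```

As the text warns, both numbers depend on detection effort: fewer reviewer-hours usually means fewer defects found, not fewer defects present.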

GQM is a guardrail against metric clutter. A goal such as "improve customer satisfaction" leads to questions such as "Are customers reporting fewer defects?" and "Are fixes delivered faster?" Those questions lead to metrics such as customer defect reports, reopened defects, and median time to resolution. Without the goal and question layers, the team may collect numbers that no one uses.
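The goal-question-metric derivation above can be sketched as a small traceability structure; the goal, questions, and metric names mirror the example in the text.

```python
gqm = {
    "goal": "improve customer satisfaction",
    "questions": {
        "Are customers reporting fewer defects?": [
            "customer defect reports",
            "reopened defects",
        ],
        "Are fixes delivered faster?": ["median time to resolution"],
    },
}

# Every metric traces to a question, and every question to the goal.
for question, metrics in gqm["questions"].items():
    for metric in metrics:
        print(f"{gqm['goal']} <- {question} <- {metric}")
```

Any collected number that cannot be placed in such a chain is a candidate for the "unused metrics" bin.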

Visual

| Metric | What it emphasizes | Strength | Weakness |
| --- | --- | --- | --- |
| LOC | physical size | easy to count | language and style dependent |
| Cyclomatic complexity | control decisions | supports testing and maintainability review | ignores data complexity |
| Halstead volume | token vocabulary and length | language-level expression measure | counting rules can vary |
| Henry-Kafura | intermodule information flow | highlights coupling hotspots | requires flow identification |
| Defects/KLOC | observed defect density | useful for trend comparison | depends on detection effort |
| GQM metrics | goal-driven measurement | avoids collecting unused numbers | requires careful goal definition |

Worked example 1: Cyclomatic complexity from decisions

Problem. A function has this logic: if the input is missing, return an error; otherwise loop over records; inside the loop, if the record is active, process it; after the loop, if no active record was found, return a warning. Estimate McCabe's cyclomatic number using the decision-count method.

Method. Count decisions that split control flow.

  1. The first if input is missing is one decision.

  2. The loop over records is one decision because each loop has a continue/exit choice.

  3. The inner if record is active is one decision.

  4. The final if no active record was found is one decision.

  5. For structured code, cyclomatic complexity is:

     C = decisions + 1

  6. Substitute:

     C = 4 + 1 = 5

Checked answer. The cyclomatic number is 5. This means at least five independent paths are needed for basis-path style structural testing. The answer is checked by listing the four branch points: missing input, loop continuation, active record, and no-active-record warning.
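The decision-count shortcut used above can be sketched in a few lines; the decision labels are those of the worked example.

```python
def cyclomatic_from_decisions(decisions):
    """Decision-count approximation for structured code: C = decisions + 1."""
    return len(decisions) + 1

decisions = [
    "input is missing",          # initial guard
    "loop over records",         # continue/exit choice
    "record is active",          # inner branch
    "no active record found",    # post-loop warning
]
print(cyclomatic_from_decisions(decisions))  # 5
```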

Worked example 2: Halstead volume for a small expression

Problem. Compute Halstead vocabulary, length, and volume for the expression total = price * quantity + tax using a simple counting rule. Operators are =, *, and +. Operands are total, price, quantity, and tax.

Method. Count distinct and total operators and operands.

  1. Distinct operators:

     n_1 = 3

  2. Distinct operands:

     n_2 = 4

  3. Total operator occurrences: =, *, + each appear once.

     N_1 = 3

  4. Total operand occurrences: total, price, quantity, tax each appear once.

     N_2 = 4

  5. Vocabulary:

     n = n_1 + n_2 = 3 + 4 = 7

  6. Length:

     N = N_1 + N_2 = 3 + 4 = 7

  7. Volume:

     V = N log2(n) = 7 log2(7) ≈ 7 × 2.807 ≈ 19.65

Checked answer. The expression has vocabulary 7, length 7, and volume about 19.65. The check is that every token was classified exactly once under the chosen rule. Different organizations may count syntax differently, so the counting rule must be stated before comparing values.

Code

import math
import re

def halstead_basic(expression, operator_tokens):
    """Count distinct/total operators and operands, then compute Halstead volume."""
    # Split on the operator tokens; the capturing group keeps the operators.
    pattern = "(" + "|".join(re.escape(op) for op in operator_tokens) + ")"
    raw = [tok for tok in re.split(pattern, expression) if tok and not tok.isspace()]
    operators = []
    operands = []
    for token in raw:
        token = token.strip()
        if not token:
            continue
        if token in operator_tokens:
            operators.append(token)
        else:
            # A non-operator chunk may hold several whitespace-separated operands.
            operands.extend(token.split())

    n1 = len(set(operators))
    n2 = len(set(operands))
    n = n1 + n2
    length = len(operators) + len(operands)
    volume = length * math.log2(n) if n else 0
    return {"n1": n1, "n2": n2, "N": length, "vocabulary": n, "volume": volume}

metrics = halstead_basic("total = price * quantity + tax", ["=", "*", "+"])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")

Common pitfalls

  • Collecting metrics before stating the decision they support.
  • Comparing LOC across languages, teams, or generated-code policies without normalization.
  • Treating a threshold such as cyclomatic complexity 10 as a substitute for engineering judgment.
  • Counting Halstead operators and operands inconsistently, then comparing results as if they used the same rule.
  • Assuming correlation proves validity. A metric must represent the attribute being measured.
  • Optimizing the metric rather than improving the process or product.
  • Forgetting that process metrics depend on detection effort and reporting culture.

Connections