Quantitative Design and Performance
Computer architecture is not mainly a catalog of clever hardware structures. In the Hennessy and Patterson style, it is an engineering discipline built around measurement, trade-offs, and explicit assumptions. A proposed pipeline, cache, instruction set, or multiprocessor mechanism is only meaningful after asking: what workload does it improve, by how much, at what cost, and under which power or reliability limits?
This page introduces the quantitative habits used throughout the rest of the architecture notes. The same ideas reappear in cache design, branch prediction, vector machines, warehouse-scale computers, and storage systems: define the metric, decompose execution time, estimate the affected fraction, and distrust improvements that optimize a narrow component while hurting the complete system.
Definitions
The most important performance metric is execution time. If machine $A$ finishes a fixed program in less time than machine $B$, then $A$ is faster for that program. Performance is the reciprocal of execution time:

$$\text{Performance} = \frac{1}{\text{Execution time}}$$
The speedup of a new design over an old design is:

$$\text{Speedup} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{\text{Performance}_{\text{new}}}{\text{Performance}_{\text{old}}}$$
CPU execution time is commonly decomposed as:

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$
Equivalently, because $\text{Clock cycle time} = 1 / \text{Clock rate}$:

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$$
For an instruction mix with classes $i = 1, \dots, n$, total CPI is a weighted average:

$$\text{CPI} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i$$
where $\text{IC}_i$ is the number of executed instructions of class $i$. This formula is useful because it makes a hidden workload assumption explicit: a faster floating-point unit helps only in proportion to how much floating-point work is actually executed.
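The weighted average is a one-line computation. The two-class mix below is hypothetical, chosen only to show how a high-CPI class pulls the average up in proportion to its share:

```python
# Hypothetical mix: 60% ALU ops at CPI 1, 40% memory ops at CPI 2.
mix = [(0.60, 1.0), (0.40, 2.0)]  # (fraction of instructions, CPI of that class)

# Weighted average CPI: each class contributes fraction * CPI.
avg_cpi = sum(frac * class_cpi for frac, class_cpi in mix)
print(avg_cpi)  # 0.60*1 + 0.40*2 = 1.4
```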
Benchmarks are controlled workloads used to compare systems. A good benchmark suite includes real programs, reproducible build and run conditions, input sets large enough to exercise the system, and reporting rules that prevent benchmark-specific tricks from replacing genuine design improvement. Throughput benchmarks measure completed jobs per unit time; latency benchmarks measure time for one job.
The geometric mean is often used to summarize normalized benchmark ratios:

$$\text{Geomean} = \left( \prod_{i=1}^{n} \text{ratio}_i \right)^{1/n}$$
It is preferred for ratios because it treats reciprocal changes symmetrically: a $2\times$ improvement and a $2\times$ slowdown average to $1\times$, not to an arithmetic mean of $1.25\times$.
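The symmetry claim is easy to verify directly. The two-benchmark suite below (one ratio of 2.0, one of 0.5) is an illustrative extreme case:

```python
from math import prod

ratios = [2.0, 0.5]  # a 2x improvement on one benchmark, a 2x slowdown on another

geo = prod(ratios) ** (1.0 / len(ratios))   # geometric mean
arith = sum(ratios) / len(ratios)           # arithmetic mean, for contrast

print(geo)    # 1.0  -- the reciprocal changes cancel
print(arith)  # 1.25 -- the arithmetic mean rewards the slowdown pair
```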
Key results
Amdahl's law is the central warning against over-valuing local optimization. If fraction $f$ of the original execution time can be improved by factor $s$, the overall speedup is:

$$\text{Speedup} = \frac{1}{(1 - f) + f/s}$$
The proof is a direct normalization. Let old execution time be $T$. The unaffected part remains $(1 - f)T$. The improved part becomes $fT/s$. The new time is $T\left[(1 - f) + f/s\right]$, so speedup is the reciprocal of the bracketed factor.
The same idea applies to parallel processors. If fraction $f$ can run perfectly on $N$ processors and the rest is serial, then:

$$\text{Speedup}(N) = \frac{1}{(1 - f) + f/N}$$
As $N \to \infty$, the limiting speedup is $1/(1 - f)$. A program that is $95\%$ parallel has an upper bound of $20\times$ no matter how many processors are purchased, unless the serial fraction is reduced.
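The asymptote is visible numerically. The sketch below assumes a hypothetical 95% parallel fraction and sweeps the processor count:

```python
def amdahl_parallel(f, n):
    """Speedup when fraction f of the time runs perfectly on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

f = 0.95  # hypothetical parallel fraction
for n in (8, 64, 1024):
    print(n, round(amdahl_parallel(f, n), 2))  # climbs toward, never reaches, the limit
print("limit:", round(1.0 / (1.0 - f), 2))     # 20.0 -- set entirely by the 5% serial part
```

Going from 64 to 1024 processors buys only a few more units of speedup; shrinking the serial fraction moves the whole ceiling.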
The CPU time equation also explains why comparing only clock rates is a pitfall. A design with a higher clock rate can still lose if it executes more instructions or has a worse CPI. A compressed instruction set may reduce instruction count and instruction-cache misses but complicate decoding. A deeper pipeline may shorten clock cycle time but increase branch misprediction penalties. A wider issue processor may reduce CPI on one benchmark and waste power on another.
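The clock-rate pitfall can be checked with the CPU time equation directly. The two designs below are hypothetical, with the instruction count held fixed:

```python
# Same program, same instruction count, two hypothetical designs.
ic = 1_000_000_000

# Design A: faster clock (4.0 GHz) but worse CPI (2.0).
time_a = ic * 2.0 / 4.0e9   # 0.50 s

# Design B: slower clock (3.0 GHz) but better CPI (1.2).
time_b = ic * 1.2 / 3.0e9   # 0.40 s

print(time_b < time_a)  # True: the design with the lower clock rate wins on CPU time
```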
A useful performance argument therefore has three parts:
- Workload: the exact program or benchmark family being optimized.
- Decomposition: the pieces of time, CPI, miss penalty, or stall time that explain the result.
- Constraint: cost, energy, die area, complexity, verification risk, and availability.
The decomposition step is also how architects avoid double counting. Suppose a branch predictor reduces branch stalls, and a larger instruction window also reduces the visible cost of some branch stalls by finding independent work. Adding the two isolated speedups will overstate the combined result, because both features attack some of the same baseline cycles. A safer method is to build a single baseline time budget, subtract or scale each component once, and then recompute total time.
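A single-baseline budget makes the overstatement concrete. All numbers below are hypothetical, including the assumed overlap between the two features:

```python
# Hypothetical baseline budget: 10e9 total cycles, 2e9 of them branch stalls.
total = 10e9
branch_stalls = 2e9

# Assume the predictor removes 50% of branch stalls, the larger window hides 30%,
# and 0.4e9 of those saved cycles are the same cycles (the overlap).
pred_saved = 0.5 * branch_stalls     # 1.0e9
window_saved = 0.3 * branch_stalls   # 0.6e9
overlap = 0.4e9

# Naive method: subtract each feature's savings independently (double counts).
naive = total / (total - pred_saved - window_saved)

# Single-budget method: count the overlapping cycles only once.
honest = total / (total - (pred_saved + window_saved - overlap))

print(round(naive, 3), round(honest, 3))  # the naive estimate is the larger one
```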
Another useful habit is to distinguish latency-oriented and throughput-oriented claims. A faster core may reduce the response time of one request, while more cores may increase requests per second. These are both performance improvements, but they answer different questions. In a web service, the better design may be the one that delivers more throughput at a fixed 99th percentile latency. In an interactive application, the better design may be the one that lowers one user's response time even if peak throughput is unchanged.
Finally, measurements should be reproducible. A benchmark report should describe hardware, compiler version, flags, memory size, operating system, input set, thermal settings, and whether turbo or automatic parallelization was enabled. Without those details, a number may be true for one run but not useful for engineering comparison.
Visual
| Metric | Formula | Best use | Common trap |
|---|---|---|---|
| Execution time | measured seconds | User-visible latency | Ignoring I/O and OS effects |
| CPU time | $\text{IC} \times \text{CPI} \times \text{cycle time}$ | Processor-focused studies | Treating it as wall-clock time |
| Throughput | jobs per second | Servers and WSCs | Ignoring response-time limits |
| Speedup | $T_{\text{old}} / T_{\text{new}}$ | Comparing two designs | Reporting only the optimized component |
| Geomean ratio | $\left( \prod_i \text{ratio}_i \right)^{1/n}$ | Benchmark-suite summaries | Mixing unrelated baselines |
Worked example 1: CPI from an instruction mix
Problem: A program executes 2.0 billion instructions on a 2.5 GHz processor. The instruction mix is 45% integer ALU at CPI 1, 25% loads/stores at CPI 2, 20% branches at CPI 1.5, and 10% floating-point at CPI 4. Find average CPI and CPU time.
Method:
- Convert percentages into weights.
- Compute weighted CPI.
- Compute total cycles.
- Divide by clock rate.
Checked answer: The average CPI is $0.45 \times 1 + 0.25 \times 2 + 0.20 \times 1.5 + 0.10 \times 4 = 1.65$, and the CPU time is $(2.0 \times 10^9 \times 1.65) / (2.5 \times 10^9) = 1.32$ seconds. A quick sanity check says the time should be slightly more than one second because the program needs 3.3 billion cycles and the processor supplies 2.5 billion cycles per second.
Worked example 2: Amdahl's law with a faster floating-point unit
Problem: In a baseline run, floating-point operations account for 30% of execution time. A new floating-point unit makes that part $5\times$ faster, but it does not change the rest of the program. What is the overall speedup? What fraction of the new execution time is floating-point?
Method:
- Normalize old execution time to .
- Compute new time.
- Compute speedup.
- Compute the new floating-point fraction.
Checked answer: The new time is $0.70 + 0.30/5 = 0.76$, so the overall speedup is about $1.32\times$, not $5\times$. After the change, floating-point work is only about $0.06/0.76 \approx 7.9\%$ of execution time, so further floating-point acceleration would have much less value.
Code
```python
from math import prod

def weighted_cpi(classes):
    """Average CPI for a list of (instruction_count, cpi) pairs."""
    total_ic = sum(ic for ic, _ in classes)
    total_cycles = sum(ic * cpi for ic, cpi in classes)
    return total_cycles / total_ic

def amdahl(fraction, local_speedup):
    """Overall speedup when `fraction` of time is improved by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def geomean(ratios):
    """Geometric mean of normalized benchmark ratios."""
    return prod(ratios) ** (1.0 / len(ratios))

# Worked example 1: 2.0 billion instructions on a 2.5 GHz processor.
mix = [
    (0.45 * 2_000_000_000, 1.0),  # integer ALU
    (0.25 * 2_000_000_000, 2.0),  # loads/stores
    (0.20 * 2_000_000_000, 1.5),  # branches
    (0.10 * 2_000_000_000, 4.0),  # floating-point
]
cpi = weighted_cpi(mix)
cpu_time = 2_000_000_000 * cpi / 2_500_000_000
print(f"CPI = {cpi:.2f}, CPU time = {cpu_time:.2f} s")

# Worked example 2: 30% of execution time sped up by 5x.
print(f"Amdahl speedup = {amdahl(0.30, 5):.3f}x")

print(f"Geomean = {geomean([1.2, 0.9, 1.5, 1.1]):.3f}x")
```
Common pitfalls
- Optimizing a component and reporting the component speedup as if it were system speedup.
- Comparing clock rates without comparing instruction count and CPI.
- Using arithmetic means for normalized benchmark ratios.
- Forgetting that benchmark inputs can change cache behavior and branch behavior.
- Treating peak throughput as useful throughput when response-time constraints are violated.
- Ignoring the energy, cost, and verification cost of a performance feature.