
Quantum Machine Learning

Quantum machine learning studies whether quantum circuits can improve learning, inference, optimization, or data analysis. The honest view is mixed: quantum kernels, parametrized circuits, QAOA-style optimization, and fault-tolerant quantum linear-algebra subroutines are mathematically rich, but broad practical advantage over strong classical machine learning and deep learning baselines is not established.

Nielsen and Chuang do not cover QML as a separate topic. This page keeps the wiki's modern QML treatment and uses N&C-style notation from Chapters 2, 8, 11, and 12: density operators $\rho$, channels $\mathcal{E}$, POVMs, trace distance, fidelity, von Neumann entropy, and quantum information-processing resource accounting.

Definitions

A parametrized quantum circuit is a unitary family $U(x,\theta)$ depending on input data $x$ and trainable parameters $\theta$. A common supervised model prepares

$$|\psi(x,\theta)\rangle = U(x,\theta)|0^n\rangle$$

and predicts from an expectation value

$$\hat{y}(x,\theta) = \langle\psi(x,\theta)|M|\psi(x,\theta)\rangle$$

for some observable $M$. In N&C density-operator notation, this becomes

$$\hat{y}(x,\theta) = \mathrm{Tr}\!\left[M\rho(x,\theta)\right], \qquad \rho(x,\theta) = U(x,\theta)|0^n\rangle\langle 0^n|U(x,\theta)^\dagger.$$

With noise, the prediction is better written as

$$\hat{y}(x,\theta) = \mathrm{Tr}\!\left[M\,\mathcal{E}(\rho(x,\theta))\right],$$

where $\mathcal{E}(\rho) = \sum_k E_k\rho E_k^\dagger$ is a quantum operation. This notation matters: many QML claims change substantially when the ideal pure state is replaced by the noisy state actually measured on hardware.
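As a concrete illustration, the NumPy sketch below evaluates $\mathrm{Tr}[M\,\mathcal{E}(\rho(x,\theta))]$ for a one-qubit model under a depolarizing channel written in operator-sum form. The $R_y$-based circuit, the observable $M = Z$, and the parameter values are illustrative assumptions, not choices fixed by the text above.

```python
import numpy as np

# Pauli operators and the (assumed) observable M = Z
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rho_ideal(x, theta):
    # rho(x, theta) = U |0><0| U^dagger with the illustrative U = Ry(theta) Ry(x)
    u = ry(theta) @ ry(x)
    ket0 = np.array([[1], [0]], dtype=complex)
    return u @ ket0 @ ket0.conj().T @ u.conj().T

def depolarizing_kraus(p):
    # Operator-sum form of depolarizing noise: E(rho) = sum_k E_k rho E_k^dagger
    return [np.sqrt(1 - 3 * p / 4) * I,
            np.sqrt(p / 4) * X,
            np.sqrt(p / 4) * Y,
            np.sqrt(p / 4) * Z]

def apply_channel(kraus, rho):
    return sum(E @ rho @ E.conj().T for E in kraus)

x, theta, p = 0.7, 0.3, 0.05
rho = rho_ideal(x, theta)
kraus = depolarizing_kraus(p)
print("ideal Tr[M rho]    =", np.trace(Z @ rho).real)
print("noisy Tr[M E(rho)] =", np.trace(Z @ apply_channel(kraus, rho)).real)
# Trace preservation: sum_k E_k^dagger E_k = I (N&C Chapter 8 completeness condition)
print("completeness holds:", np.allclose(sum(E.conj().T @ E for E in kraus), I))
```

The completeness check mirrors the trace-preservation condition $\sum_k E_k^\dagger E_k = I$ from N&C Chapter 8; for depolarizing noise the noisy expectation is simply $(1-p)$ times the ideal one.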

A variational quantum classifier combines a feature map $U_\phi(x)$, a trainable ansatz $U_\theta$, measurement, and a classical loss. Training is hybrid: a quantum device estimates expectation values, while a classical optimizer updates $\theta$.

A quantum kernel maps data to quantum states and defines

$$K(x,x') = |\langle\phi(x)|\phi(x')\rangle|^2, \qquad |\phi(x)\rangle = U_\phi(x)|0^n\rangle.$$

Equivalently, using density operators,

$$K(x,x') = \mathrm{Tr}\!\left[\rho(x)\rho(x')\right]$$

for pure feature states. A classical kernel method such as an SVM or kernel ridge regressor can then use the estimated kernel matrix.
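A minimal sketch of this pipeline, assuming a one-qubit $R_y$ feature map (any feature map with the same structure would do):

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def feature_state(x):
    # Illustrative feature map |phi(x)> = Ry(x)|0>
    return ry(x) @ np.array([1, 0], dtype=complex)

def kernel(x1, x2):
    # K(x, x') = |<phi(x)|phi(x')>|^2, which equals Tr[rho(x) rho(x')] for pure states
    return abs(np.vdot(feature_state(x1), feature_state(x2))) ** 2

xs = np.array([0.0, 0.5, 1.0, 2.0])
K = np.array([[kernel(a, b) for b in xs] for a in xs])
print(np.round(K, 4))
```

The resulting Gram matrix can be handed directly to a classical solver; for example, scikit-learn's `SVC` accepts a precomputed kernel matrix via `kernel='precomputed'`.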

QAOA, the quantum approximate optimization algorithm, alternates cost and mixer unitaries. For a combinatorial objective encoded as Hamiltonian $C$ and a mixer $B$, a depth-$p$ QAOA state is

$$|\gamma,\beta\rangle = \prod_{\ell=1}^{p} e^{-i\beta_\ell B} e^{-i\gamma_\ell C}\,|+\rangle^{\otimes n}.$$

QAOA is not machine learning by itself, but it sits near QML because it uses parametrized circuits, measurement estimates, and classical optimization.
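The sketch below builds the depth-$p$ QAOA state for the smallest MaxCut instance, a single edge on two qubits, where $C = (I - Z\otimes Z)/2$ is diagonal and the mixer exponential factorizes over qubits. The instance and angles are illustrative; at $p = 1$ the angles $\gamma = \pi/2$, $\beta = \pi/8$ reach the optimal cut value of $1$ for this toy case, which the script confirms.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# MaxCut on a single edge: C = (I - Z (x) Z) / 2 counts cut edges and is diagonal
C = 0.5 * (np.eye(4) - np.kron(Z, Z))
cdiag = np.diag(C).real

def exp_mixer(beta):
    # B = X (x) I + I (x) X; the two terms commute, so e^{-i beta B} factorizes
    u1 = np.cos(beta) * I2 - 1j * np.sin(beta) * X
    return np.kron(u1, u1)

def qaoa_state(gammas, betas):
    psi = np.full(4, 0.5, dtype=complex)        # |+>^{(x) 2}
    for g, b in zip(gammas, betas):
        psi = np.exp(-1j * g * cdiag) * psi     # e^{-i gamma C}, diagonal phase
        psi = exp_mixer(b) @ psi                # e^{-i beta B}
    return psi

# Depth p = 1 at the optimal angles for this single-edge toy instance
psi = qaoa_state([np.pi / 2], [np.pi / 8])
print("expected cut value:", (np.abs(psi) ** 2 @ cdiag))
```

For larger instances the cost exponential stays diagonal, but the state dimension and the classical angle search grow quickly, which is exactly the depth-and-landscape risk flagged in the table below.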

A POVM measurement is a collection of positive operators $\{M_y\}$ with $\sum_y M_y = I$. In classification, one can interpret

$$p(y\mid x,\theta) = \mathrm{Tr}\!\left[M_y\,\rho(x,\theta)\right]$$

as the model's predicted class probabilities.
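A short sketch, assuming the two-outcome POVM built from the $Z$ eigenprojectors and the same illustrative one-qubit $R_y$ circuit used elsewhere on this page:

```python
import numpy as np

I = np.eye(2, dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

# Two-outcome POVM from Z eigenprojectors: M_0 + M_1 = I, both positive
M = {0: (I + Z) / 2, 1: (I - Z) / 2}

def class_probs(x, theta):
    # p(y | x, theta) = Tr[M_y rho(x, theta)]
    u = ry(theta) @ ry(x)
    ket0 = np.array([[1], [0]], dtype=complex)
    rho = u @ ket0 @ ket0.conj().T @ u.conj().T
    return {y: float(np.trace(My @ rho).real) for y, My in M.items()}

p = class_probs(x=0.4, theta=-np.pi / 2)
print(p, "sum =", sum(p.values()))
```

Because each $M_y$ is positive and the set sums to the identity, the outputs are automatically nonnegative and sum to one, so they behave as proper class probabilities.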

A barren plateau is a training regime where gradients concentrate near zero, often exponentially in the number of qubits for global costs or sufficiently random deep ansatzes:

$$\mathrm{Var}\!\left(\frac{\partial C}{\partial\theta_j}\right) \sim O(2^{-n})$$

in common idealized settings.
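The decay can be observed numerically. The sketch below estimates the gradient variance of a global cost $\langle Z^{\otimes n}\rangle$ under a randomly initialized layered $R_y$-plus-CZ ansatz; the ansatz, depth, and sample count are illustrative choices, and the rigorous statements concern ensembles forming approximate 2-designs rather than this exact circuit.

```python
import numpy as np

rng = np.random.default_rng(7)
Z = np.diag([1.0, -1.0])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]])

def kron_all(mats):
    out = np.array([[1.0]])
    for m in mats:
        out = np.kron(out, m)
    return out

def cz_chain(n):
    # Diagonal of CZ applied to each neighboring pair (qubit 0 = leftmost factor)
    d = np.ones(2 ** n)
    for q in range(n - 1):
        for idx in range(2 ** n):
            if (idx >> (n - 1 - q)) & 1 and (idx >> (n - 2 - q)) & 1:
                d[idx] *= -1
    return d

def cost(thetas, n, layers):
    # Global cost <Z...Z> after alternating Ry layers and CZ entanglers
    psi = np.zeros(2 ** n)
    psi[0] = 1.0
    ent = cz_chain(n)
    k = 0
    for _ in range(layers):
        psi = kron_all([ry(thetas[k + q]) for q in range(n)]) @ psi
        psi = ent * psi
        k += n
    zdiag = np.diag(kron_all([Z] * n))
    return (np.abs(psi) ** 2) @ zdiag

def grad_first(thetas, n, layers):
    # Parameter-shift derivative with respect to the first parameter
    tp = thetas.copy(); tp[0] += np.pi / 2
    tm = thetas.copy(); tm[0] -= np.pi / 2
    return 0.5 * (cost(tp, n, layers) - cost(tm, n, layers))

for n in range(2, 7):
    layers = 6
    samples = [grad_first(rng.uniform(0, 2 * np.pi, n * layers), n, layers)
               for _ in range(200)]
    print(f"n={n}  Var[dC/dtheta_1] ~ {np.var(samples):.2e}")
```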

The von Neumann entropy

$$S(\rho) = -\mathrm{Tr}(\rho\log\rho)$$

is not a training loss by default, but it is the N&C language for mixedness, compression, and information flow. It becomes relevant when evaluating noisy encodings, learned quantum channels, privacy leakage, or information bottleneck analogues.
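For example, the entropy of a depolarized pure state interpolates from $0$ to one bit as the noise strength grows. A small sketch, using $\log_2$ as N&C does:

```python
import numpy as np

def von_neumann_entropy(rho):
    # S(rho) = -Tr(rho log rho), computed from the eigenvalues of rho
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # 0 log 0 = 0 by convention
    return float(-(evals * np.log2(evals)).sum())

ket0 = np.array([[1], [0]], dtype=complex)
pure = ket0 @ ket0.conj().T
for p in (0.0, 0.1, 0.5, 1.0):
    rho = (1 - p) * pure + p * np.eye(2) / 2   # depolarized pure state
    print(f"p={p:.1f}  S(rho)={von_neumann_entropy(rho):.4f} bits")
```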

Key results

The parameter-shift rule gives exact gradients for many gates. If a parameter appears in

$$U_j(\theta_j) = e^{-i\theta_j P/2}$$

where $P$ has eigenvalues $\pm 1$, then for an expectation-value component $f(\theta)$,

$$\frac{\partial f}{\partial\theta_j} = \frac{1}{2}\left[f\!\left(\theta_j + \frac{\pi}{2}\right) - f\!\left(\theta_j - \frac{\pi}{2}\right)\right],$$

with other parameters held fixed. The identity is exact in the circuit model, but estimating the two shifted values on hardware introduces shot noise.
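The sketch below contrasts the exact shifted-value identity with finite-shot estimates, simulating each shifted expectation from $\pm 1$ measurement outcomes; the circuit $f(\theta) = \cos\theta$ and the shot counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_exact(theta):
    # f(theta) = <0| Ry(theta)^dag Z Ry(theta) |0> = cos(theta)
    return np.cos(theta)

def f_shots(theta, shots):
    # Each shot returns +1 with probability (1 + cos theta) / 2, else -1
    p_plus = (1 + np.cos(theta)) / 2
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()

theta = np.pi / 3
exact = 0.5 * (f_exact(theta + np.pi / 2) - f_exact(theta - np.pi / 2))
print(f"exact shift gradient: {exact:.4f}")   # equals -sin(pi/3)
for shots in (100, 1000, 10000):
    est = 0.5 * (f_shots(theta + np.pi / 2, shots)
                 - f_shots(theta - np.pi / 2, shots))
    print(f"shots={shots:5d}  estimate={est:+.4f}")
```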

Quantum kernels are valid positive semidefinite kernels because they are Hilbert-space inner products. For any coefficients $c_i$,

$$\sum_{i,j} c_i^* c_j \langle\phi(x_i)|\phi(x_j)\rangle = \left\|\sum_i c_i|\phi(x_i)\rangle\right\|^2 \ge 0.$$

For the squared-overlap kernel, positive semidefiniteness follows by viewing $\rho(x)$ as the feature vector in Hilbert-Schmidt space. A possible advantage requires both a feature map whose kernel is hard to estimate classically and a learning task that benefits from that kernel. Either condition alone is insufficient.
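Both views can be checked numerically. The sketch below builds the squared-overlap Gram matrix for an illustrative one-qubit feature map, reconstructs it as $\Phi\Phi^\dagger$ from vectorized density operators (the Hilbert-Schmidt feature vectors), and confirms the eigenvalues are nonnegative:

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rho(x):
    # Pure feature state rho(x) = |phi(x)><phi(x)| with |phi(x)> = Ry(x)|0>
    psi = ry(x) @ np.array([1, 0], dtype=complex)
    return np.outer(psi, psi.conj())

xs = np.linspace(0.0, np.pi, 6)
# Squared-overlap kernel K(x, x') = Tr[rho(x) rho(x')]
K = np.array([[np.trace(rho(a) @ rho(b)).real for b in xs] for a in xs])

# Hilbert-Schmidt view: vectorize each rho; then K = Phi Phi^dagger, manifestly PSD
Phi = np.array([rho(x).reshape(-1) for x in xs])
K_hs = (Phi @ Phi.conj().T).real
print("two constructions agree:", np.allclose(K, K_hs))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```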

Noisy QML should be expressed with channels. If the intended model is

$$\rho_\theta = U_\theta\rho_0 U_\theta^\dagger,$$

but the actual hardware implements

$$\widetilde{\rho}_\theta = \mathcal{E}_L\circ\mathcal{U}_L\circ\cdots\circ\mathcal{E}_1\circ\mathcal{U}_1(\rho_0),$$

then the learned function is not the ideal circuit plus small after-the-fact noise; it is a noisy quantum operation interleaved with the computation. N&C's Chapter 8 operator-sum language is the right notation for this.
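A one-qubit sketch of the difference, using amplitude damping between two illustrative rotation layers; the angles and damping strength are arbitrary choices made for this example:

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]], dtype=complex)

def amp_damp(rho, g):
    # Amplitude damping channel in operator-sum form (N&C Chapter 8)
    E0 = np.array([[1, 0], [0, np.sqrt(1 - g)]], dtype=complex)
    E1 = np.array([[0, np.sqrt(g)], [0, 0]], dtype=complex)
    return E0 @ rho @ E0.conj().T + E1 @ rho @ E1.conj().T

ket0 = np.array([[1], [0]], dtype=complex)
rho0 = ket0 @ ket0.conj().T
U1, U2 = ry(0.9), rx(1.3)
g = 0.15

# Interleaved: E o U2 o E o U1, noise acting inside the computation
rho_a = amp_damp(U2 @ amp_damp(U1 @ rho0 @ U1.conj().T, g) @ U2.conj().T, g)
# Terminal: ideal circuit, then the same channel applied twice at the end
rho_b = amp_damp(amp_damp(U2 @ U1 @ rho0 @ U1.conj().T @ U2.conj().T, g), g)

Z = np.array([[1, 0], [0, -1]], dtype=complex)
print("interleaved <Z>:", np.trace(Z @ rho_a).real)
print("terminal    <Z>:", np.trace(Z @ rho_b).real)
```

The two expectation values differ even though both cases apply the damping channel twice, because interleaved noise acts on intermediate states that the ideal-circuit-plus-terminal-noise model never sees.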

The trace distance and fidelity supply evaluation tools beyond accuracy. For two states $\rho,\sigma$, trace distance measures distinguishability, while fidelity measures overlap. A QML embedding that maps nearby classical examples to nearly indistinguishable states may be hard to classify; an embedding that maps every training example to nearly orthogonal states may overfit and be expensive to estimate.
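A sketch of both quantities for pairs of pure states, using the N&C definitions $D(\rho,\sigma) = \frac{1}{2}\mathrm{Tr}\,|\rho - \sigma|$ and $F(\rho,\sigma) = \mathrm{Tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}$; the example states are illustrative:

```python
import numpy as np

def trace_distance(rho, sigma):
    # D(rho, sigma) = (1/2) Tr|rho - sigma|, via eigenvalues of the difference
    evals = np.linalg.eigvalsh(rho - sigma)
    return 0.5 * np.abs(evals).sum()

def fidelity(rho, sigma):
    # F(rho, sigma) = Tr sqrt(sqrt(rho) sigma sqrt(rho)), via eigendecomposition
    w, v = np.linalg.eigh(rho)
    sqrt_rho = (v * np.sqrt(np.clip(w, 0, None))) @ v.conj().T
    w2 = np.linalg.eigvalsh(sqrt_rho @ sigma @ sqrt_rho)
    return np.sqrt(np.clip(w2, 0, None)).sum()

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

ket0 = np.array([1, 0], dtype=complex)
for angle in (0.1, np.pi / 2, np.pi):
    psi = ry(angle) @ ket0
    rho = np.outer(ket0, ket0.conj())
    sigma = np.outer(psi, psi.conj())
    print(f"angle={angle:.2f}  D={trace_distance(rho, sigma):.4f}  "
          f"F={fidelity(rho, sigma):.4f}")
```

For pure states the printed values satisfy $D = \sqrt{1 - F^2}$: nearby states give small $D$ and $F$ near $1$, orthogonal states give $D = 1$ and $F = 0$.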

Generalization is a statistical question, not a quantum slogan. A high-dimensional Hilbert space can make training data separable, but useful learning requires an inductive bias aligned with the data distribution. The same discipline used in classical learning still applies: train/test splits, hyperparameter control, baseline strength, sample complexity, and uncertainty estimates.

NISQ QML and fault-tolerant QML should be separated. NISQ QML uses shallow circuits and accepts device noise as part of the training environment. Fault-tolerant QML could use deeper subroutines such as amplitude estimation, phase estimation, block-encoding, Hamiltonian simulation, or HHL-like linear algebra. Those subroutines are closer to quantum algorithms and require the logical qubits supplied by quantum error correction.

Visual

The diagram separates three QML workflows that often get blurred together. VQE and QAOA are variational loops with explicit ansatz layers, measurements, noise, and optimizer feedback, while the kernel circuit estimates state overlaps by applying one feature map followed by the inverse of another. The labeled shapes show where classical tensors enter, where shot data leaves the device, and where parameters return through the dotted feedback arrows.

| QML approach | Quantum object | N&C notation that clarifies it | Main risk |
| --- | --- | --- | --- |
| Variational classifier | $\rho(x,\theta)$ and observables | $\mathrm{Tr}(M\rho)$, POVMs | Noise, barren plateaus, weak baselines |
| Quantum kernel | State overlaps | Fidelity and Hilbert-Schmidt inner product | Kernel may be classically easy or uninformative |
| QAOA-style optimizer | Alternating unitaries | Hamiltonian evolution and expectation values | Depth, landscape, and shot cost |
| Noisy training | Interleaved channels | Kraus maps and process tomography | Learned model differs from ideal circuit |
| Fault-tolerant QML | Algorithmic subroutines | Phase estimation, amplitude estimation, entropy | Input/output assumptions dominate |

Worked example 1: Parameter-shift gradient for one qubit

Problem. Let

$$f(\theta) = \langle 0|R_y(\theta)^\dagger Z R_y(\theta)|0\rangle.$$

Compute $f(\theta)$ and verify the parameter-shift gradient at $\theta = \pi/3$.

Method.

  1. Apply the rotation: $R_y(\theta)|0\rangle = \cos(\theta/2)\,|0\rangle + \sin(\theta/2)\,|1\rangle.$
  2. The $Z$ expectation is the probability of outcome $0$ minus the probability of outcome $1$: $f(\theta) = \cos^2(\theta/2) - \sin^2(\theta/2) = \cos\theta.$
  3. Differentiate analytically: $f'(\theta) = -\sin\theta$, so at $\theta = \pi/3$, $f'(\pi/3) = -\sin(\pi/3) = -\frac{\sqrt{3}}{2}.$
  4. Apply the parameter-shift rule: $\frac{1}{2}\left[f\left(\frac{\pi}{3} + \frac{\pi}{2}\right) - f\left(\frac{\pi}{3} - \frac{\pi}{2}\right)\right].$
  5. Evaluate the two shifted terms: $f(5\pi/6) = \cos(5\pi/6) = -\frac{\sqrt{3}}{2}$ and $f(-\pi/6) = \cos(-\pi/6) = \frac{\sqrt{3}}{2}.$
  6. Subtract and divide by $2$: $\frac{1}{2}\left(-\frac{\sqrt{3}}{2} - \frac{\sqrt{3}}{2}\right) = -\frac{\sqrt{3}}{2}.$

Answer. The parameter-shift estimate equals the analytic derivative, $-\sqrt{3}/2$. The checked condition is that $R_y(\theta)$ has a Pauli generator with the required two-eigenvalue spectrum.
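A few lines of NumPy confirm the worked example, comparing the parameter-shift value against a central finite difference and the analytic derivative:

```python
import numpy as np

def f(theta):
    # f(theta) = <0| Ry(theta)^dag Z Ry(theta) |0>, built from the actual matrices
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    Ry = np.array([[c, -s], [s, c]])
    Z = np.diag([1.0, -1.0])
    ket = Ry @ np.array([1.0, 0.0])
    return ket @ Z @ ket

theta = np.pi / 3
shift = 0.5 * (f(theta + np.pi / 2) - f(theta - np.pi / 2))
eps = 1e-6
fd = (f(theta + eps) - f(theta - eps)) / (2 * eps)
print(f"parameter-shift: {shift:.6f}")   # -sqrt(3)/2 up to float error
print(f"finite diff    : {fd:.6f}")
print(f"analytic       : {-np.sin(theta):.6f}")
```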

Worked example 2: A two-point quantum kernel

Problem. Use the one-qubit feature map

$$|\phi(x)\rangle = R_y(x)|0\rangle$$

and compute the kernel value $K(0,\pi/2)$.

Method.

  1. For $x = 0$: $|\phi(0)\rangle = |0\rangle.$
  2. For $x = \pi/2$: $|\phi(\pi/2)\rangle = \cos(\pi/4)\,|0\rangle + \sin(\pi/4)\,|1\rangle = \frac{1}{\sqrt{2}}\left(|0\rangle + |1\rangle\right).$
  3. Compute the inner product: $\langle\phi(0)|\phi(\pi/2)\rangle = \frac{1}{\sqrt{2}}.$
  4. Square the magnitude: $K(0,\pi/2) = \left|\frac{1}{\sqrt{2}}\right|^2 = \frac{1}{2}.$
  5. Check with density operators: since $\rho_0 = |0\rangle\langle 0|$ and $\rho_{\pi/2} = |\phi(\pi/2)\rangle\langle\phi(\pi/2)|$, $\mathrm{Tr}(\rho_0\,\rho_{\pi/2}) = |\langle 0|\phi(\pi/2)\rangle|^2 = \frac{1}{2}.$

Answer. The kernel value is $1/2$. The states are neither identical nor orthogonal, so the kernel lies strictly between $0$ and $1$.
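The same two routes can be checked in NumPy; both the state-vector overlap and the density-operator trace give $1/2$:

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

ket0 = np.array([1, 0], dtype=complex)
phi0 = ry(0.0) @ ket0            # |phi(0)>
phi1 = ry(np.pi / 2) @ ket0      # |phi(pi/2)>

# State-vector route: K = |<phi(0)|phi(pi/2)>|^2
k_vec = abs(np.vdot(phi0, phi1)) ** 2
# Density-operator route: K = Tr[rho(0) rho(pi/2)]
rho0 = np.outer(phi0, phi0.conj())
rho1 = np.outer(phi1, phi1.conj())
k_rho = np.trace(rho0 @ rho1).real
print(k_vec, k_rho)   # both 0.5
```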

Code

This NumPy example trains a one-qubit variational classifier using parameter-shift gradients (combined with the chain rule for the squared loss) and evaluates the noisy prediction using a simple depolarizing channel in density-matrix notation.

```python
import numpy as np

I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ZERO = np.array([[1.0], [0.0]], dtype=complex)
RHO0 = ZERO @ ZERO.conj().T

def ry(theta):
    c = np.cos(theta / 2)
    s = np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def depolarize(rho, p):
    # Depolarizing channel: E(rho) = (1 - p) rho + p I / 2
    return (1 - p) * rho + p * I / 2

def prediction(x, theta, noise=0.0):
    # Encode x with Ry(x), apply the trainable Ry(theta), then measure <Z>
    u = ry(theta) @ ry(x)
    rho = u @ RHO0 @ u.conj().T
    rho = depolarize(rho, noise)
    return float(np.real(np.trace(Z @ rho)))

def loss(xs, ys, theta, noise=0.0):
    return np.mean([(prediction(x, theta, noise) - y) ** 2 for x, y in zip(xs, ys)])

def gradient(xs, ys, theta, noise=0.0):
    # The parameter-shift rule applies to the expectation value, not to the
    # squared loss, so combine it with the chain rule:
    # dL/dtheta = mean of 2 (yhat - y) dyhat/dtheta.
    terms = []
    for x, y in zip(xs, ys):
        yhat = prediction(x, theta, noise)
        dyhat = 0.5 * (prediction(x, theta + np.pi / 2, noise)
                       - prediction(x, theta - np.pi / 2, noise))
        terms.append(2.0 * (yhat - y) * dyhat)
    return float(np.mean(terms))

xs = np.linspace(-1.0, 1.0, 9)
ys = np.where(xs >= 0, 1.0, -1.0)
theta = 0.0

for _ in range(50):
    theta -= 0.2 * gradient(xs, ys, theta, noise=0.03)

print(f"theta={theta:.3f} noisy_loss={loss(xs, ys, theta, noise=0.03):.3f}")
for x in xs:
    print(f"x={x:+.2f} prediction={prediction(x, theta, noise=0.03):+.3f}")
```

Common pitfalls

  • Assuming QML means automatic speedup. A quantum model must beat classical baselines under the same data, tuning, and resource accounting.
  • Ignoring data encoding. Loading a large classical dataset into amplitudes can cost more than the intended speedup saves.
  • Treating an ideal circuit as the implemented model. Hardware realizes noisy channels, not exact unitaries.
  • Reporting training accuracy without generalization. A circuit can fit a small dataset without providing useful inductive bias.
  • Overusing the phrase "quantum neural network." The circuit architecture, loss, observables, and data map matter more than the analogy.
  • Neglecting shot noise. Gradients and losses estimated from finite measurements have variance.
  • Choosing overly expressive ansatzes. Random deep circuits can suffer barren plateaus and become trainability failures.
  • Treating separability as generalization. A feature map that separates the training set may still fail on unseen data.
  • Comparing against weak classical baselines. Kernel methods, tensor networks, randomized features, and modern neural networks are serious competitors.
  • Hiding optimizer cost. Many QML experiments spend substantial classical time on tuning, restarts, and learning-rate choices.

Connections

  • Quantum algorithms supplies phase estimation, amplitude amplification, HHL-style linear algebra, and oracle models.
  • Quantum hardware determines circuit depth, noise channels, measurement budget, and connectivity.
  • Quantum error correction separates NISQ QML from future fault-tolerant QML.
  • Machine learning provides kernels, generalization, optimization, model selection, and baseline discipline.
  • Deep learning is the natural benchmark for claims involving high-dimensional data and learned representations.
  • Linear algebra supplies Hilbert spaces, kernels, eigensystems, matrix conditioning, and tensor products.
  • Quantum communication connects when QML models process distributed quantum states or learned channels.
  • Quantum mechanics supplies density operators, observables, measurement, and open-system language.

Further reading

  • Michael A. Nielsen and Isaac L. Chuang, Quantum Computation and Quantum Information, Chapters 2, 8, 11, and 12 for the notation used here.
  • Maria Schuld and Francesco Petruccione, Supervised Learning with Quantum Computers (2018).
  • Jacob Biamonte et al., "Quantum machine learning" (2017), a review of the field.
  • Edward Farhi, Jeffrey Goldstone, and Sam Gutmann, "A Quantum Approximate Optimization Algorithm" (2014).
  • Jarrod McClean et al., "Barren plateaus in quantum neural network training landscapes" (2018).
  • Maria Schuld, Ryan Sweke, and Johannes Meyer, "Effect of data encoding on the expressive power of variational quantum machine learning models" (2021).
  • Vojtěch Havlíček et al., "Supervised learning with quantum-enhanced feature spaces" (2019).
  • John Preskill, "Quantum Computing in the NISQ era and beyond" (2018).