
Vector Semantics and Embeddings

Vector semantics represents word meaning with numbers. Jurafsky and Martin introduce count vectors, cosine similarity, TF-IDF, PMI, PPMI, word2vec, GloVe, visualization, evaluation, and bias. Eisenstein frames the same area around the distributional hypothesis, distinguishing distributional statistics from distributed representations and adding Brown clusters, latent semantic analysis, structured contexts, and formal links among embedding objectives.

The central idea is that words occurring in similar contexts tend to have related meanings. Count-based methods make this idea explicit with word-context matrices. Predictive embedding methods learn dense vectors that are more compact and often more useful downstream. Modern contextual language models extend the same idea by giving each token occurrence its own vector.

Definitions

The distributional hypothesis says that a word's meaning is reflected in the contexts in which it appears. A word-context matrix X has rows for target words and columns for contexts. An entry might be a raw count, a weighted count, or a transformed association score.

Cosine similarity compares vector directions:

\cos(u,v)=\frac{u^\top v}{||u||\,||v||}.

Cosine is common because vector length often reflects frequency more than semantic content.
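
As a quick numeric check of the formula, here is a minimal pure-Python computation on two made-up count vectors:

from math import sqrt

# Two toy vectors over three contexts (illustrative values only).
u = [2.0, 1.0, 0.0]
v = [1.0, 1.0, 0.0]

dot = sum(a * b for a, b in zip(u, v))  # u^T v = 3
norm_u = sqrt(sum(a * a for a in u))    # ||u|| = sqrt(5)
norm_v = sqrt(sum(b * b for b in v))    # ||v|| = sqrt(2)

print(dot / (norm_u * norm_v))  # about 0.949

Scaling either vector by a constant leaves the result unchanged, which is exactly why cosine discounts frequency.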

TF-IDF weights a term by how frequent it is in a document and how rare it is across documents:

\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\log\frac{N}{\mathrm{df}(t)}.

Pointwise mutual information measures association between a word w and context c:

\mathrm{PMI}(w,c)=\log\frac{P(w,c)}{P(w)P(c)}.

Positive PMI clips negative values:

\mathrm{PPMI}(w,c)=\max(\mathrm{PMI}(w,c),0).

Embeddings are dense vectors, usually with tens to thousands of dimensions. Word2vec learns embeddings by predicting words from contexts or contexts from words. GloVe factorizes a transformation of global co-occurrence counts. Both produce static embeddings: one vector per word type.

Key results

Sparse count vectors are interpretable but high-dimensional. Dense embeddings are less interpretable but compact and effective. Count-based and predictive methods are not opposites: many predictive objectives can be understood as implicitly factorizing a shifted PMI-like matrix.
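
Concretely, Levy and Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a matrix whose entries, at the optimum, satisfy

v_c^\top u_w=\mathrm{PMI}(w,c)-\log k,

where k is the number of negative samples.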

The skip-gram version of word2vec predicts context words from a target word. With negative sampling, it trains a binary classifier to distinguish observed word-context pairs from randomly sampled noise pairs. For target w, observed context c, and negative samples n_i, a simplified objective is

\log\sigma(v_c^\top u_w)+\sum_i\log\sigma(-v_{n_i}^\top u_w).
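
As a minimal numeric sketch of this objective, using made-up toy embeddings (u_w is the target vector; v_c and the v_{n_i} are context vectors):

from math import exp, log

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors for one observed (target, context) pair and two negatives.
u_w = [0.5, -0.2, 0.1]
v_c = [0.4, 0.1, 0.3]
negatives = [[0.0, 0.6, -0.5], [-0.3, 0.2, 0.2]]

# log sigma(v_c . u_w) + sum_i log sigma(-v_ni . u_w); training pushes this up.
objective = log(sigmoid(dot(v_c, u_w)))
objective += sum(log(sigmoid(-dot(v_n, u_w))) for v_n in negatives)
print(objective)  # about -1.82

Gradient ascent on this quantity moves u_w toward v_c and away from the sampled negatives.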

CBOW reverses the direction: it predicts the target from a bag or average of context embeddings. GloVe minimizes a weighted squared error:

J=\sum_{i,j} f(X_{ij})\left(w_i^\top \tilde{w}_j+b_i+\tilde{b}_j-\log X_{ij}\right)^2.
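
A minimal sketch of evaluating J on a toy 2x2 co-occurrence matrix; the parameter values are arbitrary, and f is the weighting function from the GloVe paper, f(x)=(x/x_max)^0.75 capped at 1:

from math import log

X = [[10.0, 2.0], [3.0, 8.0]]        # toy co-occurrence counts
w = [[0.1, 0.2], [0.3, -0.1]]        # target vectors w_i
w_tilde = [[0.0, 0.4], [0.2, 0.1]]   # context vectors w~_j
b = [0.1, -0.2]                      # target biases b_i
b_tilde = [0.05, 0.0]                # context biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    # Down-weights rare pairs; caps the influence of very frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

J = 0.0
for i in range(2):
    for j in range(2):
        pred = sum(a * c for a, c in zip(w[i], w_tilde[j])) + b[i] + b_tilde[j]
        J += f(X[i][j]) * (pred - log(X[i][j])) ** 2
print(J)

Training adjusts the vectors and biases to drive each w_i^T w~_j + b_i + b~_j toward log X_ij.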

Context definition changes what similarity means. A small symmetric window captures topical and functional similarity. Dependency contexts can emphasize syntactic substitutability. Document contexts support information retrieval and topic similarity. Subword-enriched embeddings improve rare and morphologically complex words.

Static embeddings conflate senses. The vector for bank averages financial and river contexts. Contextual embeddings from BERT-style and GPT-style models produce different vectors for different occurrences, which helps word-sense disambiguation and downstream tasks.
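
A minimal sketch of this effect, assuming the Hugging Face transformers and torch packages are installed (the model name and sentences are illustrative):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["she sat on the river bank", "she deposited cash at the bank"]
vectors = []
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    # Locate the "bank" token and keep its contextual vector.
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    vectors.append(hidden[idx])

# Cosine similarity is well below 1: each occurrence gets its own vector.
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0).item())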

Embeddings also encode social biases present in training corpora. Bias is not only a geometric artifact; it is a data, annotation, deployment, and evaluation issue. Measuring nearest neighbors, analogy behavior, or downstream disparities can reveal problems, but mitigation must be tied to the actual application.

Count vectors and embeddings differ in how much they expose the modeling assumptions. With PPMI, one can inspect which contexts make two words similar. With dense vectors, the dimensions are usually not directly interpretable, but the model can compress many weak contextual signals into a useful representation. This creates a tradeoff between transparency and downstream accuracy. A good workflow is to start with count-based representations to understand the corpus, then use embeddings when generalization and compactness matter.

The context window is a hidden semantic choice. A wide window tends to group words by topic, so doctor may be close to hospital. A narrow or dependency-based context tends to group words by substitutability, so doctor may be closer to nurse or surgeon. Neither is universally correct. Information retrieval, analogy solving, WSD, and syntactic features can prefer different context definitions.

Static embeddings are often evaluated with word similarity or analogy datasets, but these tests cover only a small slice of usefulness. Downstream evaluation is still necessary. A representation that looks worse on analogies may be better for NER, sentiment, or low-resource transfer if its subword modeling, corpus domain, or frequency handling better matches the task.

Visual

| Representation | Matrix or objective | Best for | Main limitation |
|---|---|---|---|
| Raw counts | C(w,c) | transparent corpus analysis | frequency dominates |
| TF-IDF | term by document | retrieval and document similarity | not a lexical semantic model by itself |
| PPMI | word by context | interpretable associations | sparse and noisy for rare words |
| LSA/SVD | low-rank count matrix | compression and smoothing | expensive on large matrices |
| word2vec | prediction with negative sampling | efficient dense embeddings | static senses |
| GloVe | weighted count factorization | global co-occurrence structure | static senses |
| Contextual LM | transformer hidden states | token meaning in context | expensive and layer-dependent |

Worked example 1: TF-IDF by hand

Problem: compute TF-IDF for cat in document d_1.

Corpus:

d1: cat sat cat
d2: dog sat
d3: cat ate fish

Use raw term frequency and \mathrm{idf}(t)=\log(N/\mathrm{df}(t)).

  1. Number of documents:
N=3.
  2. Term frequency of cat in d_1:
\mathrm{tf}(\mathrm{cat},d_1)=2.
  3. Document frequency of cat: it appears in d_1 and d_3, so
\mathrm{df}(\mathrm{cat})=2.
  4. IDF:
\mathrm{idf}(\mathrm{cat})=\log\frac{3}{2}\approx0.405.
  5. TF-IDF:
\mathrm{tfidf}(\mathrm{cat},d_1)=2(0.405)\approx0.811.

Checked answer: cat in d_1 has TF-IDF about 0.811 with natural logarithms. It receives less weight than a word that appears in only one document, because it is not maximally document-specific.
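
The same computation in a few lines of Python, using raw term frequency and natural log to match the hand calculation:

from math import log

docs = {
    "d1": "cat sat cat".split(),
    "d2": "dog sat".split(),
    "d3": "cat ate fish".split(),
}

def tfidf(term, doc_id):
    tf = docs[doc_id].count(term)                       # raw term frequency
    df = sum(term in words for words in docs.values())  # document frequency
    return tf * log(len(docs) / df)

print(tfidf("cat", "d1"))  # about 0.811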

Worked example 2: PPMI for a word-context pair

Problem: compute PPMI for target coffee and context drink. Suppose a corpus has 1000 total word-context pairs. The pair (coffee, drink) occurs 20 times. coffee appears in 50 pairs, and drink appears as a context in 100 pairs.

  1. Estimate probabilities:
P(\mathrm{coffee},\mathrm{drink})=\frac{20}{1000}=0.02,\qquad P(\mathrm{coffee})=\frac{50}{1000}=0.05,\qquad P(\mathrm{drink})=\frac{100}{1000}=0.10.
  2. Compute the independence expectation:
P(\mathrm{coffee})P(\mathrm{drink})=0.05(0.10)=0.005.
  3. Compute PMI:
\mathrm{PMI}=\log\frac{0.02}{0.005}=\log 4\approx1.386.
  4. Clip at zero:
\mathrm{PPMI}=\max(1.386,0)=1.386.

Checked answer: PPMI is about 1.386, indicating that coffee and drink co-occur four times more often than independence would predict.
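
The same check in a few lines of Python:

from math import log

p_wc, p_w, p_c = 20 / 1000, 50 / 1000, 100 / 1000
print(max(log(p_wc / (p_w * p_c)), 0.0))  # about 1.386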

Code

from collections import Counter
from math import log, sqrt

sentences = [
    "coffee drink hot",
    "tea drink hot",
    "coffee bean roast",
    "dog bark loud",
]

window = 1                  # symmetric context window size
pairs = Counter()           # (word, context) co-occurrence counts
word_counts = Counter()     # word occurrence counts
context_counts = Counter()  # context occurrence counts

for sent in sentences:
    words = sent.split()
    for i, w in enumerate(words):
        word_counts[w] += 1
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i == j:
                continue
            c = words[j]
            pairs[(w, c)] += 1
            context_counts[c] += 1

total = sum(pairs.values())
vocab = sorted(word_counts)

def ppmi(w, c):
    # Positive PMI: max(log P(w,c) / (P(w) P(c)), 0), with probabilities
    # estimated from the pair counts.
    if pairs[(w, c)] == 0:
        return 0.0
    p_wc = pairs[(w, c)] / total
    p_w = sum(pairs[(w, x)] for x in context_counts) / total
    p_c = context_counts[c] / total
    return max(log(p_wc / (p_w * p_c)), 0.0)

def vector(w):
    # PPMI vector for w over the whole vocabulary as contexts.
    return [ppmi(w, c) for c in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(vector("coffee"), vector("tea")))  # related: shared "drink" context
print(cosine(vector("coffee"), vector("dog")))  # unrelated: no shared contexts, 0.0

Common pitfalls

  • Treating cosine similarity as meaning equivalence; it may reflect topic, syntax, domain, or frequency artifacts.
  • Forgetting that PMI is biased toward rare events unless counts are smoothed or thresholded (a smoothing sketch follows this list).
  • Comparing embeddings trained with different corpora, dimensions, tokenization, or context windows as if they were directly equivalent.
  • Ignoring out-of-vocabulary and rare word behavior.
  • Assuming analogies prove deep semantic understanding.
  • Using static embeddings for sense-sensitive tasks without checking ambiguity.
  • Treating embedding bias as solved by a single vector projection.
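
On the rare-event bias of PMI, one common mitigation, from Levy, Goldberg, and Dagan (2015), is context-distribution smoothing: raise context counts to a power alpha = 0.75 before normalizing. A minimal sketch, assuming the pairs and context_counts Counters built in the Code section above:

from math import log

def smoothed_ppmi(w, c, pairs, context_counts, alpha=0.75):
    # Context-distribution smoothing: P_alpha(c) = count(c)^alpha / sum_x count(x)^alpha.
    # Raising counts to alpha < 1 inflates rare-context probabilities,
    # which shrinks the otherwise inflated PMI of rare (w, c) pairs.
    if pairs[(w, c)] == 0:
        return 0.0
    total = sum(pairs.values())
    p_wc = pairs[(w, c)] / total
    p_w = sum(pairs[(w, x)] for x in context_counts) / total
    z = sum(n ** alpha for n in context_counts.values())
    p_c = context_counts[c] ** alpha / z
    return max(log(p_wc / (p_w * p_c)), 0.0)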

Connections