Dependency Parsing
Dependency parsing represents syntax as directed relations between words. Jurafsky and Martin present dependency relations, transition-based parsing, graph-based parsing, and evaluation. Eisenstein gives a formal structured-prediction account: dependency parses are directed graphs over words, graph-based parsers search for high-scoring trees, and transition-based parsers build trees through sequences of actions.
Dependency syntax is often more compact than constituency syntax and is widely used across languages through Universal Dependencies. It directly identifies who depends on whom: subjects attach to predicates, objects attach to verbs, modifiers attach to heads, and function words receive language-specific analyses. These arcs are useful for information extraction, semantic role labeling, relation extraction, and downstream interpretation.
Definitions
A dependency parse is a directed graph over the tokens of a sentence, usually a tree rooted at a special ROOT node. Each word except the root has exactly one head. An arc may be labeled, for example nsubj, obj, amod, or obl.
For a sentence w1, ..., wn, a dependency tree can be written as a set of arcs:

T = { (h, m, l) }

Here h is the head index, m is the modifier index, and l is the relation label.
A parse is projective if its arcs do not cross when drawn above the sentence. Informally, every word between a head and dependent is inside the head's subtree. Non-projective arcs occur in many languages and are important for free word order.
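Projectivity can be checked directly from a head array. The sketch below uses a hypothetical `is_projective` helper, assuming words are 1-indexed and 0 denotes ROOT: two arcs cross exactly when one arc's endpoint lies strictly inside the other arc's span while its other endpoint lies outside.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (words are 1-indexed; 0 is ROOT)."""
    # Represent each arc as an (left, right) span regardless of direction.
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            # Crossing pattern: l1 < l2 < r1 < r2.
            if l1 < l2 < r1 < r2:
                return False
    return True

# "I saw dogs": I <- saw, saw <- ROOT, dogs <- saw; no arcs cross.
print(is_projective([2, 0, 2]))      # True
# Arcs (1,3) and (2,4) cross, so this head array is non-projective.
print(is_projective([3, 4, 0, 3]))   # False
```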
A transition-based parser uses a parser state, often including a stack, buffer, and set of arcs. Actions such as SHIFT, LEFT-ARC, and RIGHT-ARC update the state until a tree is complete.
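The three arc-standard actions can be sketched as pure functions on a (stack, buffer, arcs) state. This is a minimal illustration with hypothetical helper names, not a full parser: real systems add action preconditions and a trained classifier to pick the next action.

```python
def shift(stack, buffer, arcs):
    # Move the front of the buffer onto the stack.
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, label):
    # Top of stack becomes head of the word below it; remove the dependent.
    head, dep = stack[-1], stack[-2]
    return stack[:-2] + [head], buffer, arcs | {(head, dep, label)}

def right_arc(stack, buffer, arcs, label):
    # Word below the top becomes head of the top; remove the dependent.
    head, dep = stack[-2], stack[-1]
    return stack[:-1], buffer, arcs | {(head, dep, label)}

state = (["ROOT"], ["I", "saw", "dogs"], set())
state = shift(*state)
state = shift(*state)
state = left_arc(*state, "nsubj")   # adds saw -> I
state = shift(*state)
state = right_arc(*state, "obj")    # adds saw -> dogs
state = right_arc(*state, "root")   # adds ROOT -> saw
print(state)  # stack ["ROOT"], empty buffer, three labeled arcs
```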
A graph-based parser scores possible arcs and searches for the highest-scoring valid tree:

T* = argmax over valid trees T of the sum of s(h, m, l) over arcs (h, m, l) in T
An arc-factored model assumes the score of a tree decomposes as a sum of independent arc scores.
Key results
Dependency parsing is a structured prediction problem. The number of possible trees grows rapidly with sentence length, but useful decompositions make search tractable.
Transition-based parsers are fast. In an arc-standard system, a projective parse can be built with a linear number of actions. At each state, a classifier chooses the next action from features of the stack, buffer, existing arcs, and neural representations. The risk is search error: an early wrong action can make the gold tree unreachable.
Graph-based parsers search more globally. If the parse must be projective, dynamic programming algorithms related to lexicalized constituency parsing can find the best tree. If non-projective trees are allowed and the model is arc-factored, maximum spanning tree algorithms such as Chu-Liu-Edmonds can find the best tree. Higher-order graph-based parsing scores siblings or grandparents, but exact inference becomes more expensive and sometimes NP-hard for non-projective trees.
Evaluation uses unlabeled attachment score (UAS) and labeled attachment score (LAS):

UAS = (number of words with the correct head) / (number of words)

LAS = (number of words with the correct head and the correct label) / (number of words)
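Both scores take a few lines to compute. The `attachment_scores` helper below is hypothetical, assuming gold and predicted parses are given as aligned lists of (head, label) pairs:

```python
def attachment_scores(gold, pred):
    """gold and pred: lists of (head, label) pairs, one per word."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# "I saw dogs": the prediction gets every head right but one label wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(attachment_scores(gold, pred))  # UAS 1.0, LAS 2/3
```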
Dependency choices are not purely mathematical; annotation guidelines matter. In Universal Dependencies, content words are often preferred as heads, which can differ from older constituency-derived head rules. Coordination, ellipsis, copulas, prepositions, and multiword expressions are common sources of disagreement.
Modern parsers usually score arcs using contextual encoders such as BiLSTMs or transformers. A biaffine scorer computes head-dependent compatibility for all pairs, and a tree algorithm or local selection produces the final parse.
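A biaffine arc scorer can be sketched in plain NumPy. Everything here is illustrative: random vectors stand in for contextual encoder states, the dimensions are arbitrary, and the final local argmax is what a tree algorithm would replace in a real parser.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                            # 4 tokens (including ROOT), hidden size 8
H = rng.standard_normal((n, d))        # stand-in for encoder states

# Separate projections for the "as head" and "as dependent" roles.
W_head = rng.standard_normal((d, d))
W_dep = rng.standard_normal((d, d))
U = rng.standard_normal((d, d))        # biaffine interaction matrix
b = rng.standard_normal(d)             # head bias term

heads = H @ W_head                     # (n, d)
deps = H @ W_dep                       # (n, d)

# scores[h, m]: compatibility of token h as the head of token m,
# computed for all pairs at once.
scores = heads @ U @ deps.T + (heads @ b)[:, None]
print(scores.shape)                    # (4, 4)

# Local selection: each word (columns 1..n-1, excluding ROOT) picks
# its highest-scoring head. A tree constraint is not enforced here.
print(scores[:, 1:].argmax(axis=0))
```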
The contrast with constituency parsing is not only visual. Constituency structures emphasize nested spans; dependency structures emphasize bilexical relations. This makes dependencies attractive for relation extraction, where the shortest dependency path between two entities can be more informative than the surface words between them. It also makes dependencies easier to compare across languages with freer word order, although annotation decisions still vary.
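The shortest dependency path between two words runs through their lowest common ancestor and can be found by walking head links. The helpers below are hypothetical, assuming words are 1-indexed with 0 for ROOT:

```python
def path_to_root(i, heads):
    """Indices from word i up to ROOT (0), following head links."""
    path = [i]
    while path[-1] != 0:
        path.append(heads[path[-1] - 1])
    return path

def shortest_dependency_path(i, j, heads):
    """Word indices from i to j through their lowest common ancestor."""
    up_i = path_to_root(i, heads)
    up_j = path_to_root(j, heads)
    ancestors_j = set(up_j)
    # Climb from i until we reach an ancestor of j.
    prefix = []
    for node in up_i:
        prefix.append(node)
        if node in ancestors_j:
            break
    lca = prefix[-1]
    # Descend from the LCA to j along j's chain.
    down = up_j[:up_j.index(lca)]
    return prefix + list(reversed(down))

# "I saw dogs" with heads [2, 0, 2]: the path from I to dogs goes via saw.
print(shortest_dependency_path(1, 3, [2, 0, 2]))  # [1, 2, 3]
```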
Transition-based and graph-based parsers embody a tradeoff between greedy flexibility and global optimality. A transition parser can use rich features of the current configuration and run very quickly, but it may commit to an action that blocks the correct tree. A graph-based parser can optimize a global tree objective, but the objective is often restricted to arc-factored or low-order scores so inference remains tractable. Neural scoring reduces the need for manual feature engineering in both families.
Projectivity is a useful diagnostic. If a language or domain has many non-projective sentences, a parser restricted to projective trees has a ceiling no matter how good its classifier is. If the corpus is mostly projective, projective parsing may be faster and simpler. The right algorithm depends on treebank guidelines, language typology, and downstream tolerance for attachment errors.
Dependency parsing also depends heavily on tokenization and multiword conventions. A contraction, compound name, or fixed expression may be represented as one token, several syntactic words, or a multiword token with internal structure. Universal Dependencies makes these conventions explicit, but downstream users often see only the final arcs. If relation extraction relies on dependency paths, a tokenization mismatch can change the path and therefore the extracted relation.
Error analysis should distinguish head errors from label errors. A parser may attach a modifier to the correct head but choose obl instead of nmod, or choose the wrong head with a plausible label. These two failures affect downstream systems differently. Relation extraction often cares more about the path topology, while grammar feedback may care strongly about the label.
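This breakdown can be tallied directly. The `error_breakdown` helper and its four categories are hypothetical, again over aligned lists of (head, label) pairs:

```python
def error_breakdown(gold, pred):
    """Count correct attachments and the three error types."""
    counts = {"correct": 0, "label_only": 0, "head_only": 0, "both": 0}
    for (gh, gl), (ph, pl) in zip(gold, pred):
        head_ok, label_ok = gh == ph, gl == pl
        if head_ok and label_ok:
            counts["correct"] += 1
        elif head_ok:
            counts["label_only"] += 1   # right head, wrong relation
        elif label_ok:
            counts["head_only"] += 1    # wrong head, plausible label
        else:
            counts["both"] += 1
    return counts

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "nmod")]
print(error_breakdown(gold, pred))
# {'correct': 2, 'label_only': 1, 'head_only': 1, 'both': 0}
```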
Visual
| Parser type | Search unit | Strength | Weakness |
|---|---|---|---|
| Transition-based | actions over stack and buffer | very fast, flexible features | search errors |
| Graph-based arc-factored | dependency arcs | global tree optimization | assumes arc scores are independent |
| Higher-order graph-based | arcs plus siblings or grandparents | richer structure | expensive inference |
| Neural biaffine | contextual pair scores | strong modern accuracy | needs labeled treebanks |
Worked example 1: arc-standard transition parsing
Problem: parse I saw dogs with arcs saw -> I as nsubj, saw -> dogs as obj, and ROOT -> saw as root. Use an arc-standard style stack and buffer. Initial state has stack [ROOT], buffer [I, saw, dogs], arcs {}.
1. `SHIFT` moves `I` to the stack:
   - Stack: `[ROOT, I]`
   - Buffer: `[saw, dogs]`
2. `SHIFT` moves `saw`:
   - Stack: `[ROOT, I, saw]`
   - Buffer: `[dogs]`
3. `LEFT-ARC(nsubj)` adds `saw -> I` and removes `I`:
   - Stack: `[ROOT, saw]`
   - Buffer: `[dogs]`
   - Arcs: `{saw -> I}`
4. `SHIFT` moves `dogs`:
   - Stack: `[ROOT, saw, dogs]`
   - Buffer: `[]`
5. `RIGHT-ARC(obj)` adds `saw -> dogs` and removes `dogs`:
   - Stack: `[ROOT, saw]`
   - Arcs: `{saw -> I, saw -> dogs}`
6. `RIGHT-ARC(root)` adds `ROOT -> saw` and removes `saw`:
   - Stack: `[ROOT]`
   - Buffer: `[]`
   - Arcs: `{saw -> I, saw -> dogs, ROOT -> saw}`
Checked answer: the final arc set matches the intended parse.
Worked example 2: choosing a graph-based parse
Problem: choose heads for I saw dogs from these unlabeled arc scores:
| Arc | Score |
|---|---|
| ROOT -> saw | 5 |
| ROOT -> I | 0 |
| ROOT -> dogs | 0 |
| saw -> I | 4 |
| saw -> dogs | 3 |
| I -> saw | 1 |
| dogs -> saw | 1 |
| I -> dogs | 0 |
| dogs -> I | 0 |
A valid dependency tree over three words must give each word one head and avoid cycles.
- For `saw`, the best incoming arc is `ROOT -> saw` with score 5.
- For `I`, the best incoming arc is `saw -> I` with score 4.
- For `dogs`, the best incoming arc is `saw -> dogs` with score 3.
- These arcs form a connected tree with no cycle.
- Total score: 5 + 4 + 3 = 12.
Checked answer: choose the three arcs above. In more complex cases, locally best incoming arcs can create cycles, so a maximum spanning tree algorithm is needed.
Code
```python
tokens = ["ROOT", "I", "saw", "dogs"]
scores = {
    ("ROOT", "saw"): 5,
    ("ROOT", "I"): 0,
    ("ROOT", "dogs"): 0,
    ("saw", "I"): 4,
    ("saw", "dogs"): 3,
    ("I", "saw"): 1,
    ("dogs", "saw"): 1,
    ("I", "dogs"): 0,
    ("dogs", "I"): 0,
}

def reaches_root(word, heads):
    """Follow head links from word; True if ROOT is reached without a cycle."""
    seen = set()
    while word != "ROOT":
        if word in seen:
            return False
        seen.add(word)
        word = heads[word]
    return True

# Tiny exhaustive search for demonstration only.
best = None
for h_i in tokens:
    if h_i == "I":
        continue
    for h_saw in tokens:
        if h_saw == "saw":
            continue
        for h_dogs in tokens:
            if h_dogs == "dogs":
                continue
            arcs = [(h_i, "I"), (h_saw, "saw"), (h_dogs, "dogs")]
            # Exactly one word may attach to ROOT.
            if sum(1 for h, m in arcs if h == "ROOT") != 1:
                continue
            # Reject cycles: every word must reach ROOT via its head chain.
            heads = {m: h for h, m in arcs}
            if not all(reaches_root(w, heads) for w in tokens[1:]):
                continue
            total = sum(scores.get(arc, -99) for arc in arcs)
            if best is None or total > best[0]:
                best = (total, arcs)

print(best)  # (12, [('saw', 'I'), ('ROOT', 'saw'), ('saw', 'dogs')])
```
Common pitfalls
- Assuming every dependency annotation scheme chooses the same head.
- Ignoring non-projective constructions in languages where they are common.
- Reporting UAS only when relation labels matter for downstream tasks.
- Using transition systems without understanding which trees they can produce.
- Letting local arc choices form cycles in graph-based parsing.
- Comparing parsers across treebanks with different tokenization or punctuation conventions.
- Treating dependency parses as complete semantics; they are syntactic structures, not full meaning representations.