Mining Spatial and Trajectory Data
Spatial data mining studies objects with location, geometry, distance, neighborhood, and regional context. Aggarwal's spatial chapter covers mining with contextual spatial attributes and trajectory mining. Spatial data sits between multidimensional mining and graph/time-series mining: a point has numeric coordinates, but its meaning often depends on nearby objects, regions, paths, and movement through space.
This page covers spatial context, spatial autocorrelation, spatial clustering, spatial outlier detection, trajectory pattern mining, trajectory clustering, trajectory outlier detection, and trajectory classification.
Definitions
A spatial object has geometric information such as a point, line, polygon, raster cell, or region.
A spatial attribute is a feature tied to location. Examples include population density, elevation, pollution, store revenue, and temperature.
Spatial autocorrelation means nearby locations tend to have related values. Positive autocorrelation occurs when nearby values are similar; negative autocorrelation occurs when nearby values differ systematically.
A neighborhood relation defines which objects influence each other. It may use distance radius, nearest neighbors, adjacency of regions, or road-network distance.
A trajectory is an ordered sequence of spatial positions:
Trajectory mining analyzes movement paths for patterns, clusters, outliers, or labels.
Map matching aligns raw location observations to a road network or other spatial network.
Trajectory similarity may use pointwise distance, longest common subsequence, edit-distance variants, dynamic time warping, or route-overlap measures.
Key results
Spatial context changes feature meaning. A high crime count may be expected in a dense downtown region and unusual in a small residential region. Spatial mining often needs contextual normalization.
Nearby observations are not independent. Standard validation assumptions can fail when train and test sets contain neighboring locations. Spatial cross-validation should separate regions when testing generalization to new areas.
Spatial outliers are often local. A location may be globally ordinary but locally anomalous relative to its neighbors. A temperature of 30 degrees may be normal in summer overall but unusual if surrounding sensors report 20.
Trajectory mining combines order and geometry. A trajectory is not just a set of coordinates. Direction, speed, stops, turns, and time gaps can be essential.
Trajectory simplification can reduce noise. GPS traces may contain many redundant points. Simplification keeps shape-defining points, reducing cost before clustering or classification.
Road-network distance may matter more than Euclidean distance. Two points separated by a river, wall, or road network may be close geometrically but far in travel time.
Spatial scale controls what counts as local. A neighborhood radius of 100 meters might detect building-level anomalies, while 10 kilometers might detect city-level regions. The same point can be normal at one scale and unusual at another. Multi-scale analysis is common in spatial mining because geographic processes such as traffic, disease, weather, and retail demand operate at different resolutions.
Trajectory data usually need event extraction. Raw GPS points are often too noisy and dense for direct mining. Analysts may extract stops, moves, road segments, turns, speeds, and dwell times before clustering or classification. This converts a numeric path into a richer sequence of movement events, but it also introduces parameters such as stop-duration thresholds and map-matching confidence.
Spatial joins are often the expensive step. Many spatial mining tasks start by connecting points to regions, sensors to nearby sensors, trips to roads, or events to neighborhoods. Indexes such as grids, R-trees, and geohashes are therefore not just database details; they determine whether the mining workflow can run at the required scale.
Boundary effects require care. A point near the edge of the study region has fewer observed neighbors, not necessarily fewer real neighbors. Local density and local outlier scores can be biased at borders unless the analysis accounts for the observation window.
Coordinate reference systems matter. Distances, areas, and angles depend on projection. A method that is correct for planar coordinates can be wrong on latitude-longitude pairs if the study region is large. Spatial mining should record the coordinate system, projection, and distance function before interpreting numeric distances.
Privacy is often spatial too. Trajectories and home-work patterns can identify people even after names are removed. Aggregation, spatial masking, and minimum-count rules may be needed before publishing maps or mined mobility patterns.
Visualization is useful but can mislead. Map projections, marker sizes, overplotting, and color scales can create apparent clusters or hide sparse anomalies. Spatial mining should pair maps with quantitative checks.
Temporal context often matters for spatial data. A location pattern during rush hour, a storm, or a holiday may be normal only within that temporal context. Spatial and time-series preparation frequently need to be combined.
Visual
Trajectory example
t1 t2 t3 t4
* ------ * ------ * -----> *
\ road B
\
* ------ * road A
The same points may be close in Euclidean space but different as routes.
| Task | Static spatial version | Trajectory version |
|---|---|---|
| Similarity | Distance or neighbor context | Route, shape, timing similarity |
| Clustering | Group nearby similar locations | Group similar movement paths |
| Outlier detection | Local deviation from neighbors | Unusual route, stop, speed, or segment |
| Classification | Predict region label | Predict trip purpose or object type |
| Pattern mining | Co-located events | Frequent routes or movement motifs |
Worked example 1: Local spatial outlier score
Problem. Four sensors report temperature:
| sensor | location | value |
|---|---|---|
| A | center | 20 |
| B | near A | 21 |
| C | near A | 19 |
| D | near A | 35 |
Use the difference between a sensor value and the mean of its three neighbors as a local outlier score. Compute the score for D and A.
Method.
- For D, neighbors are A, B, and C with values 20, 21, and 19.
- Neighbor mean:
- D's absolute local deviation:
- For A, neighbors are B, C, and D with values 21, 19, and 35.
- Neighbor mean:
- A's absolute local deviation:
Checked answer. D has a larger local anomaly score, 15 versus 5. The result also shows a pitfall: if an outlier is included as a neighbor, it can distort other local scores.
Worked example 2: Trajectory simplification by perpendicular distance
Problem. A trajectory has points
If we approximate the first three points by line segment , should be kept when tolerance is 0.2?
Method.
- Segment lies on the x-axis from to .
- The perpendicular distance from to the x-axis is .
- Compare with tolerance:
- Therefore can be removed without exceeding the tolerance for this segment.
- Point cannot be represented by the same segment without a much larger deviation, so it should be kept as a shape-changing point.
Checked answer. Remove under tolerance 0.2; keep , , and for the simplified trajectory.
Code
Pseudocode for local spatial outlier detection:
INPUT: spatial points with values, neighbor rule N
OUTPUT: local outlier scores
for each point i:
neighbors = N(i)
expected = average value among neighbors
score[i] = abs(value[i] - expected)
rank points by descending score
return scores
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10]], dtype=float)
values = np.array([20, 21, 19, 35], dtype=float)
nn = NearestNeighbors(n_neighbors=3).fit(coords)
dist, ind = nn.kneighbors(coords)
scores = []
for i, neighbors in enumerate(ind):
# first neighbor is the point itself; skip it
neigh = [j for j in neighbors if j != i]
expected = values[neigh].mean()
scores.append(abs(values[i] - expected))
print("local scores:", np.round(scores, 3))
trajectory_points = np.array([[0, 0], [1, 0.1], [2, 0], [3, 3]], dtype=float)
route_clusters = DBSCAN(eps=1.5, min_samples=2).fit_predict(trajectory_points)
print("point-level spatial clusters:", route_clusters)
Common pitfalls
- Treating latitude and longitude as ordinary Euclidean coordinates over large areas.
- Ignoring road-network or barrier effects when travel distance matters.
- Randomly splitting nearby spatial points into train and test sets, causing optimistic evaluation.
- Removing "noisy" GPS points that are actually meaningful stops or turns.
- Comparing trajectories only by start and end points.
- Forgetting that local spatial outliers can be hidden by global statistics.
- Using a single spatial scale when patterns exist at multiple scales.