Data Mining

Data mining is the study of collecting, preparing, modeling, and interpreting data so that large raw data sets yield useful structure. In Charu C. Aggarwal's organization, the subject is built around a practical pipeline and a small number of reusable analytical building blocks: association pattern mining, clustering, outlier detection, and classification. The same building blocks are then adapted to text, time series, discrete sequences, spatial data, graphs, web data, social networks, streams, and privacy-sensitive settings.

These notes are detailed study pages for Aggarwal's Data Mining: The Textbook. They emphasize definitions, algorithmic ideas, worked numerical examples, pseudocode, and Python snippets. The goal is not to replace the book's full treatment, but to give a structured wiki path through the concepts that recur across chapters.

Definitions

Data mining is the process of collecting, cleaning, transforming, analyzing, and interpreting data to discover useful patterns, build predictive models, or identify unusual behavior.

A data object is the unit of analysis: a row in a table, a document, a transaction, a time series, a graph node, a trajectory, or an entire graph.

A feature representation maps raw data into variables or structures that algorithms can consume. The representation may be a dense numeric matrix, sparse term-document matrix, transaction database, sequence database, graph, stream synopsis, or privacy-protected table.

The four recurring analytical building blocks are:

  1. Association pattern mining, which finds frequently co-occurring items, events, or attributes.
  2. Cluster analysis, which groups unlabeled objects by similarity.
  3. Outlier analysis, which ranks or flags unusual objects.
  4. Classification, which learns predictive models from labeled examples.

The generated chapter pages are:

| Book chapter | Wiki page |
| --- | --- |
| 1. An Introduction to Data Mining | Data Mining Process and Data Types |
| 2. Data Preparation | Data Preparation and Feature Selection and Dimensionality Reduction |
| 3. Similarity and Distances | Similarity and Distances |
| 4. Association Pattern Mining | Association Pattern Mining |
| 5. Association Pattern Mining: Advanced Concepts | Advanced Association Patterns |
| 6. Cluster Analysis | Cluster Analysis |
| 7. Cluster Analysis: Advanced Concepts | Advanced Clustering Concepts |
| 8. Outlier Analysis | Outlier Analysis |
| 9. Outlier Analysis: Advanced Concepts | Advanced Outlier Analysis |
| 10. Data Classification | Data Classification |
| 11. Data Classification: Advanced Concepts | Advanced Classification Concepts |
| 12. Mining Data Streams | Mining Data Streams and Big Data |
| 13. Mining Text Data | Mining Text Data |
| 14. Mining Time Series Data | Mining Time Series Data |
| 15. Mining Discrete Sequences | Mining Discrete Sequences |
| 16. Mining Spatial Data | Mining Spatial and Trajectory Data |
| 17. Mining Graph Data | Mining Graph Data |
| 18. Mining Web Data | Mining Web Data and Recommenders |
| 19. Social Network Analysis | Social Network Analysis |
| 20. Privacy-Preserving Data Mining | Privacy-Preserving Data Mining |

Key results

The mining pipeline is iterative. Data collection, feature extraction, cleaning, transformation, modeling, and interpretation are not one-way stages. A model can reveal that a feature is malformed, a cleaning rule is too aggressive, or a label definition is inconsistent. Good data mining practice loops back.

The data type shapes the algorithm. A Euclidean distance that makes sense for standardized numeric vectors may fail for text, categories, time series, graphs, or trajectories. Much of the book is about adapting a small set of core tasks to many data types.
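
A minimal sketch of this point, using invented term-count vectors: Euclidean distance on raw text counts is dominated by document length, while cosine similarity compares direction only, so it recovers the topical match.

```python
import numpy as np

# Hypothetical term-count vectors: two documents with the same topic mix
# but different lengths, plus one document about an unrelated topic.
doc_a = np.array([10.0, 8.0, 0.0, 0.0])    # short document on topic 1
doc_b = np.array([100.0, 80.0, 0.0, 0.0])  # long document, same topic mix
doc_c = np.array([0.0, 0.0, 9.0, 11.0])    # different topic

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Euclidean distance calls doc_a closer to the unrelated doc_c than to
# doc_b, purely because of the length difference.
print(euclidean(doc_a, doc_b), euclidean(doc_a, doc_c))
# Cosine similarity sees doc_a and doc_b as identical in direction.
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```

This is the reason text chapters switch from raw counts to normalized TF-IDF vectors and cosine-style measures.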

Pattern mining, clustering, outlier detection, and classification are reusable components. Web recommendation can use association patterns and matrix factorization. Fraud detection can use outlier scores and supervised classifiers. Text mining can use clustering, classification, and topic models. Graph mining can use frequent patterns, clustering, and classification after topology-aware feature construction.

Scalability is not only an implementation detail. Batch, disk-resident, distributed, and streaming settings create different algorithmic constraints. A method that needs repeated random access to all data may be unsuitable for a stream, even if it is mathematically correct.
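
A one-pass summary illustrates the streaming constraint. The sketch below uses Welford's online mean/variance update: each value is folded into the summary once and then discarded, which is exactly what a batch method with repeated random access does not require.

```python
# One-pass sketch: Welford's online mean and (population) variance.
# A stream algorithm may touch each element exactly once.
def stream_summary(values):
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:              # single forward pass, no random access
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0
    return n, mean, variance

# Works on an iterator, i.e. on data that cannot be revisited.
n, mean, var = stream_summary(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(n, mean, var)  # mean 5.0, variance 4.0 (up to float rounding)
```

Stream-mining synopses such as sketches and microclusters generalize this idea: keep a small summary that can be updated incrementally.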

Evaluation depends on the task. Clustering can be judged by compactness, separation, stability, or external labels. Classification can be judged by accuracy, precision, recall, cost, calibration, or ranking quality. Outlier detection often needs top-k review or delayed labels. Association rules need interest measures beyond raw support and confidence.
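
A small invented illustration of the metric-mismatch problem: on imbalanced labels, a model that always predicts the majority class scores high accuracy while having zero precision and recall on the class that matters.

```python
# Hypothetical imbalanced test set: 95 negatives, 5 positives,
# and a degenerate model that always predicts "negative".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

# High accuracy, but the model never finds a positive case.
print(accuracy, precision, recall)  # 0.95 0.0 0.0
```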

Visual

| Data family | Typical representation | Common mining tasks |
| --- | --- | --- |
| Tabular numeric/categorical | Matrix, one-hot table | Classification, clustering, outliers |
| Transactions | Sets or baskets | Frequent itemsets, association rules |
| Text | Sparse TF-IDF matrix, topics | Search, clustering, classification |
| Time series | Ordered numeric windows | Forecasting, motifs, anomalies |
| Discrete sequences | Ordered symbols | Sequential patterns, HMMs, classification |
| Spatial/trajectory | Coordinates, paths, regions | Spatial clusters, route patterns, local outliers |
| Graphs/social networks | Nodes, edges, attributes | Communities, link prediction, graph classification |
| Streams | Samples, sketches, microclusters | Online counts, drift-aware models, alerts |

Worked example 1: Choosing a mining task

Problem. A site records user sessions:

U1: home -> productA -> cart -> checkout
U2: home -> productB -> productA -> exit
U3: home -> productA -> cart -> exit

Choose three different data mining formulations.

Method.

  1. Association pattern mining:

    • Convert each session into a set of visited pages.
    • U1 becomes {home, productA, cart, checkout}.
    • Mine patterns such as {productA, cart}.
  2. Sequence mining:

    • Preserve order.
    • Mine ordered patterns such as (home, productA, cart).
    • This distinguishes productA before cart from cart before productA.
  3. Classification:

    • Label sessions by whether checkout occurred.
    • U1 has label 1; U2 and U3 have label 0.
    • Features may include pages visited, sequence length, time on product pages, or whether cart was reached.

Checked answer. The same raw data can support at least three tasks. The correct formulation depends on the question: co-occurrence, ordered navigation, or checkout prediction.
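
The first and third formulations can be sketched directly on the sessions above. The counting below is brute force rather than a full frequent-pattern algorithm, and the session data is copied from the example.

```python
from itertools import combinations
from collections import Counter

# The three sessions from the example, in visit order.
sessions = {
    "U1": ["home", "productA", "cart", "checkout"],
    "U2": ["home", "productB", "productA", "exit"],
    "U3": ["home", "productA", "cart", "exit"],
}

# Association view: drop order, turn each session into a set of pages,
# then count how many transactions contain each page pair.
transactions = [frozenset(pages) for pages in sessions.values()]
pair_support = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_support[pair] += 1

print(pair_support[("cart", "productA")])  # in U1 and U3, so support 2

# Classification view: label each session by whether checkout occurred.
labels = {u: int("checkout" in pages) for u, pages in sessions.items()}
print(labels)  # {'U1': 1, 'U2': 0, 'U3': 0}
```

The sequence view would instead keep the ordered lists and mine ordered subsequences, which brute-force pair counting deliberately throws away.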

Worked example 2: Core building blocks on one toy table

Problem. Consider four customers:

| customer | age | visits | bought |
| --- | --- | --- | --- |
| A | 20 | 2 | no |
| B | 22 | 3 | no |
| C | 55 | 9 | yes |
| D | 57 | 10 | yes |

Show how clustering, classification, and outlier detection would view the table.

Method.

  1. Clustering:

    • Use features age and visits.
    • A and B are close: age differs by 2 and visits by 1.
    • C and D are close for the same reason.
    • A likely clustering is {A,B} and {C,D}.
  2. Classification:

    • Use bought as the label.
    • A and B are negative; C and D are positive.
    • A simple rule could be: if visits ≥ 9, predict yes.
  3. Outlier detection:

    • With only these four points, no point is extremely isolated.
    • If a new customer E had age 21 and visits 50, E would be unusual because visits is far outside nearby young customers.
  4. Association mining:

    • If numeric features are discretized, one might mine patterns such as {high_visits} -> {bought_yes}.

Checked answer. The table does not have one inherent "mining result." Each task asks a different question and may require different preprocessing.

Code

Pseudocode for selecting a data mining formulation:

INPUT: raw data source S, business or scientific question Q
OUTPUT: mining-ready task definition

identify the object to be predicted, grouped, ranked, or summarized
identify available raw fields and dependency structure
if Q asks for co-occurrence:
    build transactions and mine patterns
else if Q asks for groups:
    define similarity and cluster objects
else if Q asks for unusual cases:
    define normality and score outliers
else if Q asks for prediction:
    define labels, features, and evaluation metrics
validate that the representation preserves the needed information

A runnable Python sketch applying the clustering and classification views to the toy table from worked example 2:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy customer table from worked example 2.
df = pd.DataFrame(
    {
        "customer": ["A", "B", "C", "D"],
        "age": [20, 22, 55, 57],
        "visits": [2, 3, 9, 10],
        "bought": [0, 0, 1, 1],
    }
)

X = df[["age", "visits"]]

# Unsupervised view: group customers into two clusters on (age, visits).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised view: learn a shallow decision rule that predicts "bought".
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, df["bought"])

print(dict(zip(df["customer"], clusters)))
print(export_text(tree, feature_names=["age", "visits"]))
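
The outlier view from worked example 2, which the snippet above does not cover, can be sketched with a simple nearest-neighbor distance score. Customer E and its feature values are the hypothetical ones from that example.

```python
import numpy as np

# Toy customers A-D from worked example 2, plus the hypothetical
# customer E (young, with very many visits).
points = {
    "A": (20, 2), "B": (22, 3), "C": (55, 9), "D": (57, 10),
    "E": (21, 50),
}
names = list(points)
X = np.array([points[n] for n in names], dtype=float)

# Pairwise Euclidean distances in (age, visits) space.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)   # ignore each point's distance to itself

# Outlier score: distance to the nearest other point (1-NN distance).
scores = dists.min(axis=1)

for name, s in sorted(zip(names, scores), key=lambda t: -t[1]):
    print(name, round(float(s), 2))
# E has by far the largest score: nothing lies near it.
```

Raw (age, visits) coordinates are used here for simplicity; in practice the features would be standardized first, as the data preparation chapter discusses.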

Common pitfalls

  • Treating data mining as only algorithm selection rather than problem formulation plus preparation.
  • Using the same representation for every data type.
  • Forgetting that association, clustering, outlier detection, and classification answer different questions.
  • Evaluating a model with a metric that does not match the cost or scientific goal.
  • Ignoring scalability until after choosing an algorithm that cannot run on the available data.
  • Treating privacy as an afterthought when data integration has already increased reidentification risk.
  • Reading advanced application chapters without first understanding similarity, preparation, and the core building blocks.

Connections