Data Preparation

Data preparation is the part of data mining that converts raw data into a representation algorithms can actually use. Aggarwal gives it a central role: logs, documents, images, transactions, streams, and graphs rarely arrive as clean rows in a matrix. The analyst must extract features, repair or remove errors, integrate sources, normalize scales, and sometimes convert one data type into another.

This topic sits directly after the mining pipeline because preparation is where many modeling choices become irreversible. A missing value can be imputed, marked, or removed; a numeric value can be standardized or discretized; a categorical value can be one-hot encoded or grouped; a time series can remain ordered or be summarized into statistics. Each choice changes which patterns are visible to later mining algorithms.

Definitions

Feature extraction maps raw objects into variables useful for mining. A document may become a vector of term weights; a time series may become trend, variance, and seasonal features; a graph may become degree, clustering coefficient, or subgraph-count features.

Data type portability is the conversion of one data representation into another so algorithms designed for one type can be used on another. Examples include text to numeric vectors, categories to binary indicators, time series to symbolic sequences, graphs to numeric structural descriptors, or arbitrary objects to a similarity graph.

Data cleaning repairs or controls problems such as missing entries, impossible values, duplicated records, inconsistent units, inconsistent category names, and noisy measurements.

Normalization changes the scale of numeric features. Common methods include min-max scaling,

x' = \frac{x - \min(x)}{\max(x) - \min(x)},

and z-score standardization,

z = \frac{x - \mu}{\sigma}.
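
Both rescalings are one-liners in NumPy. A sketch on a small sample (the values are arbitrary):

```python
import numpy as np

x = np.array([20.0, 40.0, 40.0, 80.0])

# Min-max scaling: maps the observed range onto [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit population standard deviation.
z = (x - x.mean()) / x.std()

print(minmax)  # [0.0, 0.333..., 0.333..., 1.0]
print(z)       # mean 0, std 1
```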

Discretization maps numeric values into intervals. Equal-width discretization uses intervals of the same size; equal-depth discretization chooses intervals with roughly equal counts.
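
In pandas the two schemes correspond to `pd.cut` (equal-width) and `pd.qcut` (equal-depth). A sketch with one extreme value shows the difference:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: four intervals of equal size; the outlier stretches the range,
# so seven of the eight values land in the first bin.
width = pd.cut(values, bins=4)

# Equal-depth: four intervals with roughly equal counts (two values each).
depth = pd.qcut(values, q=4)

print(width.value_counts().sort_index())
print(depth.value_counts().sort_index())
```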

Binarization converts a categorical attribute with m categories into m binary indicators, or sometimes m-1 indicators when a baseline is omitted.
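
In pandas, `get_dummies` produces the m indicators, and `drop_first=True` produces the m-1 variant:

```python
import pandas as pd

regions = pd.DataFrame({"region": ["East", "West", "West", "East"]})

# m indicators: one binary column per category.
full = pd.get_dummies(regions, columns=["region"])

# m-1 indicators: the first category becomes the implicit baseline.
reduced = pd.get_dummies(regions, columns=["region"], drop_first=True)

print(full.columns.tolist())     # ['region_East', 'region_West']
print(reduced.columns.tolist())  # ['region_West']
```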

Sampling selects a subset of records to reduce cost. In a stream, reservoir sampling maintains a sample of fixed size while items arrive one at a time.

Key results

Cleaning is task dependent. A value that looks like an outlier may be an error in a sales forecasting task but the most important case in a fraud-detection task. Preparation must respect the mining objective.

Imputation should preserve uncertainty where possible. Mean imputation is simple but reduces variance and can create artificial clusters. Model-based or neighborhood-based imputation may preserve local structure better, but it can also leak label information if fitted incorrectly.
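
The variance shrinkage from mean imputation is easy to see in a small simulation (the distribution, missingness rate, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
complete = rng.normal(loc=50, scale=10, size=1000)

# Hide about 30% of the values completely at random, then fill them
# with the mean of the observed values.
mask = rng.random(1000) < 0.3
imputed = complete.copy()
imputed[mask] = complete[~mask].mean()

# The imputed column is noticeably less spread out than the true one.
print(complete.std(), imputed.std())
```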

Normalization is required when feature magnitude is not inherently meaningful. Distance-based algorithms are especially sensitive. If income is measured in dollars and age in years, raw Euclidean distance mostly measures income. Standardization gives both features comparable influence, provided that such influence is justified.
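
A two-row sketch with made-up numbers makes the dominance concrete:

```python
import numpy as np

# Two customers as [age, income]: ages differ by 60, incomes by 1000.
a = np.array([20.0, 30000.0])
b = np.array([80.0, 31000.0])

raw = np.linalg.norm(a - b)
print(raw)  # ~1001.8: the income axis contributes almost everything

# Standardize each feature across the two rows, then re-measure.
pair = np.vstack([a, b])
Z = (pair - pair.mean(axis=0)) / pair.std(axis=0)
scaled = np.linalg.norm(Z[0] - Z[1])
print(scaled)  # ~2.83: both features now contribute equally
```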

Reservoir sampling gives an unbiased fixed-size sample of a stream. For a stream item at position t and reservoir size k, keep the item with probability k/t. If kept, replace one existing reservoir element uniformly at random. Proof sketch by induction: after t items, each item has probability k/t of being in the reservoir. An existing item was in the reservoir with probability k/(t-1) and survives step t unless the new item is both kept (probability k/t) and placed in its slot (probability 1/k), so its probability becomes (k/(t-1)) \cdot (1 - (k/t)(1/k)) = (k/(t-1))(1 - 1/t) = k/t; the new item itself is kept with probability k/t.

Type transformations create tradeoffs. Discretization can make noisy continuous values more robust and useful for pattern mining, but it loses order detail within each bin. One-hot encoding makes categories usable for linear models, but high-cardinality categorical variables can create sparse, high-dimensional data.

Visual

| Problem | Typical symptom | Preparation response | Risk |
| --- | --- | --- | --- |
| Missing values | Blank, null, not measured | Impute, add missing indicator, or filter | Bias if missingness is informative |
| Inconsistent units | kg mixed with pounds | Convert to common unit | Silent errors if unit is unknown |
| Category spelling | "NY", "New York" | Standardize category dictionary | Over-merging distinct values |
| Different scales | Age vs. income | Standardize or normalize | Removing meaningful magnitude |
| High cardinality | Thousands of IDs | Hashing, grouping, embeddings | Collisions or lost detail |
| Stream volume | Cannot store all rows | Reservoir sample or sketch | Approximation error |

Worked example 1: Cleaning and scaling a customer table

Problem. Prepare this table for distance-based clustering:

| customer | age | income | region |
| --- | --- | --- | --- |
| 1 | 20 | 30000 | East |
| 2 | 40 | 90000 | E |
| 3 | missing | 60000 | West |
| 4 | 80 | 1200000 | West |

Method.

  1. Standardize categories. Treat E as East if metadata confirms it.

  2. Handle missing age. Use a simple median for this example. Observed ages are 20, 40, 80, so the median is 40. Customer 3 gets age 40. Add a missing-age indicator if the missingness may be informative.

  3. Examine income. 1,200,000 is extreme relative to the others. It may be a valid high-income customer, not automatically an error. For a robust distance example, use log income:

\log_{10}(30000) = 4.477, \quad \log_{10}(90000) = 4.954, \quad \log_{10}(60000) = 4.778, \quad \log_{10}(1200000) = 6.079.
  4. Standardize numeric columns. For age values (20, 40, 40, 80), the mean is 45 and the population standard deviation is

\sqrt{\frac{(20-45)^2 + (40-45)^2 + (40-45)^2 + (80-45)^2}{4}} = \sqrt{475} = 21.794.

Customer 1's age z-score is (20 - 45)/21.794 = -1.147.

  5. One-hot encode region:

| customer | East | West |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |

Checked answer. A usable row for customer 1 contains standardized age, standardized log-income, and region indicators. The extreme income is controlled by log transformation, while the original value can still be kept for audit.

Worked example 2: Reservoir sampling by hand

Problem. Maintain a reservoir of size k = 3 for a stream of eight items:

A, B, C, D, E, F, G, H.

Assume the first three items fill the reservoir. Use the following random choices:

| Item position | Item | Keep? | Replacement index |
| --- | --- | --- | --- |
| 4 | D | yes | 2 |
| 5 | E | no | - |
| 6 | F | yes | 1 |
| 7 | G | no | - |
| 8 | H | yes | 3 |

Method.

  1. After the first three items, the reservoir is [A, B, C].
  2. At t = 4, keep D with probability 3/4. The table says yes. Replace index 2: [A, D, C].
  3. At t = 5, keep E with probability 3/5. The table says no: [A, D, C].
  4. At t = 6, keep F with probability 3/6. The table says yes. Replace index 1: [F, D, C].
  5. At t = 7, keep G with probability 3/7. The table says no: [F, D, C].
  6. At t = 8, keep H with probability 3/8. The table says yes. Replace index 3: [F, D, H].

Checked answer. The final reservoir for this random run is [F, D, H]. Over many random runs, each of the eight items has probability 3/8 of appearing in the final reservoir.
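
A minimal Python implementation of the same procedure (the function name and default seed are our choices):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(42)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)              # first k items fill the reservoir
        elif rng.random() < k / t:              # keep item t with probability k/t
            reservoir[rng.randrange(k)] = item  # evict a uniformly chosen victim
    return reservoir

print(reservoir_sample("ABCDEFGH", k=3))  # three of the eight letters
```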

Code

Pseudocode for robust preparation:

```
INPUT: table D, schema S, task T
OUTPUT: transformed matrix X

validate each column against schema S
standardize category labels and units
for each feature:
    if missing values exist:
        choose imputation rule using training data only
        optionally add missingness indicator
    if numeric and scale-sensitive task:
        apply task-appropriate scaling
    if categorical and model needs numeric input:
        encode categories
if data volume is too large:
    sample, sketch, or aggregate
return transformed feature matrix
```

The same preparation applied to the worked example, using pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame(
    {
        "age": [20, 40, np.nan, 80],
        "income": [30000, 90000, 60000, 1200000],
        "region": ["East", "E", "West", "West"],
    }
)
df["region"] = df["region"].replace({"E": "East"})  # standardize categories
df["log_income"] = np.log10(df["income"])           # control the extreme income

numeric = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]
)

prep = ColumnTransformer(
    [
        ("numeric", numeric, ["age", "log_income"]),
        ("region", OneHotEncoder(sparse_output=False), ["region"]),
    ]
)

X = prep.fit_transform(df)
print(np.round(X, 3))
```

Common pitfalls

  • Imputing missing values before train-test splitting, which leaks test-set information into the model.
  • Normalizing identifiers as if their numeric order has meaning.
  • Removing rare categories that are rare precisely because they are important, such as rare fraud codes.
  • Applying one-hot encoding independently to train and test sets, producing incompatible columns.
  • Treating outliers as errors without checking the task; anomaly detection needs unusual cases.
  • Using a sample that is convenient rather than representative, especially when streams or logs have time-of-day patterns.
  • Discretizing continuous features too aggressively and erasing useful thresholds.
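
To avoid the first and fourth pitfalls, fit every preparation step on the training data only and reuse the fitted objects on the test data. A minimal scikit-learn sketch with toy numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

# fit learns the training mean (2.0); transform reuses it everywhere,
# so no test-set statistic leaks into the preparation.
imputer = SimpleImputer(strategy="mean").fit(train)
filled_test = imputer.transform(test)
print(filled_test)  # [[2.0], [100.0]]
```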

Connections