Feature Engineering
Why this module exists
Models are only as good as their features. Feature engineering transforms raw data into representations that make learning easier. This is where domain knowledge meets ML—often making the difference between a mediocre model and a great one.
1) What is feature engineering?
The process of:
- Selecting relevant features
- Creating new features from existing ones
- Transforming features for better model performance
- Encoding features in suitable formats
Raw data → Engineered features → Model
2) Feature scaling: Why it matters
Many algorithms are sensitive to feature scales:
- Gradient descent converges faster
- Distance-based methods (KNN, SVM) work correctly
- Regularization affects features equally
Without scaling, a feature in millions dominates one in decimals.
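As a toy illustration (made-up numbers, not from a real dataset), a Euclidean distance between two points is dominated entirely by the larger-scale feature:

import math

# Two samples: (annual income in dollars, rating in [0, 1])
a = (120_000, 0.2)
b = (121_000, 0.9)

# The income gap (1000) swamps the rating gap (0.7)
dist = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(dist)  # ~1000.0; the rating barely moves the distance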
3) Min-Max normalization
Scale to [0, 1] range:
def min_max_normalize(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

# For a dataset
def fit_transform(X):
    mins = [min(col) for col in zip(*X)]
    maxs = [max(col) for col in zip(*X)]
    normalized = []
    for row in X:
        new_row = [(row[i] - mins[i]) / (maxs[i] - mins[i])
                   for i in range(len(row))]
        normalized.append(new_row)
    return normalized, mins, maxs
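For example, applied to a tiny two-column dataset (toy values chosen so the output is easy to verify):

X = [[1.0, 200.0],
     [2.0, 400.0],
     [3.0, 600.0]]
normalized, mins, maxs = fit_transform(X)
# normalized: [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
# mins: [1.0, 200.0], maxs: [3.0, 600.0]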
Use when:
- You need bounded values (e.g., neural network inputs)
- Distribution is relatively uniform
4) Standardization (Z-score)
Transform to mean=0, std=1:
from statistics import mean, pstdev

def standardize(x, mean_val, std_val):
    return (x - mean_val) / std_val

# For a dataset
def fit_transform(X):
    means = [mean(col) for col in zip(*X)]
    stds = [pstdev(col) for col in zip(*X)]  # population standard deviation
    standardized = []
    for row in X:
        new_row = [(row[i] - means[i]) / stds[i]
                   for i in range(len(row))]
        standardized.append(new_row)
    return standardized, means, stds
Use when:
- Features have different units
- Algorithm assumes normally distributed features
5) Handling categorical features
Categorical features (colors, countries, categories) need encoding.
Label encoding
Map categories to integers:
colors = ["red", "blue", "green"]
encoding = {"red": 0, "blue": 1, "green": 2}
Problem: Implies ordering (green > blue > red?)
One-hot encoding
Create binary column per category:
def one_hot_encode(value, categories):
    return [1 if value == cat else 0 for cat in categories]
# "blue" with categories ["red", "blue", "green"]
# → [0, 1, 0]
No false ordering, but increases dimensionality.
6) One-hot encoding implementation
def one_hot_encoder(column):
    # Get unique categories
    categories = list(set(column))
    categories.sort()  # Consistent ordering
    # Encode each value
    encoded = []
    for value in column:
        row = [1 if value == cat else 0 for cat in categories]
        encoded.append(row)
    return encoded, categories

# Usage
data = ["cat", "dog", "cat", "bird", "dog"]
encoded, cats = one_hot_encoder(data)
# cats: ["bird", "cat", "dog"]
# encoded: [[0,1,0], [0,0,1], [0,1,0], [1,0,0], [0,0,1]]
7) Handling missing values
Missing data is everywhere. Strategies:
Drop: Remove rows with missing values
clean_data = [row for row in data if None not in row]
Problem: Loses data
Impute with statistics:
from statistics import mean, mode

# Fill with mean (numeric)
mean_val = mean([x for x in column if x is not None])
filled = [x if x is not None else mean_val for x in column]

# Fill with mode (categorical)
mode_val = mode([x for x in column if x is not None])
filled = [x if x is not None else mode_val for x in column]
Indicator variable: Add a column indicating "was missing"
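A minimal sketch of combining mean imputation with a "was missing" indicator (assuming a numeric column that uses None for missing entries):

from statistics import mean

column = [4.0, None, 7.0, None, 10.0]

mean_val = mean(x for x in column if x is not None)
filled = [x if x is not None else mean_val for x in column]
was_missing = [1 if x is None else 0 for x in column]

# filled:      [4.0, 7.0, 7.0, 7.0, 10.0]
# was_missing: [0,   1,   0,   1,   0]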
8) Feature creation
Create new features from existing ones:
Polynomial features:
# From [x1, x2], create [x1, x2, x1², x2², x1×x2]
def polynomial_features(x, degree=2):
    # Note: only degree 2 is implemented here
    features = list(x)
    # Squared terms
    for i in range(len(x)):
        features.append(x[i] ** 2)
    # Pairwise interaction terms
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            features.append(x[i] * x[j])
    return features
Domain-specific:
- From (birth_date, current_date) → age
- From (price, quantity) → total
- From (latitude, longitude) → distance to city center
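A small sketch of the first two ideas (the record fields and dates here are illustrative):

from datetime import date

record = {"birth_date": date(1990, 5, 1), "price": 19.99, "quantity": 3}
today = date(2024, 1, 1)

# Age in whole years (approximate; ignores leap-day edge cases)
age = (today - record["birth_date"]).days // 365

# Total spend from unit price and quantity
total = record["price"] * record["quantity"]
# age -> 33, total -> ~59.97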
9) Log transformation
For skewed distributions (long tail):
import math

def log_transform(x):
    return math.log(x + 1)  # +1 to handle zeros
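For instance (a quick check using the function above), the transform compresses a long tail of counts into a much narrower range:

for x in [0, 9, 99, 9_999, 999_999]:
    print(x, round(log_transform(x), 2))
# 0 0.0
# 9 2.3
# 99 4.61
# 9999 9.21
# 999999 13.82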
Effects:
- Compresses large values
- Spreads small values
- Makes multiplicative relationships additive
Common for: income, prices, counts
10) Binning (discretization)
Convert continuous to categorical:
def bin_age(age):
    if age < 18:
        return "child"
    elif age < 65:
        return "adult"
    else:
        return "senior"
Use when:
- The relationship with the target is step-wise, not smooth
- You want to reduce noise from small fluctuations
- You want a linear model to capture different effects per range
11) Feature selection
Remove irrelevant or redundant features:
Filter methods: Score features with statistical tests (see the correlation sketch after this list)
- Correlation with target
- Mutual information
- Chi-squared test
Wrapper methods: Try subsets
- Forward selection: Add features one by one
- Backward elimination: Remove features one by one
Embedded methods: Built into training
- L1 regularization (Lasso)
- Tree-based feature importance
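As an illustration of a simple filter method (a sketch only, assuming numeric features and no constant columns), keep the columns whose absolute Pearson correlation with the target exceeds a threshold:

from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def select_by_correlation(X, y, threshold=0.3):
    # X is a list of rows; return indices of columns to keep
    columns = list(zip(*X))
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, y)) >= threshold]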
12) Dealing with high cardinality
Categories with many values (e.g., zip codes):
Frequency encoding: Replace with count/frequency
from collections import Counter

freq = Counter(column)
encoded = [freq[val] for val in column]
Target encoding: Replace with mean of target
target_mean = {val: mean(t for v, t in zip(column, target) if v == val)
               for val in set(column)}
Careful: Can leak target information (use with cross-validation)
Hash encoding: Hash to fixed number of buckets
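A minimal hash-encoding sketch (the bucket count is arbitrary; md5 is used only because Python's built-in hash is not stable across runs):

import hashlib

def hash_encode(value, n_buckets=16):
    # Deterministically map any category string to one of n_buckets indices
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

encoded = [hash_encode(z) for z in ["90210", "10001", "60614"]]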
13) Practical workflow
- Explore: Distributions, missing values, correlations
- Clean: Handle missing values, outliers
- Transform: Scale numeric, encode categorical
- Create: Domain features, interactions
- Select: Remove redundant/irrelevant features
- Validate: Check with model performance
Iterate! Feature engineering is experimental.
Key takeaways
- Feature engineering bridges raw data and models
- Scale features: min-max or standardization
- Encode categoricals: one-hot for nominal, ordinal for ordered
- Handle missing values: drop, impute, or indicate
- Create features: polynomials, logs, domain knowledge
- Select features: remove redundant, keep informative
- Iterate based on model performance