Feature Engineering

Lesson, slides, and applied problem sets.

Lesson

Feature Engineering

Why this module exists

Models are only as good as their features. Feature engineering transforms raw data into representations that make learning easier. This is where domain knowledge meets ML—often making the difference between a mediocre model and a great one.


1) What is feature engineering?

The process of:

  • Selecting relevant features
  • Creating new features from existing ones
  • Transforming features for better model performance
  • Encoding features in suitable formats

Raw data → Engineered features → Model


2) Feature scaling: Why it matters

Many algorithms are sensitive to feature scales:

  • Gradient descent converges faster
  • Distance-based methods (KNN, SVM) work correctly
  • Regularization affects features equally

Without scaling, a feature in millions dominates one in decimals.
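
A minimal sketch of the problem, using made-up age and income values: the raw income gap swamps the age gap in a Euclidean distance, but once both features are scaled to [0, 1] (with assumed ranges) each one contributes.

import math

# Two customers: (age in years, income in dollars)
a = (25, 50_000)
b = (60, 52_000)

# Unscaled: the $2,000 income gap dwarfs the 35-year age gap
unscaled = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)   # ≈ 2000.3

# Scaled to [0, 1] with assumed ranges (age 18-80, income 20k-200k)
a_s = ((25 - 18) / 62, (50_000 - 20_000) / 180_000)
b_s = ((60 - 18) / 62, (52_000 - 20_000) / 180_000)
scaled = math.sqrt((a_s[0] - b_s[0]) ** 2 + (a_s[1] - b_s[1]) ** 2)  # ≈ 0.56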


3) Min-Max normalization

Scale to [0, 1] range:

def min_max_normalize(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

# For a dataset: learn per-column mins/maxs, then scale every row
def fit_transform(X):
    mins = [min(col) for col in zip(*X)]
    maxs = [max(col) for col in zip(*X)]
    normalized = []
    for row in X:
        # Note: a constant column (max == min) would divide by zero
        new_row = [(row[i] - mins[i]) / (maxs[i] - mins[i])
                   for i in range(len(row))]
        normalized.append(new_row)
    return normalized, mins, maxs

Use when:

  • You need bounded values (e.g., neural network inputs)
  • Distribution is relatively uniform
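
In practice the mins and maxs are computed on the training data and reused to scale new data, so train and test end up on the same scale. A minimal sketch, assuming the fit_transform above plus a hypothetical transform helper:

def transform(X_new, mins, maxs):
    # Reuse training-set mins/maxs; values outside the training range
    # will fall outside [0, 1]
    return [[(row[i] - mins[i]) / (maxs[i] - mins[i])
             for i in range(len(row))]
            for row in X_new]

X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
X_test = [[2.5, 150.0]]

X_train_norm, mins, maxs = fit_transform(X_train)
X_test_norm = transform(X_test, mins, maxs)   # [[0.75, 0.25]]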

4) Standardization (Z-score)

Transform to mean=0, std=1:

from statistics import mean, pstdev

def standardize(x, mean_val, std_val):
    return (x - mean_val) / std_val

# For a dataset: learn per-column means and standard deviations, then scale
def fit_transform(X):
    means = [mean(col) for col in zip(*X)]
    stds = [pstdev(col) for col in zip(*X)]
    standardized = []
    for row in X:
        # Note: a constant column (std of 0) would divide by zero
        new_row = [(row[i] - means[i]) / stds[i]
                   for i in range(len(row))]
        standardized.append(new_row)
    return standardized, means, stds

Use when:

  • Features have different units
  • Algorithm assumes normally distributed features

5) Handling categorical features

Categorical features (colors, countries, categories) need encoding.

Label encoding

Map categories to integers:

colors = ["red", "blue", "green"]
encoding = {"red": 0, "blue": 1, "green": 2}

Problem: Implies ordering (green > blue > red?)

One-hot encoding

Create binary column per category:

def one_hot_encode(value, categories):
    return [1 if value == cat else 0 for cat in categories]

# "blue" with categories ["red", "blue", "green"]
# → [0, 1, 0]

No false ordering, but increases dimensionality.


6) One-hot encoding implementation

def one_hot_encoder(column):
    # Get unique categories
    categories = list(set(column))
    categories.sort()  # Consistent ordering

    # Encode each value
    encoded = []
    for value in column:
        row = [1 if value == cat else 0 for cat in categories]
        encoded.append(row)

    return encoded, categories

# Usage
data = ["cat", "dog", "cat", "bird", "dog"]
encoded, cats = one_hot_encoder(data)
# encoded: [[0,1,0], [0,0,1], [0,1,0], [1,0,0], [0,0,1]]
# cats: ["bird", "cat", "dog"]

7) Handling missing values

Missing data is everywhere. Strategies:

Drop: Remove rows with missing values

clean_data = [row for row in data if None not in row]

Problem: Loses data

Impute with statistics:

from statistics import mean
from collections import Counter

# Fill with mean (numeric)
mean_val = mean([x for x in column if x is not None])
filled = [x if x is not None else mean_val for x in column]

# Fill with mode (categorical)
mode_val = Counter([x for x in column if x is not None]).most_common(1)[0][0]

Indicator variable: Add a column indicating "was missing"
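
A small sketch combining mean imputation with a "was missing" indicator column (the data is illustrative):

from statistics import mean

def impute_with_indicator(column):
    # 1 where the original value was missing, else 0
    missing = [1 if x is None else 0 for x in column]
    fill = mean([x for x in column if x is not None])
    filled = [x if x is not None else fill for x in column]
    return filled, missing

ages = [34, None, 52, None, 41]
ages_filled, ages_missing = impute_with_indicator(ages)
# ages_filled:  [34, 42.33..., 52, 42.33..., 41]
# ages_missing: [0, 1, 0, 1, 0]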


8) Feature creation

Create new features from existing ones:

Polynomial features:

# From [x1, x2], create [x1, x2, x1², x2², x1×x2]
def polynomial_features(x):
    # Degree-2 expansion: the originals, their squares, then pairwise products
    features = list(x)
    for i in range(len(x)):
        features.append(x[i] ** 2)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            features.append(x[i] * x[j])
    return features
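
For example, for x = [2, 3] the expansion is the originals, their squares, and the cross term:

print(polynomial_features([2, 3]))   # [2, 3, 4, 9, 6]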

Domain-specific:

  • From (birth_date, current_date) → age
  • From (price, quantity) → total
  • From (latitude, longitude) → distance to city center
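
A sketch of the first two of these (the date handling and field names are assumptions, not tied to any particular dataset):

from datetime import date

def age_in_years(birth_date, current_date):
    # Subtract one if the birthday hasn't happened yet this year
    had_birthday = (current_date.month, current_date.day) >= (birth_date.month, birth_date.day)
    return current_date.year - birth_date.year - (0 if had_birthday else 1)

def order_total(price, quantity):
    return price * quantity

print(age_in_years(date(1990, 6, 15), date(2024, 3, 1)))   # 33
print(order_total(25, 4))                                   # 100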

9) Log transformation

For skewed distributions (long tail):

import math
def log_transform(x):
    return math.log(x + 1)  # +1 to handle zeros

Effects:

  • Compresses large values
  • Spreads small values
  • Makes multiplicative relationships additive

Common for: income, prices, counts
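
A quick demonstration of the compression, applying log_transform above to a long-tailed set of counts:

values = [0, 9, 99, 999, 9999]
print([round(log_transform(v), 2) for v in values])
# [0.0, 2.3, 4.61, 6.91, 9.21]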


10) Binning (discretization)

Convert continuous to categorical:

def bin_age(age):
    if age < 18:
        return "child"
    elif age < 65:
        return "adult"
    else:
        return "senior"

Use when:

  • Relationship is step-wise, not continuous
  • Reduces noise
  • Enables interactions in linear models
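
A small sketch that bins ages and then one-hot encodes the bins (reusing bin_age and one_hot_encode from earlier), which is how binned features are typically fed to a linear model:

ages = [8, 35, 71]
bins = ["child", "adult", "senior"]
binned = [bin_age(a) for a in ages]                   # ["child", "adult", "senior"]
encoded = [one_hot_encode(b, bins) for b in binned]   # [[1,0,0], [0,1,0], [0,0,1]]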

11) Feature selection

Remove irrelevant or redundant features:

Filter methods: Statistical tests

  • Correlation with target
  • Mutual information
  • Chi-squared test

Wrapper methods: Try subsets

  • Forward selection: Add features one by one
  • Backward elimination: Remove features one by one

Embedded methods: Built into training

  • L1 regularization (Lasso)
  • Tree-based feature importance
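
A minimal filter-method sketch: rank features by absolute Pearson correlation with the target and keep the top k (the data and k here are illustrative):

from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def select_top_k(X, y, k):
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    # Indices of the k columns most correlated with the target
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

X = [[1, 10, 5], [2, 9, 1], [3, 12, 8], [4, 11, 2]]
y = [1.1, 2.0, 3.2, 3.9]
print(select_top_k(X, y, k=2))   # [0, 1]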

12) Dealing with high cardinality

Categories with many values (e.g., zip codes):

Frequency encoding: Replace with count/frequency

from collections import Counter

freq = Counter(column)                  # value → count
encoded = [freq[val] for val in column]

Target encoding: Replace with mean of target

target_mean = {val: mean([t for c, t in zip(column, target) if c == val])
               for val in set(column)}

Careful: this can leak target information; compute the per-category means only on training folds inside cross-validation.

Hash encoding: Hash to fixed number of buckets
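
A minimal hash-encoding sketch (the bucket count is arbitrary; a stable hash such as MD5 is used so the same value lands in the same bucket across runs):

import hashlib

def hash_encode(value, n_buckets=16):
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

zip_codes = ["94103", "10001", "60614", "94103"]
print([hash_encode(z) for z in zip_codes])   # repeated "94103" maps to the same bucket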


13) Practical workflow

  1. Explore: Distributions, missing values, correlations
  2. Clean: Handle missing values, outliers
  3. Transform: Scale numeric, encode categorical
  4. Create: Domain features, interactions
  5. Select: Remove redundant/irrelevant features
  6. Validate: Check with model performance

Iterate! Feature engineering is experimental.


Key takeaways

  • Feature engineering bridges raw data and models
  • Scale features: min-max or standardization
  • Encode categoricals: one-hot for nominal, ordinal for ordered
  • Handle missing values: drop, impute, or indicate
  • Create features: polynomials, logs, domain knowledge
  • Select features: remove redundant, keep informative
  • Iterate based on model performance
