Feature Engineering
Why this module exists
Models are only as good as their features. Feature engineering transforms raw data into representations that make learning easier. This is where domain knowledge meets ML—often making the difference between a mediocre model and a great one.
1) What is feature engineering?
The process of:
- Selecting relevant features
- Creating new features from existing ones
- Transforming features for better model performance
- Encoding features in suitable formats
Raw data → Engineered features → Model
2) Feature scaling: Why it matters
Many algorithms are sensitive to feature scales:
- Gradient descent converges faster
- Distance-based methods (KNN, SVM) work correctly
- Regularization affects features equally
Without scaling, a feature in millions dominates one in decimals.
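As a toy illustration (made-up numbers, not from a real dataset), a Euclidean distance between two points is dominated entirely by the larger-scale feature:

import math

# Two samples: (annual income in dollars, rating in [0, 1])
a = (120_000, 0.2)
b = (121_000, 0.9)

# The income gap (1000) swamps the rating gap (0.7)
dist = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(dist)  # ~1000.0; the rating barely moves the distance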
3) Min-Max normalization
Scale to [0, 1] range:
def min_max_normalize(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

# For a dataset
def fit_transform(X):
    mins = [min(col) for col in zip(*X)]
    maxs = [max(col) for col in zip(*X)]
    normalized = []
    for row in X:
        new_row = [(row[i] - mins[i]) / (maxs[i] - mins[i])
                   for i in range(len(row))]
        normalized.append(new_row)
    return normalized, mins, maxs
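For example, applied to a tiny two-column dataset (toy values chosen so the output is easy to verify):

X = [[1.0, 200.0],
     [2.0, 400.0],
     [3.0, 600.0]]
normalized, mins, maxs = fit_transform(X)
# normalized: [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
# mins: [1.0, 200.0], maxs: [3.0, 600.0]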
Use when:
- You need bounded values (e.g., neural network inputs)
- Distribution is relatively uniform
4) Standardization (Z-score)
Transform to mean=0, std=1:
from statistics import mean, pstdev

def standardize(x, mean_val, std_val):
    return (x - mean_val) / std_val

# For a dataset
def fit_transform(X):
    means = [mean(col) for col in zip(*X)]
    stds = [pstdev(col) for col in zip(*X)]  # population standard deviation
    standardized = []
    for row in X:
        new_row = [(row[i] - means[i]) / stds[i]
                   for i in range(len(row))]
        standardized.append(new_row)
    return standardized, means, stds
Use when:
- Features have different units
- Algorithm assumes normally distributed features
5) Handling categorical features
Categorical features (colors, countries, categories) need encoding.
Label encoding
Map categories to integers:
colors = ["red", "blue", "green"]
encoding = {"red": 0, "blue": 1, "green": 2}
Problem: Implies ordering (green > blue > red?)
One-hot encoding
Create binary column per category:
def one_hot_encode(value, categories):
    return [1 if value == cat else 0 for cat in categories]
# "blue" with categories ["red", "blue", "green"]
# → [0, 1, 0]
No false ordering, but increases dimensionality.
6) One-hot encoding implementation
def one_hot_encoder(column):
    # Get unique categories
    categories = list(set(column))
    categories.sort()  # Consistent ordering
    # Encode each value
    encoded = []
    for value in column:
        row = [1 if value == cat else 0 for cat in categories]
        encoded.append(row)
    return encoded, categories

# Usage
data = ["cat", "dog", "cat", "bird", "dog"]
encoded, cats = one_hot_encoder(data)
# cats: ["bird", "cat", "dog"]
# encoded: [[0,1,0], [0,0,1], [0,1,0], [1,0,0], [0,0,1]]
7) Handling missing values
Missing data is everywhere. Strategies:
Drop: Remove rows with missing values
clean_data = [row for row in data if None not in row]
Problem: Loses data
Impute with statistics:
from statistics import mean, mode

# Fill with mean (numeric)
mean_val = mean([x for x in column if x is not None])
filled = [x if x is not None else mean_val for x in column]

# Fill with mode (categorical)
mode_val = mode([x for x in column if x is not None])
filled = [x if x is not None else mode_val for x in column]
Indicator variable: Add a column indicating "was missing"
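A minimal sketch of combining mean imputation with a "was missing" indicator (assuming a numeric column that uses None for missing entries):

from statistics import mean

column = [4.0, None, 7.0, None, 10.0]

mean_val = mean(x for x in column if x is not None)
filled = [x if x is not None else mean_val for x in column]
was_missing = [1 if x is None else 0 for x in column]

# filled:      [4.0, 7.0, 7.0, 7.0, 10.0]
# was_missing: [0,   1,   0,   1,   0]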
8) Feature creation
Create new features from existing ones:
Polynomial features:
# From [x1, x2], create [x1, x2, x1², x2², x1×x2]
def polynomial_features(x, degree=2):
    # Note: only degree 2 is implemented here
    features = list(x)
    # Squared terms
    for i in range(len(x)):
        features.append(x[i] ** 2)
    # Pairwise interaction terms
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            features.append(x[i] * x[j])
    return features
Domain-specific:
- From (birth_date, current_date) → age
- From (price, quantity) → total
- From (latitude, longitude) → distance to city center
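A small sketch of the first two ideas (the record fields and dates here are illustrative):

from datetime import date

record = {"birth_date": date(1990, 5, 1), "price": 19.99, "quantity": 3}
today = date(2024, 1, 1)

# Age in whole years (approximate; ignores leap-day edge cases)
age = (today - record["birth_date"]).days // 365

# Total spend from unit price and quantity
total = record["price"] * record["quantity"]
# age -> 33, total -> ~59.97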
9) Log transformation
For skewed distributions (long tail):
import math

def log_transform(x):
    return math.log(x + 1)  # +1 to handle zeros
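For instance (a quick check using the function above), the transform compresses a long tail of counts into a much narrower range:

for x in [0, 9, 99, 9_999, 999_999]:
    print(x, round(log_transform(x), 2))
# 0 0.0
# 9 2.3
# 99 4.61
# 9999 9.21
# 999999 13.82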
Effects:
- Compresses large values
- Spreads small values
- Makes multiplicative relationships additive
Common for: income, prices, counts
10) Binning (discretization)
Convert continuous to categorical:
def bin_age(age):
    if age < 18:
        return "child"
    elif age < 65:
        return "adult"
    else:
        return "senior"
Use when:
- The relationship with the target is step-wise, not smooth
- You want to reduce noise from small fluctuations
- You want a linear model to capture different effects per range
11) Feature selection
Remove irrelevant or redundant features:
Filter methods: Score features with statistical tests (see the correlation sketch after this list)
- Correlation with target
- Mutual information
- Chi-squared test
Wrapper methods: Try subsets
- Forward selection: Add features one by one
- Backward elimination: Remove features one by one
Embedded methods: Built into training
- L1 regularization (Lasso)
- Tree-based feature importance
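As an illustration of a simple filter method (a sketch only, assuming numeric features and no constant columns), keep the columns whose absolute Pearson correlation with the target exceeds a threshold:

from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def select_by_correlation(X, y, threshold=0.3):
    # X is a list of rows; return indices of columns to keep
    columns = list(zip(*X))
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, y)) >= threshold]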
12) Dealing with high cardinality
Categories with many values (e.g., zip codes):
Frequency encoding: Replace with count/frequency
from collections import Counter

freq = Counter(column)
encoded = [freq[val] for val in column]
Target encoding: Replace with mean of target
target_mean = {val: mean(t for v, t in zip(column, target) if v == val)
               for val in set(column)}
Careful: Can leak target information (use with cross-validation)
Hash encoding: Hash to fixed number of buckets
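A minimal hash-encoding sketch (the bucket count is arbitrary; md5 is used only because Python's built-in hash is not stable across runs):

import hashlib

def hash_encode(value, n_buckets=16):
    # Deterministically map any category string to one of n_buckets indices
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

encoded = [hash_encode(z) for z in ["90210", "10001", "60614"]]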
13) Practical workflow
- Explore: Distributions, missing values, correlations
- Clean: Handle missing values, outliers
- Transform: Scale numeric, encode categorical
- Create: Domain features, interactions
- Select: Remove redundant/irrelevant features
- Validate: Check with model performance
Iterate! Feature engineering is experimental.
Key takeaways
- Feature engineering bridges raw data and models
- Scale features: min-max or standardization
- Encode categoricals: one-hot for nominal, ordinal for ordered
- Handle missing values: drop, impute, or indicate
- Create features: polynomials, logs, domain knowledge
- Select features: remove redundant, keep informative
- Iterate based on model performance