Descriptive Statistics

Lesson, slides, and applied problem sets.


Lesson

Descriptive Statistics

Why this module exists

Before you can build models, you need to understand your data. Descriptive statistics provide the vocabulary and tools to summarize datasets: their center, spread, and shape. These concepts also underpin feature engineering, normalization, and understanding model behavior.


1) Measures of central tendency

These tell you where the "middle" of your data is.

Mean (average)

Sum of values divided by count:

mean = sum(values) / len(values)

Properties:

  • Sensitive to outliers
  • Minimizes squared distances

Median

The middle value when sorted. For even count, average the two middle values.

Properties:

  • Robust to outliers
  • Good for skewed distributions

Mode

The most frequently occurring value. A dataset can have more than one mode.
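
A minimal sketch of the three measures using Python's built-in statistics module (the values are made up for illustration):

import statistics

values = [2, 3, 3, 5, 8, 100]          # hypothetical data; 100 acts as an outlier

print(statistics.mean(values))          # pulled upward by the outlier (about 20.2)
print(statistics.median(values))        # robust: 4.0, the average of the two middle values
print(statistics.multimode(values))     # all most-frequent values, here [3]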


2) Measures of spread

These tell you how spread out your data is.

Range

range = max(values) - min(values)

Simple but sensitive to outliers.

Variance

Average squared distance from the mean (here n = len(values)):

variance = sum((x - mean)² for x in values) / n

Sample variance uses (n-1) to be unbiased:

sample_variance = sum((x - mean)² for x in values) / (n - 1)

Standard Deviation

Square root of variance. Same units as the data:

std = sqrt(variance)

Intuition: "typical" distance from the mean.
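
The spread formulas above, written out in plain Python (the sample values are invented):

import math

values = [4, 8, 6, 5, 3, 7]            # hypothetical data
n = len(values)
mean = sum(values) / n

data_range = max(values) - min(values)

# Population variance divides by n; sample variance divides by (n - 1).
variance = sum((x - mean) ** 2 for x in values) / n
sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)

std = math.sqrt(variance)              # same units as the data

print(data_range, variance, sample_variance, std)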


3) Percentiles and quartiles

Percentile: the value below which a given percentage of the data falls.

  • 50th percentile = median
  • 25th percentile = Q1 (first quartile)
  • 75th percentile = Q3 (third quartile)

Interquartile Range (IQR): Q3 - Q1 (computed in the sketch below)

  • Robust measure of spread
  • Used for outlier detection
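
A small sketch of quartiles and the IQR with statistics.quantiles (the data and the "inclusive" method are illustrative choices; other methods give slightly different cut points):

import statistics

values = [1, 2, 4, 4, 5, 6, 7, 8, 9, 12, 15]     # hypothetical data

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)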

4) The normal distribution

Many natural phenomena follow the Gaussian/normal distribution:

  • Bell-shaped, symmetric
  • Characterized by mean μ and std σ
  • 68% of data within 1σ of mean
  • 95% within 2σ
  • 99.7% within 3σ

p(x) = (1/√(2πσ²)) * exp(-(x-μ)²/(2σ²))

The Central Limit Theorem: the distribution of sample averages approaches a normal distribution as the sample size grows, regardless of the original distribution.
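
A brief sketch of the density formula plus an empirical check of the 68% rule on simulated draws (the sample size and seed are arbitrary):

import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    # p(x) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0.0))                 # peak of the standard normal, about 0.399

random.seed(0)
draws = [random.gauss(0, 1) for _ in range(100_000)]
within_1sigma = sum(abs(x) <= 1 for x in draws) / len(draws)
print(within_1sigma)                   # roughly 0.68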


5) Skewness and kurtosis

Skewness: Measure of asymmetry

  • Positive skew: long right tail (mean > median)
  • Negative skew: long left tail (mean < median)
  • Zero skew: symmetric

Kurtosis: Measure of tail heaviness

  • High kurtosis: heavy tails, sharp peak
  • Low kurtosis: light tails, flat peak
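
A rough sketch of moment-based skewness and excess kurtosis (these are the plain population formulas; libraries such as SciPy also offer bias-corrected versions):

def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n    # variance
    m3 = sum((x - mean) ** 3 for x in values) / n    # third central moment
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n    # fourth central moment
    return m4 / m2 ** 2 - 3                          # 0 for a normal distribution

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]             # hypothetical data with a long right tail
print(skewness(right_skewed))                        # positive, consistent with mean > median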

6) Correlation

Measures the strength and direction of the linear relationship between two variables.

Pearson correlation coefficient (r):

r = cov(X, Y) / (std(X) * std(Y))

Properties:

  • Range: [-1, 1]
  • +1: perfect positive linear relationship
  • -1: perfect negative linear relationship
  • 0: no linear relationship (could still be nonlinear!)

Covariance:

cov(X, Y) = mean((X - mean_X) * (Y - mean_Y))
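
A compact sketch that implements the covariance and Pearson r formulas directly (the two series are invented):

import math

def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    std_x = math.sqrt(covariance(xs, xs))            # cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]                                 # roughly increases with xs
print(pearson_r(xs, ys))                             # strongly positive, about 0.85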

7) Covariance matrix

For multiple variables, the covariance matrix captures all pairwise covariances:

Σ[i,j] = cov(variable_i, variable_j)

Properties:

  • Symmetric (Σ = Σᵀ)
  • Diagonal elements are variances
  • Off-diagonal elements are covariances

Used in:

  • PCA
  • Gaussian distributions
  • Mahalanobis distance
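
A minimal sketch with NumPy, assuming each row of the array is one observation and each column is one variable (the numbers are invented):

import numpy as np

# 5 observations of 3 variables (rows = observations, columns = variables).
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.2, 0.7],
    [4.0, 7.9, 0.2],
    [5.0, 9.8, 0.6],
])

# rowvar=False tells NumPy that variables live in columns.
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix.shape)                        # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))   # True: symmetric
print(np.diag(cov_matrix))                     # per-variable (sample) variances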

8) Z-scores (standardization)

Transform data to have mean 0 and std 1:

z = (x - mean) / std

Why standardize?

  • Compare values from different distributions
  • Many ML algorithms are sensitive to feature scale and work better with standardized input
  • Removes scale differences between features

A z-score tells you how many standard deviations a value is from the mean.
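
A short sketch of standardizing a single feature by hand (values are made up; in practice a library such as scikit-learn's StandardScaler does the same job):

import math

values = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]   # hypothetical feature

mean = sum(values) / len(values)
std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

z_scores = [(x - mean) / std for x in values]

print(sum(z_scores) / len(z_scores))   # approximately 0 after standardization
print(z_scores[0])                     # how many stds the first value sits below the mean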


9) Outliers

Values that are unusually far from the rest.

Detection methods (see the sketch at the end of this section):

  • Z-score: |z| > 3 (3 standard deviations)
  • IQR: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR

What to do:

  • Investigate: Are they errors or real?
  • Remove: If errors or corrupted data
  • Transform: Log transform can reduce impact
  • Keep: If they're real and informative
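
A sketch of both detection rules (the data is invented; the 3-sigma and 1.5×IQR thresholds are the conventional choices listed above):

import statistics

# Hypothetical data: 30 ordinary values plus one extreme point.
values = list(range(10, 40)) + [200]

# Z-score rule: flag points more than 3 standard deviations from the mean.
mean = statistics.mean(values)
std = statistics.pstdev(values)                      # population standard deviation
z_outliers = [x for x in values if abs((x - mean) / std) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in values if x < low or x > high]

print(z_outliers, iqr_outliers)                      # both rules flag 200 here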

10) Descriptive statistics in practice

When you get a new dataset:

  1. Compute basic stats: mean, median, std, min, max for each feature
  2. Check for missing values: Count NaN/null
  3. Look at distributions: Histograms, box plots
  4. Check correlations: Correlation matrix, scatter plots
  5. Identify outliers: Z-scores, IQR method
  6. Understand data types: Numeric vs categorical

This exploratory data analysis (EDA) should happen before any modeling.
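
A quick sketch of that checklist with pandas, assuming the dataset lives in a CSV file (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("dataset.csv")           # placeholder path

print(df.dtypes)                          # numeric vs categorical columns
print(df.describe())                      # count, mean, std, min, quartiles, max
print(df.isna().sum())                    # missing values per column

numeric = df.select_dtypes(include="number")
print(numeric.corr())                     # Pearson correlation matrix

# Simple IQR-based outlier count per numeric column.
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outliers.sum())                     # outlier count per column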


Key takeaways

  • Mean, median, mode describe center; use median for skewed data
  • Variance and std describe spread; std has interpretable units
  • Correlation measures linear relationship only (-1 to +1)
  • Standardization (z-scores) enables fair comparison across scales
  • Always explore data before modeling: distributions, outliers, missing values
  • The normal distribution is everywhere; understand its properties
