Descriptive Statistics

Lesson, slides, and applied problem sets.


Lesson

Descriptive Statistics

Why this module exists

Before you can build models, you need to understand your data. Descriptive statistics provide the vocabulary and tools to summarize datasets: their center, spread, and shape. These concepts also underpin feature engineering, normalization, and understanding model behavior.


1) Measures of central tendency

These tell you where the "middle" of your data is.

Mean (average)

Sum of values divided by count:

mean = sum(values) / len(values)

Properties:

  • Sensitive to outliers
  • Minimizes squared distances

Median

The middle value when sorted. For even count, average the two middle values.

Properties:

  • Robust to outliers
  • Good for skewed distributions

Mode

The most frequently occurring value. A dataset can have more than one mode.
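
A minimal sketch of the three measures using Python's built-in statistics module (the values are made up for illustration):

import statistics

values = [2, 3, 3, 5, 8, 100]          # hypothetical data; 100 acts as an outlier

print(statistics.mean(values))          # pulled upward by the outlier (about 20.2)
print(statistics.median(values))        # robust: 4.0, the average of the two middle values
print(statistics.multimode(values))     # all most-frequent values, here [3]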


2) Measures of spread

These tell you how spread out your data is.

Range

range = max(values) - min(values)

Simple but sensitive to outliers.

Variance

Average squared distance from the mean (here n = len(values)):

variance = sum((x - mean)² for x in values) / n

Sample variance uses (n-1) to be unbiased:

sample_variance = sum((x - mean)² for x in values) / (n - 1)

Standard Deviation

Square root of variance. Same units as the data:

std = sqrt(variance)

Intuition: "typical" distance from the mean.
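
The spread formulas above, written out in plain Python (the sample values are invented):

import math

values = [4, 8, 6, 5, 3, 7]            # hypothetical data
n = len(values)
mean = sum(values) / n

data_range = max(values) - min(values)

# Population variance divides by n; sample variance divides by (n - 1).
variance = sum((x - mean) ** 2 for x in values) / n
sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)

std = math.sqrt(variance)              # same units as the data

print(data_range, variance, sample_variance, std)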


3) Percentiles and quartiles

Percentile: the value below which a given percentage of the data falls.

  • 50th percentile = median
  • 25th percentile = Q1 (first quartile)
  • 75th percentile = Q3 (third quartile)

Interquartile Range (IQR): Q3 - Q1 (computed in the sketch below)

  • Robust measure of spread
  • Used for outlier detection
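
A small sketch of quartiles and the IQR with statistics.quantiles (the data and the "inclusive" method are illustrative choices; other methods give slightly different cut points):

import statistics

values = [1, 2, 4, 4, 5, 6, 7, 8, 9, 12, 15]     # hypothetical data

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)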

4) The normal distribution

Many natural phenomena follow the Gaussian/normal distribution:

  • Bell-shaped, symmetric
  • Characterized by mean μ and std σ
  • 68% of data within 1σ of mean
  • 95% within 2σ
  • 99.7% within 3σ

p(x) = (1/√(2πσ²)) * exp(-(x-μ)²/(2σ²))

The Central Limit Theorem: the distribution of sample averages approaches a normal distribution as the sample size grows, regardless of the original distribution.
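
A brief sketch of the density formula plus an empirical check of the 68% rule on simulated draws (the sample size and seed are arbitrary):

import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    # p(x) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0.0))                 # peak of the standard normal, about 0.399

random.seed(0)
draws = [random.gauss(0, 1) for _ in range(100_000)]
within_1sigma = sum(abs(x) <= 1 for x in draws) / len(draws)
print(within_1sigma)                   # roughly 0.68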


5) Skewness and kurtosis

Skewness: Measure of asymmetry

  • Positive skew: long right tail (mean > median)
  • Negative skew: long left tail (mean < median)
  • Zero skew: symmetric

Kurtosis: Measure of tail heaviness

  • High kurtosis: heavy tails, sharp peak
  • Low kurtosis: light tails, flat peak
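
A rough sketch of moment-based skewness and excess kurtosis (these are the plain population formulas; libraries such as SciPy also offer bias-corrected versions):

def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n    # variance
    m3 = sum((x - mean) ** 3 for x in values) / n    # third central moment
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n    # fourth central moment
    return m4 / m2 ** 2 - 3                          # 0 for a normal distribution

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]             # hypothetical data with a long right tail
print(skewness(right_skewed))                        # positive, consistent with mean > median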

6) Correlation

Measures the strength and direction of the linear relationship between two variables.

Pearson correlation coefficient (r):

r = cov(X, Y) / (std(X) * std(Y))

Properties:

  • Range: [-1, 1]
  • +1: perfect positive linear relationship
  • -1: perfect negative linear relationship
  • 0: no linear relationship (could still be nonlinear!)

Covariance:

cov(X, Y) = mean((X - mean_X) * (Y - mean_Y))
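
A compact sketch that implements the covariance and Pearson r formulas directly (the two series are invented):

import math

def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    std_x = math.sqrt(covariance(xs, xs))            # cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]                                 # roughly increases with xs
print(pearson_r(xs, ys))                             # strongly positive, about 0.85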

7) Covariance matrix

For multiple variables, the covariance matrix captures all pairwise covariances:

Σ[i,j] = cov(variable_i, variable_j)

Properties:

  • Symmetric (Σ = Σᵀ)
  • Diagonal elements are variances
  • Off-diagonal elements are covariances

Used in:

  • PCA
  • Gaussian distributions
  • Mahalanobis distance
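
A minimal sketch with NumPy, assuming each row of the array is one observation and each column is one variable (the numbers are invented):

import numpy as np

# 5 observations of 3 variables (rows = observations, columns = variables).
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.2, 0.7],
    [4.0, 7.9, 0.2],
    [5.0, 9.8, 0.6],
])

# rowvar=False tells NumPy that variables live in columns.
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix.shape)                        # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))   # True: symmetric
print(np.diag(cov_matrix))                     # per-variable (sample) variances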

8) Z-scores (standardization)

Transform data to have mean 0 and std 1:

z = (x - mean) / std

Why standardize?

  • Compare values from different distributions
  • Many ML algorithms are sensitive to feature scale and work better with standardized input
  • Removes scale differences between features

A z-score tells you how many standard deviations a value is from the mean.
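
A short sketch of standardizing a single feature by hand (values are made up; in practice a library such as scikit-learn's StandardScaler does the same job):

import math

values = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]   # hypothetical feature

mean = sum(values) / len(values)
std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

z_scores = [(x - mean) / std for x in values]

print(sum(z_scores) / len(z_scores))   # approximately 0 after standardization
print(z_scores[0])                     # how many stds the first value sits below the mean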


9) Outliers

Values that are unusually far from the rest.

Detection methods (see the sketch at the end of this section):

  • Z-score: |z| > 3 (3 standard deviations)
  • IQR: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR

What to do:

  • Investigate: Are they errors or real?
  • Remove: If errors or corrupted data
  • Transform: Log transform can reduce impact
  • Keep: If they're real and informative
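
A sketch of both detection rules (the data is invented; the 3-sigma and 1.5×IQR thresholds are the conventional choices listed above):

import statistics

# Hypothetical data: 30 ordinary values plus one extreme point.
values = list(range(10, 40)) + [200]

# Z-score rule: flag points more than 3 standard deviations from the mean.
mean = statistics.mean(values)
std = statistics.pstdev(values)                      # population standard deviation
z_outliers = [x for x in values if abs((x - mean) / std) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in values if x < low or x > high]

print(z_outliers, iqr_outliers)                      # both rules flag 200 here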

10) Descriptive statistics in practice

When you get a new dataset:

  1. Compute basic stats: mean, median, std, min, max for each feature
  2. Check for missing values: Count NaN/null
  3. Look at distributions: Histograms, box plots
  4. Check correlations: Correlation matrix, scatter plots
  5. Identify outliers: Z-scores, IQR method
  6. Understand data types: Numeric vs categorical

This exploratory data analysis (EDA) should happen before any modeling.
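
A quick sketch of that checklist with pandas, assuming the dataset lives in a CSV file (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("dataset.csv")           # placeholder path

print(df.dtypes)                          # numeric vs categorical columns
print(df.describe())                      # count, mean, std, min, quartiles, max
print(df.isna().sum())                    # missing values per column

numeric = df.select_dtypes(include="number")
print(numeric.corr())                     # Pearson correlation matrix

# Simple IQR-based outlier count per numeric column.
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outliers.sum())                     # outlier count per column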


Key takeaways

  • Mean, median, mode describe center; use median for skewed data
  • Variance and std describe spread; std has interpretable units
  • Correlation measures linear relationship only (-1 to +1)
  • Standardization (z-scores) enables fair comparison across scales
  • Always explore data before modeling: distributions, outliers, missing values
  • The normal distribution is everywhere; understand its properties
