Descriptive Statistics
Lesson, slides, and applied problem sets.
Why this module exists
Before you can build models, you need to understand your data. Descriptive statistics provide the vocabulary and tools to summarize datasets: their center, spread, and shape. These concepts also underpin feature engineering, normalization, and understanding model behavior.
1) Measures of central tendency
These tell you where the "middle" of your data is.
Mean (average)
Sum of values divided by count:
mean = sum(values) / len(values)
Properties:
- Sensitive to outliers
- Minimizes the sum of squared distances to the data points
Median
The middle value of the sorted data. For an even count, average the two middle values.
Properties:
- Robust to outliers
- Good for skewed distributions
Mode
The most frequently occurring value. A dataset can have more than one mode.
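As a small illustration of the three measures above, here is a minimal sketch using Python's standard `statistics` module. The data is made up for the example and includes one large outlier to show how differently the mean and median react.

```python
import statistics

# Hypothetical data: five small values and one large outlier.
values = [2, 3, 3, 5, 7, 100]

mean = sum(values) / len(values)        # 120 / 6 = 20.0, pulled up by the outlier
median = statistics.median(values)      # average of the middle pair (3, 5) = 4.0
modes = statistics.multimode(values)    # [3], the only value that appears twice

# A dataset with two equally frequent values has two modes.
two_modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean, median, modes, two_modes)
```

The outlier moves the mean to 20 while the median stays at 4, which is why the median is usually the safer summary for skewed data.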
2) Measures of spread
These tell you how spread out your data is.
Range
range = max - min
Simple but sensitive to outliers.
Variance
Average squared distance from the mean:
variance = sum((x - mean) ** 2 for x in values) / n
Sample variance divides by (n - 1) instead of n, which makes it an unbiased estimate of the population variance:
sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)
Standard Deviation
Square root of variance. Same units as the data:
std = sqrt(variance)
Intuition: "typical" distance from the mean.
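The formulas in this section can be worked through on a small dataset. This is a sketch with made-up numbers, chosen so that the arithmetic is easy to follow by hand.

```python
import math

# Hypothetical data chosen so the results are round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n                                    # 40 / 8 = 5.0
squared_deviations = [(x - mean) ** 2 for x in values]    # sums to 32.0

value_range = max(values) - min(values)                   # 9 - 2 = 7
variance = sum(squared_deviations) / n                    # 32 / 8 = 4.0
sample_variance = sum(squared_deviations) / (n - 1)       # 32 / 7, about 4.57
std = math.sqrt(variance)                                 # 2.0, same units as the data

print(value_range, variance, sample_variance, std)
```

The sample variance is always slightly larger than the population variance for the same data, and the gap shrinks as n grows.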
3) Percentiles and quartiles
Percentile: The p-th percentile is the value below which p% of the data falls.
- 50th percentile = median
- 25th percentile = Q1 (first quartile)
- 75th percentile = Q3 (third quartile)
Interquartile Range (IQR): Q3 - Q1
- Robust measure of spread
- Used for outlier detection
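Percentiles can be computed in several slightly different ways. The sketch below assumes one common convention (linear interpolation between the two nearest sorted values); the helper name `percentile` is illustrative, and other conventions can give slightly different answers on small datasets.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100       # fractional index into the sorted data
    lower = int(rank)                         # index at or just below the rank
    upper = min(lower + 1, len(ordered) - 1)  # next index, clamped to the last element
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


values = [9, 1, 8, 2, 7, 3, 6, 4, 5]   # hypothetical, unsorted on purpose

q1 = percentile(values, 25)       # 3.0
median = percentile(values, 50)   # 5.0
q3 = percentile(values, 75)       # 7.0
iqr = q3 - q1                     # 4.0

print(q1, median, q3, iqr)
```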
4) The normal distribution
Many natural phenomena approximately follow the Gaussian (normal) distribution:
- Bell-shaped, symmetric
- Characterized by mean μ and std σ
- About 68% of data within 1σ of the mean
- About 95% within 2σ
- About 99.7% within 3σ
p(x) = (1/√(2πσ²)) * exp(-(x-μ)²/(2σ²))
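The density formula and the 68/95/99.7 figures can be checked numerically. This sketch writes the density by hand (the function name `normal_pdf` is illustrative) and uses `statistics.NormalDist` from the standard library to compute the probability mass within k standard deviations.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std sigma at x."""
    coefficient = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


standard = NormalDist(mu=0.0, sigma=1.0)


def mass_within(k):
    """Probability that a standard normal value lies within k sigma of the mean."""
    return standard.cdf(k) - standard.cdf(-k)


within_1 = mass_within(1)   # about 0.6827
within_2 = mass_within(2)   # about 0.9545
within_3 = mass_within(3)   # about 0.9973

print(normal_pdf(0.0), within_1, within_2, within_3)
```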
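The Central Limit Theorem: The average of many independent samples drawn from the same distribution tends toward a normal distribution as the sample size grows, regardless of the original distribution's shape (provided its variance is finite).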
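A quick simulation can make the Central Limit Theorem concrete. The sketch below assumes an exponential distribution as the (strongly skewed) source and averages 50 draws at a time; the sample sizes and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

SAMPLE_SIZE = 50      # observations averaged into each sample mean
NUM_SAMPLES = 2000    # number of sample means collected

# Exponential(1) is right-skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

center = statistics.fmean(sample_means)       # should be close to 1
spread = statistics.pstdev(sample_means)      # should be close to 1 / sqrt(50)
expected_spread = 1.0 / math.sqrt(SAMPLE_SIZE)

# If the means are roughly normal, about 68% lie within one spread of the center.
within_one = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(center, spread, expected_spread, within_one)
```

Even though the individual draws are far from bell-shaped, the sample means cluster symmetrically around 1 with a spread near 1/√50.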
5) Skewness and kurtosis
Skewness: Measure of asymmetry
- Positive skew: long right tail (typically mean > median)
- Negative skew: long left tail (typically mean < median)
- Zero skew: symmetric
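Kurtosis: Measure of tail heaviness (the normal distribution has kurtosis 3)
- High kurtosis: heavy tails, more extreme values; often, though not always, a sharper peak
- Low kurtosis: light tails, fewer extreme values; often a flatter peak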
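Both quantities can be computed as standardized moments: skewness is the average cubed z-score and kurtosis is the average fourth-power z-score. The sketch below assumes the simple population (divide by n) versions; libraries often apply small-sample corrections or report excess kurtosis (kurtosis minus 3), so their numbers may differ. The function names are illustrative.

```python
def standardized_moment(values, k):
    """Average of ((x - mean) / std) ** k, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    if std == 0:
        raise ValueError("all values are identical; the moment is undefined")
    return sum(((x - mean) / std) ** k for x in values) / n


def skewness(values):
    return standardized_moment(values, 3)


def kurtosis(values):
    return standardized_moment(values, 4)   # the normal distribution gives 3


symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with a long right tail

print(skewness(symmetric))      # approximately 0
print(kurtosis(symmetric))      # approximately 1.7, lighter tails than a normal
print(skewness(right_skewed))   # positive
```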
6) Correlation
Measures the strength and direction of the linear relationship between two variables.
Pearson correlation coefficient (r):
r = cov(X, Y) / (std(X) * std(Y))
Properties:
- Range: [-1, 1]
- +1: perfect positive linear relationship
- -1: perfect negative linear relationship
- 0: no linear relationship (could still be nonlinear!)
Covariance:
cov(X, Y) = mean((X - mean_X) * (Y - mean_Y))
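The two formulas above translate almost directly into code. This is a minimal sketch with illustrative function names and made-up data; the last example shows a variable that depends entirely on x yet has zero correlation because the relationship is not linear.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x, mean_y = mean(xs), mean(ys)
    return mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])


def std(values):
    return covariance(values, values) ** 0.5   # variance is cov(X, X)


def pearson_r(xs, ys):
    return covariance(xs, ys) / (std(xs) * std(ys))


xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]         # perfect positive linear relationship
ys_down = [-x for x in xs]              # perfect negative linear relationship
ys_curve = [(x - 3) ** 2 for x in xs]   # fully determined by x, but not linearly

print(pearson_r(xs, ys_up))      # approximately 1
print(pearson_r(xs, ys_down))    # approximately -1
print(pearson_r(xs, ys_curve))   # 0: no linear relationship
```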
7) Covariance matrix
For multiple variables, the covariance matrix captures all pairwise covariances:
Σ[i,j] = cov(variable_i, variable_j)
Properties:
- Symmetric (Σ = Σᵀ)
- Diagonal elements are variances
- Off-diagonal elements are covariances
Used in:
- PCA
- Gaussian distributions
- Mahalanobis distance
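To see the listed properties on a concrete matrix, here is a small pure-Python sketch. The function name `covariance_matrix` and the three variables are hypothetical, and the population (divide by n) covariance is assumed to match the formula used in the previous section.

```python
def covariance_matrix(columns):
    """Population covariance matrix for a list of equal-length columns."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    # Entry [i][j] is the mean product of deviations of variable i and variable j.
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
        for ci in centered
    ]


# Hypothetical measurements for four people.
height = [150, 160, 170, 180]
weight = [50, 60, 65, 80]
age = [40, 30, 35, 25]

sigma = covariance_matrix([height, weight, age])

for row in sigma:
    print(row)
# sigma[0][0] is the variance of height (125.0);
# sigma[0][1] equals sigma[1][0], the covariance of height and weight.
```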
8) Z-scores (standardization)
Transform data to have mean 0 and std 1:
z = (x - mean) / std
Why standardize?
- Compare values from different distributions
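- Many ML algorithms (for example k-nearest neighbors, SVMs, and gradient-based methods) are sensitive to feature scale and work better on standardized input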
- Removes scale differences between features
A z-score tells you how many standard deviations a value lies from the mean.
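A minimal sketch of standardization, assuming the population standard deviation and made-up height data. The function name `standardize` is illustrative.

```python
def standardize(values):
    """Return z-scores: (x - mean) / std, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    if std == 0:
        raise ValueError("cannot standardize constant data (std is 0)")
    return [(x - mean) / std for x in values]


heights_cm = [150, 160, 170, 180, 190]   # hypothetical data
z_scores = standardize(heights_cm)

print(z_scores)   # roughly [-1.41, -0.71, 0.0, 0.71, 1.41]

# After standardizing, the mean is 0 and the std is 1 (up to rounding).
z_mean = sum(z_scores) / len(z_scores)
z_std = (sum((z - z_mean) ** 2 for z in z_scores) / len(z_scores)) ** 0.5
print(z_mean, z_std)
```

Converting the same heights to meters first would give exactly the same z-scores, which is what "removes scale differences" means in practice.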
9) Outliers
Values that are unusually far from the rest.
Detection methods:
- Z-score: |z| > 3 (3 standard deviations)
- IQR: x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
What to do:
- Investigate: Are they errors or real?
- Remove: If errors or corrupted data
- Transform: Log transform can reduce impact
- Keep: If they're real and informative
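The two detection rules from this section can be sketched as follows. The data, thresholds as defaults, and function names are illustrative, and quartiles are computed with `statistics.quantiles` using its "inclusive" method, which is one of several conventions.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population std assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical readings: 19 values between 10 and 12, plus one suspicious 50.
readings = [10, 11, 12] * 6 + [11, 50]

print(zscore_outliers(readings))   # [50]
print(iqr_outliers(readings))      # [50]
```

Note that an outlier also inflates the mean and standard deviation it is judged against, so in very small samples the z-score rule may fail to flag even an extreme value; the IQR rule is less affected.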
10) Descriptive statistics in practice
When you get a new dataset:
- Compute basic stats: mean, median, std, min, max for each feature
- Check for missing values: Count NaN/null
- Look at distributions: Histograms, box plots
- Check correlations: Correlation matrix, scatter plots
- Identify outliers: Z-scores, IQR method
- Understand data types: Numeric vs categorical
This exploratory data analysis (EDA) should happen before any modeling.
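As a rough sketch of the first few checklist items, the function below summarizes one column at a time using only the standard library. The dataset, the use of `None` for missing values, and the `summarize` helper are assumptions made for this example.

```python
import statistics


def summarize(column):
    """Summarize one column given as a list; None marks a missing value."""
    present = [v for v in column if v is not None]
    summary = {"count": len(column), "missing": len(column) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            kind="numeric",
            mean=statistics.fmean(present),
            median=statistics.median(present),
            std=statistics.pstdev(present),
            min=min(present),
            max=max(present),
        )
    else:
        # Non-numeric columns get their most frequent value(s) instead.
        summary.update(kind="categorical", mode=statistics.multimode(present))
    return summary


# Hypothetical dataset stored as a dict of columns.
dataset = {
    "age": [34, 29, None, 41, 38],
    "city": ["Oslo", "Lima", "Oslo", None, "Pune"],
}

report = {name: summarize(column) for name, column in dataset.items()}
for name, summary in report.items():
    print(name, summary)
```

In practice a dataframe library covers most of this in one call (for example pandas' `DataFrame.describe()` and `DataFrame.isna()`), but the hand-written version shows what those summaries contain.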
Key takeaways
- Mean, median, mode describe center; use median for skewed data
- Variance and std describe spread; std has interpretable units
- Correlation measures linear relationship only (-1 to +1)
- Standardization (z-scores) enables fair comparison across scales
- Always explore data before modeling: distributions, outliers, missing values
- The normal distribution appears often in practice; know its properties, especially the 68/95/99.7 rule