Allocator Optimization

Lesson, slides, and applied problem sets.


Lesson

Block Splitting and Allocator Optimization

When you allocate 50 bytes from a 1000-byte free block, 950 bytes are wasted. Block splitting solves this. This module also covers advanced topics: debugging with gdb, mmap for large allocations, and thread safety.

The Problem: Internal Fragmentation

Request: malloc(50)
Found:   [Block: 1000 bytes free]
Result:  [Block: 1000 bytes USED, but only 50 needed]
         950 bytes wasted inside the block!

This is internal fragmentation - waste inside allocated blocks.

The Solution: Split the Block

Before:
[Header | 1000 bytes free                              ]

After malloc(50) with splitting (the 50-byte request is aligned up to 56):
[Header | 56 used ][New Header | remainder free       ]

The remainder is the original 1000 bytes minus the aligned request (56) minus the size of the new header.

Now future allocations can use the remaining space.
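The snippets in this lesson assume a block header and an align8 helper roughly like the following. The exact field layout is an assumption, but it matches the fields (size, free, magic, next) the code below relies on:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed header layout -- mirrors the size/free/magic/next fields
   used by the splitting code in this lesson. */
typedef struct block_header {
    size_t size;               /* payload size in bytes */
    int free;                  /* 1 = free, 0 = in use */
    uint32_t magic;            /* sanity value for corruption checks */
    struct block_header *next; /* next block in the list */
} block_header_t;

/* Round a request up to the next multiple of 8 for alignment. */
static inline size_t align8(size_t n) {
    return (n + 7) & ~(size_t)7;
}
```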

When to Split

Only split if the remainder is useful:

#define BLOCK_MAGIC 0xDEADBEEF
#define MIN_BLOCK_SIZE (sizeof(block_header_t) + 16)

int can_split(block_header_t *block, size_t size) {
    size_t aligned = align8(size);
    if (block->size < aligned) return 0;

    size_t remaining = block->size - aligned;
    return remaining >= MIN_BLOCK_SIZE;
}

Don't split if remaining space is tiny - you'd just create an unusable fragment.

Splitting Implementation

block_header_t *split_block(block_header_t *block, size_t size) {
    size_t aligned = align8(size);

    if (!can_split(block, size)) {
        return NULL;  // No split
    }

    // Find where new block starts
    block_header_t *new_block = (block_header_t*)(
        (char*)block + sizeof(block_header_t) + aligned
    );

    // Initialize new block with magic for validation
    new_block->size = block->size - aligned - sizeof(block_header_t);
    new_block->free = 1;
    new_block->magic = BLOCK_MAGIC;
    new_block->next = block->next;

    // Update original block
    block->size = aligned;
    block->next = new_block;

    return new_block;
}
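As a sanity check, the two functions can be exercised on a static buffer treated as one big free block. The arena size and setup_arena helper here are illustrative; only the header fields match the lesson code:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0xDEADBEEF
#define MIN_BLOCK_SIZE (sizeof(block_header_t) + 16)

typedef struct block_header {  /* assumed layout, as in the lesson */
    size_t size;
    int free;
    uint32_t magic;
    struct block_header *next;
} block_header_t;

static size_t align8(size_t n) { return (n + 7) & ~(size_t)7; }

int can_split(block_header_t *block, size_t size) {
    size_t aligned = align8(size);
    if (block->size < aligned) return 0;
    size_t remaining = block->size - aligned;
    return remaining >= MIN_BLOCK_SIZE;
}

block_header_t *split_block(block_header_t *block, size_t size) {
    size_t aligned = align8(size);
    if (!can_split(block, size)) return NULL;

    /* New block starts right after the shrunken payload. */
    block_header_t *new_block = (block_header_t *)(
        (char *)block + sizeof(block_header_t) + aligned);
    new_block->size = block->size - aligned - sizeof(block_header_t);
    new_block->free = 1;
    new_block->magic = BLOCK_MAGIC;
    new_block->next = block->next;

    block->size = aligned;
    block->next = new_block;
    return new_block;
}

/* Demo arena: one free block covering the whole buffer. */
static _Alignas(8) char arena[1024];

block_header_t *setup_arena(void) {
    block_header_t *block = (block_header_t *)arena;
    block->size = sizeof(arena) - sizeof(block_header_t);
    block->free = 1;
    block->magic = BLOCK_MAGIC;
    block->next = NULL;
    return block;
}
```

Splitting a malloc(50) out of this arena leaves the first block holding 56 bytes (50 aligned up) and a new free block linked after it holding the rest.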

Debugging Allocators with GDB

Memory allocator bugs are notoriously hard to find. GDB is essential:

Basic Commands

# Compile with debug symbols
gcc -g -O0 my_malloc.c test.c -o test

# Start gdb
gdb ./test

Inspecting Block Headers

# Break when entering malloc
(gdb) break my_malloc

# Run until breakpoint
(gdb) run

# Print block header
(gdb) print *block
$1 = {size = 104, free = 1, magic = 3735928559, next = 0x0}

# Verify magic (0xDEADBEEF = 3735928559)
(gdb) print/x block->magic
$2 = 0xdeadbeef

# Walk the free list
(gdb) set $b = free_list_head
(gdb) while $b != 0
 > print *$b
 > set $b = $b->next
 > end

Catching Corruption

# Watchpoint fires whenever the magic field changes
(gdb) watch block->magic

# Restrict it to actual corruption (assuming gdb assigned it number 1)
(gdb) condition 1 block->magic != 0xDEADBEEF

# Break on invalid free
(gdb) break my_free if block->magic != 0xDEADBEEF

Common Debugging Patterns

  1. Use-after-free: Set freed memory to 0xFEEDFACE, break if accessed
  2. Double-free: Check if block already marked free
  3. Buffer overflow: Write sentinel patterns at block boundaries
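Patterns 1 and 2 are easy to bake into free itself. A sketch, assuming the same header fields; my_free_checked and FREED_BYTE are illustrative names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_MAGIC 0xDEADBEEF
#define FREED_BYTE  0xFE   /* poison value: freed memory reads back as 0xFEFEFE... */

typedef struct block_header {  /* assumed layout */
    size_t size;
    int free;
    uint32_t magic;
    struct block_header *next;
} block_header_t;

/* Returns 0 on success, -1 for a corrupt/foreign pointer, -2 for a double free. */
int my_free_checked(void *ptr) {
    if (!ptr) return 0;
    block_header_t *b = ((block_header_t *)ptr) - 1;
    if (b->magic != BLOCK_MAGIC) return -1;  /* header overwritten or bad pointer */
    if (b->free) return -2;                  /* already freed: double free */
    b->free = 1;
    memset(ptr, FREED_BYTE, b->size);        /* poison: use-after-free shows up in gdb */
    return 0;
}
```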

Using Your Allocator in Real Programs

Once the allocator works, you can load it into existing binaries without recompiling them.

Build a Shared Library

Linux:

clang -O0 -g -Wall -Wextra -shared -fPIC malloc.c -o malloc.so

macOS:

clang -O0 -g -Wall -Wextra -dynamiclib malloc.c -o malloc.dylib

Note: sbrk is deprecated on recent macOS versions, so behavior can be unreliable. For learning, it's fine; for production, use mmap/vm_allocate-style approaches instead.

Inject into a Process

Linux (bash/zsh):

export LD_PRELOAD=/absolute/path/malloc.so

macOS:

export DYLD_INSERT_LIBRARIES=/absolute/path/malloc.dylib

Now run a dynamically linked binary and it will use your allocator. On Linux this works for most programs (e.g., ls); on macOS, System Integrity Protection strips DYLD_INSERT_LIBRARIES from protected system binaries, so test with a program you built yourself.

mmap for Large Allocations

Real allocators use mmap for large allocations instead of sbrk:

#include <sys/mman.h>

#define MMAP_THRESHOLD (128 * 1024)  // 128KB

void *allocate_large(size_t size) {
    size_t total = sizeof(block_header_t) + size;

    void *ptr = mmap(NULL, total,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS,
                     -1, 0);

    if (ptr == MAP_FAILED) return NULL;

    block_header_t *block = ptr;
    block->size = size;
    block->free = 0;
    block->magic = BLOCK_MAGIC;
    block->next = NULL;

    return (void*)(block + 1);
}

void free_large(void *ptr, size_t size) {
    block_header_t *block = ((block_header_t*)ptr) - 1;
    munmap(block, sizeof(block_header_t) + size);
}

Why mmap for Large Blocks?

  1. Returns memory to OS: munmap immediately frees pages
  2. No fragmentation: Each allocation is independent
  3. Security: Fresh pages are zero-initialized
  4. Virtual address space: Doesn't grow the heap

Hybrid Strategy

void *my_malloc(size_t size) {
    if (size >= MMAP_THRESHOLD) {
        return allocate_large(size);
    }
    // Use sbrk-based allocator for small blocks
    return allocate_from_heap(size);
}

Thread Safety

Our allocator is NOT thread-safe. Here's why and how to fix it:

The Race Condition

// Thread 1                      // Thread 2
block = find_free_block(100);    block = find_free_block(100);
// Both find same block!
block->free = 0;                 block->free = 0;
return payload(block);           return payload(block);
// BOTH return the SAME memory!

Solution 1: Global Lock (Simple but Slow)

#include <pthread.h>

static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

void *my_malloc(size_t size) {
    pthread_mutex_lock(&malloc_lock);
    void *result = my_malloc_internal(size);
    pthread_mutex_unlock(&malloc_lock);
    return result;
}

void my_free(void *ptr) {
    pthread_mutex_lock(&malloc_lock);
    my_free_internal(ptr);
    pthread_mutex_unlock(&malloc_lock);
}

Solution 2: Per-Thread Arenas (Fast but Complex)

Production allocators (jemalloc, tcmalloc) use thread-local arenas:

__thread arena_t *thread_arena = NULL;

void *my_malloc(size_t size) {
    if (!thread_arena) {
        thread_arena = create_arena();
    }
    return arena_alloc(thread_arena, size);
}

Benefits:

  • No lock contention between threads
  • Cache-friendly (each thread has local data)
  • Scales with CPU cores
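A minimal sketch of the idea, assuming a hypothetical bump-pointer arena (create_arena and arena_alloc above are placeholders; real arenas manage per-thread free lists and refill from the OS):

```c
#include <stddef.h>
#include <stdlib.h>

#define ARENA_SIZE (1 << 20)   /* 1 MiB per thread; illustrative */

typedef struct {
    char *base;                /* backing memory for this thread */
    size_t used;               /* bump offset */
} arena_t;

static __thread arena_t *thread_arena = NULL;  /* GCC/Clang; C11 spells it _Thread_local */

static arena_t *create_arena(void) {
    arena_t *a = malloc(sizeof(arena_t));
    if (!a) return NULL;
    a->base = malloc(ARENA_SIZE);
    if (!a->base) { free(a); return NULL; }
    a->used = 0;
    return a;
}

static void *arena_alloc(arena_t *a, size_t size) {
    size = (size + 7) & ~(size_t)7;            /* keep 8-byte alignment */
    if (a->used + size > ARENA_SIZE) return NULL;
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

void *my_malloc(size_t size) {
    if (!thread_arena) thread_arena = create_arena();
    if (!thread_arena) return NULL;
    return arena_alloc(thread_arena, size);    /* no lock: arena is thread-local */
}
```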

Solution 3: Lock-Free Structures (Expert Level)

Using atomic operations and CAS (Compare-And-Swap):

#include <stdatomic.h>

_Atomic(block_header_t *) free_list_head = NULL;

void add_to_free_list(block_header_t *block) {
    block_header_t *expected;
    do {
        expected = atomic_load(&free_list_head);
        block->next = expected;
    } while (!atomic_compare_exchange_weak(
        &free_list_head, &expected, block));
}
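In this sketch the head is declared _Atomic(block_header_t *) so the pointer itself is atomic; writing `_Atomic block_header_t *` instead declares a plain pointer to an atomic struct, which the CAS call won't accept. Here is a single-threaded sanity check of the push with a minimal header; note that the matching lock-free pop needs ABA protection (e.g., tagged pointers), which this sketch doesn't cover:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct block_header {   /* trimmed to the fields the push touches */
    size_t size;
    struct block_header *next;
} block_header_t;

/* The pointer itself must be atomic: _Atomic(block_header_t *). */
static _Atomic(block_header_t *) free_list_head = NULL;

void add_to_free_list(block_header_t *block) {
    block_header_t *expected = atomic_load(&free_list_head);
    do {
        block->next = expected;  /* on CAS failure, 'expected' is refreshed */
    } while (!atomic_compare_exchange_weak(&free_list_head, &expected, block));
}
```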

Trade-offs Summary

Block Splitting:

  • Pros: Reduces internal fragmentation
  • Cons: More blocks, slight overhead

mmap for Large Blocks:

  • Pros: Returns memory to OS, no fragmentation
  • Cons: System call overhead, not for small allocations

Thread Safety:

  • Global lock: Simple but contention bottleneck
  • Per-thread arenas: Fast but memory overhead
  • Lock-free: Fastest but very complex to implement correctly

Minimum Block Size

Choosing MIN_BLOCK_SIZE wisely:

  • Too small: create many unusable fragments
  • Too large: waste space that could be used

16 bytes is reasonable - can hold at least two pointers or small data.

