Hash Functions & Merkle Structures

Lesson, slides, and applied problem sets.

View Slides

Lesson

Hash Functions & Merkle Structures

Why this module exists

Hashes are everywhere: integrity checks, passwords, signatures, blockchains, content-addressed storage, and protocols. But using them safely requires more than calling sha256.Sum256. You must understand what properties are actually provided, how the construction works, and where the attacks live.

This module goes deep on hash function security, Merkle-Damgard structure, length extension, domain separation, and Merkle trees. You will implement SHA-256 from scratch and perform a real attack on a naive MAC.


1) What a hash function promises (and does not)

A cryptographic hash is a deterministic function H(m) -> n bits with three core security properties:

  • Preimage resistance: given y, hard to find m where H(m)=y.
  • Second preimage resistance: given m, hard to find m' != m where H(m')=H(m).
  • Collision resistance: hard to find any m1 != m2 with same hash.

These are different properties with different attack costs.

Generic costs for an n-bit hash

  • Preimage: ~2^n
  • Second preimage: ~2^n
  • Collision: ~2^(n/2) (birthday bound)

Collision resistance is weaker. That is why a 256-bit hash gives only 128-bit collision security.


2) Merkle-Damgard construction

Many classic hashes (MD5, SHA-1, SHA-256) are built using Merkle-Damgard:

  1. Pad the message to a multiple of the block size.
  2. Iterate a compression function over fixed-size blocks.
  3. The final chaining value is the digest.

This gives streaming and incremental hashing: you can hash data as it arrives. But it also introduces a structural weakness: length extension.


3) Padding: the hidden part of the message

Merkle-Damgard padding is precise:

  • Append a 1 bit (0x80).
  • Append 0 bits until the message length is 56 mod 64 bytes.
  • Append the original length (in bits) as a 64-bit big-endian integer.

If two different messages result in the same padded blocks, the hash will match. That is why you must use a canonical encoding before hashing.


4) Length extension attacks

If a system uses a naive MAC like:

MAC = H(key || message)

an attacker who knows MAC and message can compute a valid MAC for message || padding || suffix without knowing the key. This is a length extension attack and it breaks many homegrown protocols.

The fix is HMAC or any construction that hides the internal state and breaks the "continue hashing" property.


5) HMAC vs "hash + key"

HMAC does:

HMAC = H((key xor opad) || H((key xor ipad) || message))

Because the outer hash is keyed, the attacker cannot continue hashing even if they know the inner digest. This is why HMAC is safe against length extension.


6) Merkle trees

A Merkle tree lets you commit to a set of leaves with a single root hash.

  • Leaves are hashed items.
  • Internal nodes hash the concatenation of child hashes.
  • A proof for one leaf is the list of sibling hashes from leaf to root.

Properties:

  • Root changes if any leaf changes.
  • Proof size is O(log N).
  • Verification is O(log N).

These trees power blockchains, file synchronization, and tamper-evident logs.


7) Domain separation and encoding

Hash inputs must be unambiguous. If you hash concatenations, you must separate domains and encode lengths to avoid collisions like:

H("ab" || "c") == H("a" || "bc")

Fixes:

  • Prefix with a tag (0x00 for leaf, 0x01 for node).
  • Include lengths in a fixed-width format.
  • Use structured serialization with canonical rules.

8) Practical pitfalls

  • Hashing structured data without canonicalization (e.g., JSON with different key orders).
  • Using a raw hash as a MAC (length extension risk).
  • Reusing hashes across protocols without domain separation.
  • Assuming collision resistance implies preimage resistance (it does not).

What you will build

  1. SHA-256 from scratch: padding, message schedule, compression, and digest output.
  2. Merkle root + proofs: build a tree and verify membership proofs.
  3. Length extension attack: forge a MAC without the secret key.
  4. Hash commitments: build a binding/hiding commitment with domain separation.

Key takeaways

  • Hash security is about precise properties and attack costs.
  • Merkle-Damgard gives streaming but enables length extension.
  • HMAC exists for a reason. Do not roll your own MAC.
  • Merkle trees give logarithmic proofs and tamper-evident structure.
  • Encoding and domain separation prevent real-world collisions.

Module Items