Hash Functions & Merkle Structures
Lesson, slides, and applied problem sets.
View SlidesLesson
Hash Functions & Merkle Structures
Why this module exists
Hashes are everywhere: integrity checks, passwords, signatures, blockchains, content-addressed storage, and protocols. But using them safely requires more than calling sha256.Sum256. You must understand what properties are actually provided, how the construction works, and where the attacks live.
This module goes deep on hash function security, Merkle-Damgard structure, length extension, domain separation, and Merkle trees. You will implement SHA-256 from scratch and perform a real attack on a naive MAC.
1) What a hash function promises (and does not)
A cryptographic hash is a deterministic function H(m) -> n bits with three core security properties:
- Preimage resistance: given y, hard to find m where H(m)=y.
- Second preimage resistance: given m, hard to find m' != m where H(m')=H(m).
- Collision resistance: hard to find any m1 != m2 with same hash.
These are different properties with different attack costs.
Generic costs for an n-bit hash
- Preimage: ~2^n
- Second preimage: ~2^n
- Collision: ~2^(n/2) (birthday bound)
Collision resistance is weaker. That is why a 256-bit hash gives only 128-bit collision security.
2) Merkle-Damgard construction
Many classic hashes (MD5, SHA-1, SHA-256) are built using Merkle-Damgard:
- Pad the message to a multiple of the block size.
- Iterate a compression function over fixed-size blocks.
- The final chaining value is the digest.
This gives streaming and incremental hashing: you can hash data as it arrives. But it also introduces a structural weakness: length extension.
3) Padding: the hidden part of the message
Merkle-Damgard padding is precise:
- Append a 1 bit (0x80).
- Append 0 bits until the message length is 56 mod 64 bytes.
- Append the original length (in bits) as a 64-bit big-endian integer.
If two different messages result in the same padded blocks, the hash will match. That is why you must use a canonical encoding before hashing.
4) Length extension attacks
If a system uses a naive MAC like:
MAC = H(key || message)
an attacker who knows MAC and message can compute a valid MAC for message || padding || suffix without knowing the key. This is a length extension attack and it breaks many homegrown protocols.
The fix is HMAC or any construction that hides the internal state and breaks the "continue hashing" property.
5) HMAC vs "hash + key"
HMAC does:
HMAC = H((key xor opad) || H((key xor ipad) || message))
Because the outer hash is keyed, the attacker cannot continue hashing even if they know the inner digest. This is why HMAC is safe against length extension.
6) Merkle trees
A Merkle tree lets you commit to a set of leaves with a single root hash.
- Leaves are hashed items.
- Internal nodes hash the concatenation of child hashes.
- A proof for one leaf is the list of sibling hashes from leaf to root.
Properties:
- Root changes if any leaf changes.
- Proof size is O(log N).
- Verification is O(log N).
These trees power blockchains, file synchronization, and tamper-evident logs.
7) Domain separation and encoding
Hash inputs must be unambiguous. If you hash concatenations, you must separate domains and encode lengths to avoid collisions like:
H("ab" || "c") == H("a" || "bc")
Fixes:
- Prefix with a tag (0x00 for leaf, 0x01 for node).
- Include lengths in a fixed-width format.
- Use structured serialization with canonical rules.
8) Practical pitfalls
- Hashing structured data without canonicalization (e.g., JSON with different key orders).
- Using a raw hash as a MAC (length extension risk).
- Reusing hashes across protocols without domain separation.
- Assuming collision resistance implies preimage resistance (it does not).
What you will build
- SHA-256 from scratch: padding, message schedule, compression, and digest output.
- Merkle root + proofs: build a tree and verify membership proofs.
- Length extension attack: forge a MAC without the secret key.
- Hash commitments: build a binding/hiding commitment with domain separation.
Key takeaways
- Hash security is about precise properties and attack costs.
- Merkle-Damgard gives streaming but enables length extension.
- HMAC exists for a reason. Do not roll your own MAC.
- Merkle trees give logarithmic proofs and tamper-evident structure.
- Encoding and domain separation prevent real-world collisions.
Module Items
SHA-256 From Scratch
Implement the SHA-256 hash function without using crypto libraries.
Merkle Root and Proofs
Build Merkle roots, generate proofs, and verify membership.
Length Extension Attack
Forge a secret-prefix MAC using a length extension attack.
Hash-Based Commitments
Create and verify a salted hash commitment with domain separation.
Hash Functions & Merkle Structures Checkpoint
Collision/preimage resistance, Merkle-Damgard, length extension, and Merkle proofs.