Why MD5 still won't die — and where it's actually OK to use

If you Google “is MD5 broken,” every result on page one says yes. And they’re right — collisions in MD5 can be found in seconds on a modern laptop.

Yet MD5 ships in production code at every company I’ve ever consulted for. AWS S3 uses it for the ETag of ordinary single-part uploads. UUID v3 is defined in terms of it. Countless cache keys, dedup fingerprints, and asset filenames are MD5 digests.

This isn’t because everyone is bad at security. It’s because MD5 solves a different problem than the one it’s “broken” for.

What’s actually broken about MD5

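The 2004 attack (Wang et al.) showed how to find two distinct inputs that produce the same MD5 hash. Within about two years, refinements had cut the cost from roughly an hour on a supercomputer to under a minute on a laptop. Chosen-prefix collisions followed in 2007, and by 2008 researchers had used them to forge a rogue CA certificate and to craft PDFs with the same MD5 that displayed totally different content.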

This breaks MD5 for any use where all three of the following hold:

An attacker can choose the input.
The hash is being used as a unique identity (signature, certificate, content address).
The attacker benefits from producing two inputs with the same hash.

Translation: MD5 is broken for cryptographic use.

It is not broken for the much broader case of “given two random inputs that an attacker did not choose, how often do they collide by accident?” That probability is still 1 in 2¹²⁸ for a random pair. By the birthday bound, you would need on the order of 2⁶⁴ (about 1.8 × 10¹⁹) hashed items before an accidental collision anywhere in the set becomes likely.
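
To put numbers on accidental collisions, here is a minimal sketch of the standard birthday-bound approximation. The function name is made up for illustration, and the model assumes hash outputs behave like uniform random 128-bit values, which is only reasonable when nobody is choosing the inputs adversarially.

```python
import math


def accidental_collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Approximate P(at least one collision) among n_items random hashes."""
    pairs = n_items * (n_items - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for n in (2, 10**9, 10**12, 2**64):
        p = accidental_collision_probability(n)
        print(f"{n:.3e} items -> p = {p:.3e}")
```

Even a trillion hashed items leave the probability far below anything worth engineering around; it only becomes substantial near 2⁶⁴ items.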

Where MD5 is genuinely fine

1. ETags and HTTP cache keys

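S3 sets ETag: "<md5>" on ordinary single-part uploads (multipart uploads and SSE-KMS objects get an ETag that is not a plain MD5 of the content). Some web frameworks and caches likewise derive ETags or cache keys from an MD5 of the response body. Not every server does: NGINX builds its ETags from the file’s modification time and size, with no hash involved.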

The threat model: a CDN caches a response. When the upstream returns a fresh response with a different ETag, the CDN refetches. The attacker is not in the loop. Even if they could somehow craft two identical-MD5 responses, all that happens is one of them gets served. The integrity guarantee comes from elsewhere.
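
As a rough illustration of that flow, the sketch below derives an ETag from an MD5 of the body and answers a conditional request. The helper names `make_etag` and `respond` are hypothetical, and the comparison is deliberately simplified: real If-None-Match handling also has to deal with lists of tags, `*`, and weak validators.

```python
import hashlib
from typing import Optional, Tuple


def make_etag(body: bytes) -> str:
    """Quoted strong ETag derived from an MD5 of the response body."""
    # usedforsecurity=False (Python 3.9+) marks this as a non-security use.
    digest = hashlib.md5(body, usedforsecurity=False).hexdigest()
    return f'"{digest}"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, bytes]:
    """Return (status, payload): 304 with no body if the client copy is current."""
    if if_none_match is not None and if_none_match == make_etag(body):
        return 304, b""
    return 200, body


if __name__ == "__main__":
    etag = make_etag(b"hello")
    print(etag, respond(b"hello", etag), respond(b"hello, world", etag))
```

Nothing here depends on collision resistance: the ETag only has to change when the bytes change.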

2. Backup deduplication

Borg and restic themselves address chunks by a cryptographic hash (SHA-256 or a keyed variant). A common pattern in simpler or home-grown dedup pipelines, though, is a two-stage check: a short, fast fingerprint (xxhash or truncated MD5) nominates candidate duplicates, and a byte-for-byte comparison confirms them. MD5 here is just a fast Bloom-filter-like first pass.
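
The two-stage idea can be sketched in a few lines. This is an in-memory toy, not how any particular backup tool works: `ChunkStore` and its 8-byte truncated fingerprint are assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """Toy dedup store: cheap fingerprint first, byte comparison to confirm."""

    def __init__(self) -> None:
        self._by_fingerprint: Dict[bytes, List[bytes]] = {}
        self.stored_chunks = 0

    @staticmethod
    def _fingerprint(chunk: bytes) -> bytes:
        # Truncated MD5: only a filter, so a false match is harmless.
        return hashlib.md5(chunk, usedforsecurity=False).digest()[:8]

    def add(self, chunk: bytes) -> bool:
        """Store the chunk unless an identical one exists. True if newly stored."""
        candidates = self._by_fingerprint.setdefault(self._fingerprint(chunk), [])
        if any(existing == chunk for existing in candidates):
            return False  # confirmed duplicate, nothing new to store
        candidates.append(chunk)
        self.stored_chunks += 1
        return True


if __name__ == "__main__":
    store = ChunkStore()
    print(store.add(b"a" * 1024), store.add(b"a" * 1024), store.stored_chunks)
```

Because the byte comparison has the final say, a fingerprint collision costs one extra comparison and never merges two different chunks.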

3. UUID v3 — deterministic UUIDs

UUID version 3 is defined as MD5(namespace UUID + name), with six bits of the result overwritten to mark the version and variant. It’s deterministic: the same input always produces the same UUID. That is useful for generating UUIDs from URLs or other naming systems. Nobody is attacking the namespace.

(See guid.tooljo.com/uuid-v3 for this in action.)
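
For a concrete view of the construction, the sketch below rebuilds a version 3 UUID by hand and compares it with Python's standard `uuid.uuid3`. The helper name `uuid3_by_hand` and the example URL are placeholders.

```python
import hashlib
import uuid


def uuid3_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """MD5 over namespace bytes + name, then stamp the version/variant bits."""
    digest = hashlib.md5(
        namespace.bytes + name.encode("utf-8"), usedforsecurity=False
    ).digest()
    # version=3 makes UUID() overwrite the version and variant bits.
    return uuid.UUID(bytes=digest, version=3)


if __name__ == "__main__":
    name = "https://example.com/"
    print(uuid3_by_hand(uuid.NAMESPACE_URL, name))
    print(uuid.uuid3(uuid.NAMESPACE_URL, name))  # same value
```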

4. Cache busting / asset fingerprinting

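Bundlers such as Webpack, Vite, and esbuild emit content-hashed filenames like app.[hash].js. The algorithm behind that hash varies by tool and is rarely a security-grade choice (Webpack has long defaulted to MD4, which is even weaker than MD5), and hand-rolled build scripts very often use MD5. The hash only answers “did the bytes change since the last build?” If two different builds collide, the bug is “stale cache”: annoying, not exploitable.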
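
A hand-rolled version of the same idea might look like the sketch below. The function name and the 8-character truncation are illustrative choices, not what any specific bundler does.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.md5(content, usedforsecurity=False).hexdigest()[:length]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


if __name__ == "__main__":
    print(fingerprinted_name("static/js/app.js", b"console.log('v1');"))
    print(fingerprinted_name("static/js/app.js", b"console.log('v2');"))
```

The filename changes whenever the bytes change, which is all a cache-busting scheme needs.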

5. File-level deduplication in storage systems

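Some storage backends use MD5 to deduplicate identical files. Same MD5 → same content → store once. The threat model matches backup dedup as long as every file comes from a party you trust: nobody benefits from collisions in their own data. If untrusted users can upload into a shared store, an attacker does choose the input, and the dedup key needs a collision-resistant hash instead.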
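
For whole files, the hash is usually computed in streaming fashion so large files never have to fit in memory. The sketch below groups streams by (size, MD5); the function names are hypothetical, and it assumes all inputs come from a trusted source, as discussed above.

```python
import hashlib
import io
from collections import defaultdict
from typing import BinaryIO, Dict, List, Tuple


def stream_md5(stream: BinaryIO, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Read the stream in blocks and return (total size, MD5 hex digest)."""
    hasher = hashlib.md5(usedforsecurity=False)
    size = 0
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        hasher.update(block)
        size += len(block)
    return size, hasher.hexdigest()


def group_duplicates(files: Dict[str, BinaryIO]) -> List[List[str]]:
    """Return groups of names whose contents share the same size and MD5."""
    groups: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for name, stream in files.items():
        groups[stream_md5(stream)].append(name)
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    demo = {
        "a.txt": io.BytesIO(b"same bytes"),
        "b.txt": io.BytesIO(b"same bytes"),
        "c.txt": io.BytesIO(b"different"),
    }
    print(group_duplicates(demo))
```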

Where MD5 is dangerous (the other side of the line)

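Verifying a download against malicious tampering. If the publisher is honest, the checksum comes from a trusted channel, and only the mirror is hostile, MD5 still holds up in practice, because swapping the file would take a second-preimage attack and none is practical against MD5. If the publisher themselves wants to trick you (or the publisher is compromised), MD5 fails: they can prepare a harmless file and a malicious file that share the same MD5. Use SHA-256 + signed checksums.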
Password storage. SHA-256 is fast; MD5 is even faster. Both are bad for passwords. Use argon2id, scrypt, or bcrypt.
Digital signatures. MD5-based signatures (still found in old PKI) are forgeable.
Code-signing, certificate signing. Migrate now if you haven’t.
Anywhere “we just want a hash, surely MD5 is fine” — without thinking about whether an attacker is in the loop.
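
For the password case in the list above, a minimal sketch using scrypt from Python's standard library follows. The cost parameters (n=2**14, r=8, p=1) are commonly cited starting values rather than a recommendation for any particular system, and the function names are made up for illustration; `hashlib.scrypt` also requires a Python build linked against a suitable OpenSSL.

```python
import hashlib
import hmac
import os
from typing import Tuple

# Assumed cost parameters; tune them for your own hardware and latency budget.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, dklen=32)


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, derived key) for storage. A fresh random salt per password."""
    salt = os.urandom(16)
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Re-derive the key and compare in constant time."""
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(key, expected_key)


if __name__ == "__main__":
    salt, key = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, key))
    print(verify_password("wrong guess", salt, key))
```

The point of the contrast: a password hash is deliberately slow and memory-hungry, which is the opposite of what makes MD5 attractive for fingerprinting.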

How to use MD5 without leaving foot-guns

When you reach for MD5 in 2026, three rules:

Document the threat model in a comment. “MD5 used here for ETag generation; not security-sensitive.” Future-you will need to know whether an inherited “we always SHA-256 now” migration applies.
Don’t call it “secure” or “verified” in user-facing copy. Call it a checksum, fingerprint, or content hash.
Make the algorithm easy to swap. A const HASH_ALGO = 'md5' constant in one place beats a hundred hardcoded crypto.createHash('md5') calls.
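
The first and third rules can be combined in one small module. The inline example above is Node.js; this is a rough Python equivalent, with `content_fingerprint` as a hypothetical helper name.

```python
import hashlib

# MD5 used here for content fingerprints (ETags, cache keys); not security-sensitive.
# Change this one constant to migrate every caller, e.g. to "sha256".
HASH_ALGO = "md5"


def content_fingerprint(data: bytes) -> str:
    """Hex fingerprint of data using the project-wide non-security algorithm."""
    # usedforsecurity=False (Python 3.9+) documents the intent in code as well.
    return hashlib.new(HASH_ALGO, data, usedforsecurity=False).hexdigest()


if __name__ == "__main__":
    print(HASH_ALGO, content_fingerprint(b"hello"))
```

Callers never name the algorithm, so a later migration is a one-line change plus whatever re-hashing of stored values the application needs.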

What’s likely to actually replace MD5

For non-security high-throughput hashing, the modern alternatives are:

xxhash (xxh3, xxh64, xxh128): designed for speed. Multiple GB/s on a single core. Close to a drop-in replacement for MD5 in most non-security uses (xxh128 matches MD5’s 128-bit output width).
BLAKE3: cryptographically secure and faster than MD5 and SHA-256 on most hardware, though typically still slower than xxh3. Useful when you might want to upgrade your threat model later.
CRC32: for very-low-cost change detection where collisions are tolerable. Common in databases.
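
Of the three, only CRC32 ships in Python's standard library (xxhash and BLAKE3 need third-party packages), so the sketch below sticks to it. The helper names are hypothetical. With only 32 bits of output, accidental collisions become likely after tens of thousands of items, so this suits "did this one record change?" checks rather than deduplication across a large set.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC32 of data as 8 lowercase hex characters."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def has_changed(previous_crc: str, data: bytes) -> bool:
    """Cheap change detection: compare against a previously stored CRC32."""
    return crc32_hex(data) != previous_crc


if __name__ == "__main__":
    stored = crc32_hex(b"row payload v1")
    print(has_changed(stored, b"row payload v1"))
    print(has_changed(stored, b"row payload v2"))
```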

Most projects don’t migrate MD5 because the cost of “switch hash algorithm everywhere” outweighs the marginal performance gain. Until the next time someone confuses non-security MD5 with security MD5 in review, anyway.

TL;DR

MD5 is broken for security. It’s been broken for 20 years.
MD5 is fine for the non-security hashing it’s actually used for in production: ETags, dedup, fingerprinting, cache keys, UUID v3.
The mistake isn’t using MD5; it’s using it for security purposes without realising it.
For new code, default to SHA-256 unless you have a specific reason to choose otherwise. See which hash function to use for the decision tree.
Try the hash calculator to compute MD5 (and everything else) side-by-side.