Why MD5 still won't die — and where it's actually OK to use

5 min read #hash #md5 #security #engineering

If you Google “is MD5 broken,” every result on page one says yes. And they’re right — collisions in MD5 can be found in seconds on a modern laptop.

Yet MD5 ships in production code at every company I’ve ever consulted for. AWS S3 uses it for ETags on ordinary single-part uploads. Deduplication pipelines use it for fingerprinting. Plenty of CDN cache keys are MD5 digests.

This isn’t because everyone is bad at security. It’s because MD5 solves a different problem than the one it’s “broken” for.

What’s actually broken about MD5

The 2004 attack (Wang et al.) showed how to find two distinct inputs that produce the same MD5 hash. By 2008, this was reduced from minutes to seconds. By 2013, you could craft two PDFs with the same MD5 that displayed totally different content.

This breaks MD5 for any use where an attacker can choose the inputs being hashed and benefits when two of them collide.

Translation: MD5 is broken for cryptographic use.

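It is not broken for the much broader case of “given two random inputs that an attacker did not choose, how often do they collide by accident?” For a 128-bit hash that probability is still about 1 in 2^128 for a random pair, far less likely than picking the same atom twice from a gram of carbon. Even across a whole dataset, you would need on the order of 2^64 inputs before an accidental collision becomes likely.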
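To put rough numbers on "by accident," here is a small sketch using the standard birthday approximation. The function name is mine, and it assumes the hash behaves like a uniformly random 128-bit function on inputs nobody chose adversarially.

```javascript
// Approximate chance of at least one accidental collision among n random
// inputs hashed to `bits` bits, using the birthday approximation:
//   p ≈ 1 - exp(-n^2 / 2^(bits + 1))
function accidentalCollisionProbability(n, bits = 128) {
  const exponent = -(n * n) / Math.pow(2, bits + 1);
  // -Math.expm1(x) equals 1 - e^x and stays accurate when x is tiny.
  return -Math.expm1(exponent);
}

// A billion objects: the result is vanishingly small.
console.log(accidentalCollisionProbability(1e9));

// Around 2^64 objects: an accidental collision finally becomes plausible.
console.log(accidentalCollisionProbability(2 ** 64));
```

None of this says anything about inputs an attacker picks on purpose; for those, the collision attacks described above apply.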

Where MD5 is genuinely fine

1. ETags and HTTP cache keys

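S3 sets ETag: "<md5>" on most uploads (multipart and KMS-encrypted objects are the exceptions). Cloudflare and other CDNs compare ETags when revalidating cached responses. Not every ETag is an MD5, though: NGINX derives its built-in ETags from the file’s modification time and size rather than from a content hash.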

The threat model: a CDN caches a response. When the upstream returns a fresh response with a different ETag, the CDN refetches. The attacker is not in the loop. Even if they could somehow craft two identical-MD5 responses, all that happens is one of them gets served. The integrity guarantee comes from elsewhere.
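A minimal sketch of that flow in Node.js, assuming an MD5-based ETag. The helper names md5Etag and isFresh are illustrative, and the comparison is deliberately simplified: real If-None-Match headers can carry several values, weak validators, or "*".

```javascript
const crypto = require('node:crypto');

// MD5 used here for ETag generation only; not security-sensitive.
function md5Etag(body) {
  const digest = crypto.createHash('md5').update(body).digest('hex');
  return `"${digest}"`; // ETag values are quoted strings
}

// True when the client's cached copy still matches the current body,
// meaning the server could answer 304 Not Modified instead of resending it.
function isFresh(ifNoneMatchHeader, body) {
  return ifNoneMatchHeader === md5Etag(body);
}

const etag = md5Etag('hello world');
console.log(etag);
console.log(isFresh(etag, 'hello world'));  // unchanged body
console.log(isFresh(etag, 'hello world!')); // body changed, so refetch
```

The hash only answers "did the bytes change?" Integrity still has to come from TLS and from trusting the origin.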

2. Backup deduplication

Backup tools such as Borg and restic identify chunks with a cryptographic hash from the SHA-256 family. Other deduplication pipelines find candidate duplicates with a shorter, cheaper fingerprint (xxHash or truncated MD5). When two chunks have identical fingerprints, a byte-for-byte comparison confirms the match. MD5 here is just a fast first-pass filter, much like a Bloom filter.
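The fingerprint-then-confirm pattern can be sketched like this. ChunkStore is a hypothetical in-memory class, not how any particular backup tool is implemented, and the 16-hex-character truncation is an arbitrary choice for illustration.

```javascript
const crypto = require('node:crypto');

class ChunkStore {
  constructor() {
    // fingerprint -> list of stored chunks that share that fingerprint
    this.byFingerprint = new Map();
  }

  // Truncated MD5 as a cheap first-pass fingerprint; not security-sensitive.
  static fingerprint(chunk) {
    return crypto.createHash('md5').update(chunk).digest('hex').slice(0, 16);
  }

  // Returns true if the chunk was new and got stored, false if it was a duplicate.
  add(chunk) {
    const fp = ChunkStore.fingerprint(chunk);
    const bucket = this.byFingerprint.get(fp) ?? [];
    // A fingerprint match is only a hint; the byte comparison decides.
    if (bucket.some((stored) => stored.equals(chunk))) {
      return false;
    }
    bucket.push(Buffer.from(chunk));
    this.byFingerprint.set(fp, bucket);
    return true;
  }
}

const store = new ChunkStore();
console.log(store.add(Buffer.from('chunk A'))); // stored
console.log(store.add(Buffer.from('chunk A'))); // duplicate, skipped
console.log(store.add(Buffer.from('chunk B'))); // stored
```

Because the byte comparison has the final say, a fingerprint collision costs an extra comparison and never loses data.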

3. UUID v3 — deterministic UUIDs

UUID version 3 is defined as MD5(namespace UUID + name), with a few bits of the digest overwritten to mark the version and variant. It’s deterministic: the same input always produces the same UUID. Useful for generating UUIDs from URLs or other naming systems. Nobody is attacking the namespace.

(See guid.tooljo.com/uuid-v3 for this in action.)
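Node has no built-in v3 generator, so here is a hand-rolled sketch of the construction. The uuidV3 helper is illustrative, not a library API; NAMESPACE_DNS is the standard DNS namespace UUID.

```javascript
const crypto = require('node:crypto');

// Standard namespace UUID for DNS names.
const NAMESPACE_DNS = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';

function uuidV3(namespace, name) {
  // Hash the 16 raw namespace bytes followed by the UTF-8 bytes of the name.
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ''), 'hex');
  const hash = crypto
    .createHash('md5')
    .update(namespaceBytes)
    .update(Buffer.from(name, 'utf8'))
    .digest();

  hash[6] = (hash[6] & 0x0f) | 0x30; // set the version nibble to 3
  hash[8] = (hash[8] & 0x3f) | 0x80; // set the variant bits to 10xxxxxx

  const hex = hash.toString('hex');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

console.log(uuidV3(NAMESPACE_DNS, 'python.org'));
console.log(uuidV3(NAMESPACE_DNS, 'python.org')); // same input, same UUID
```

In real code a maintained UUID library is the safer choice; the point here is only that the MD5 step is a naming function, not a security boundary.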

4. Cache busting / asset fingerprinting

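Bundlers such as Webpack, Vite, and esbuild emit content-hashed filenames like app.[hash].js. The exact algorithm varies by tool and configuration, and MD5 is perfectly adequate for the job. The hash is just “did the bytes change since the last build.” If two different builds collide, the bug is “stale cache”: annoying, not exploitable.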
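If you are rolling your own build step, the whole idea fits in a few lines. This is a sketch, not what any bundler does internally; fingerprintedName and the 8-character truncation are assumptions made for the example.

```javascript
const crypto = require('node:crypto');
const path = require('node:path');

// MD5 used here for cache busting only; not security-sensitive.
function fingerprintedName(fileName, contents, length = 8) {
  const hash = crypto
    .createHash('md5')
    .update(contents)
    .digest('hex')
    .slice(0, length);
  const { name, ext } = path.parse(fileName);
  return `${name}.${hash}${ext}`;
}

console.log(fingerprintedName('app.js', 'console.log("v1");'));
console.log(fingerprintedName('app.js', 'console.log("v2");')); // different bytes, different name
```

The filename changes exactly when the bytes change, which is all a browser cache needs.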

5. File-level deduplication in storage systems

Storage backends use MD5 to deduplicate identical files. Same MD5 → same content → store once. Same threat model as backup dedup: nobody benefits from collisions in their own data.

Where MD5 is dangerous (the other side of the line)

How to use MD5 without leaking foot-guns

When you reach for MD5 in 2026, three rules:

  1. Document the threat model in a comment. “MD5 used here for ETag generation; not security-sensitive.” Future-you will need to know whether an inherited “we always SHA-256 now” migration applies.
  2. Don’t call it “secure” or “verified” in user-facing copy. Call it a checksum, fingerprint, or content hash.
  3. Make the algorithm easy to swap. A const HASH_ALGO = 'md5' constant in one place beats a hundred hardcoded crypto.createHash('md5') calls.
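Rules 1 and 3 together might look like the sketch below. The contentHash wrapper is a hypothetical helper; the optional algo parameter is only there so a migration can be tested before flipping the constant.

```javascript
const crypto = require('node:crypto');

// MD5 used here for content fingerprinting (ETags, cache keys);
// not security-sensitive. Do not reuse for signatures or passwords.
// Changing this one constant switches every caller, e.g. to 'sha256'.
const HASH_ALGO = 'md5';

function contentHash(data, algo = HASH_ALGO) {
  return crypto.createHash(algo).update(data).digest('hex');
}

console.log(contentHash('hello world'));           // uses HASH_ALGO
console.log(contentHash('hello world', 'sha256')); // explicit override
```

Callers never name the algorithm, so a future "we always SHA-256 now" decision becomes a one-line change plus whatever re-hashing of stored values the migration needs.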

What’s likely to actually replace MD5

For non-security high-throughput hashing, the modern alternatives are:

Most projects don’t migrate MD5 because the cost of “switch hash algorithm everywhere” outweighs the marginal performance gain. Until the next time someone confuses non-security MD5 with security MD5 in review, anyway.

TL;DR