I fully expect I have a bug somewhere or am misunderstanding something, but why does the following code not appear to exhibit uniform distribution?
func TestMD5(t *testing.T) {
n := 50000
counts := map[uint32]int{} // # of hashes per 1/nth shard
for i := 0; i < n; i++ {
hash := md5.Sum(newUUID())
result := binary.BigEndian.Uint32(hash[:4])
counts[result/uint32(n)]++
}
dupeShards := 0
dupeEntries := 0
for _, count := range counts {
if count > 1 {
dupeShards++
dupeEntries += count - 1
}
}
t.Logf("%d inputs hashed to the same %d shards as other inputs.", dupeEntries, dupeShards)
if len(counts) < n*95/100 {
t.Fatalf("%d populated shards not within 5%% of expected %d uniform distribution!", len(counts), n)
}
}
https://play.golang.org/p/05mA0Dl9GBG
—
Explanation of code:
==> I'd expect the 50k MD5 sums to be ~evenly distributed across the 50k shards, but I consistently see only ~38k shards populated, with clumping in ~10k of the shards:
main.go:29: 12075 inputs hashed to the same 9921 shards as other inputs.
main.go:32: 37925 populated shards not within 5% of expected 50000 uniform distribution!
I can repro this with other hashes too (e.g. FNV), so I'm guessing I'm misunderstanding something. Thank you for the help!
This is absolutely normal behavior, and doesn't show any bias or incorrectness of the MD5 implementation.
What you are doing is (very close to) taking 50,000 random numbers between 0 and 49,999. When you do this, it's almost certain that many of the numbers will be repeated, and therefore that some numbers won't appear. It would in fact be very unlikely that the 50,000 numbers should all be different with absolutely no repetitions.
You can test this with a six-sided dice - if you throw it 6 times, you're very unlikely to get all six numbers, and much more likely to see around 3, 4 or 5 of them, with one, two or three repetitions. It's also related to the so-called birthday paradox.
Another example of this phenomenon is the 'Panini sticker question'. A Panini sticker album is a book with space for around 600 football stickers which commemorate the World Cup of soccer. Each one is numbered and different, and they feature randomly in packets. You have to get one of each number to complete the album. Suppose that you bought exactly the right number of stickers to fill the album. It would be extremely lucky if you were able to fill the album perfectly, without having any doubles or missing stickers. In fact you have to buy on average a large multiple of the number of stickers in order to get at least one of each (if you don't swap duplicates with other collectors).
The number of different values 0-49,999 which appear and the number which show 'clumping' can be calculated mathematically. I'm not sure exactly how you measure clumping. But the value of 38K populated values will be quite stable from one trial to the next, even though the actual values you see will change.
In fact, the expected number of populated values is (1 - 1/e)n, where n is the number of possible values, and e is the mathematical constant 2.718281828... The answer for n=50000 is 31606. You won't always get this value of course, but all results should be within a few hundred or so (spitballing here). You made a slight mistake in your program so I haven't been able to decipher the relevant calculation that gives you ~37000.