I have user-generated strings coming in at an undefined rate, some of which are duplicates, and I'd like to keep a real-time count of the top-20 most common duplicates over a fixed time period (e.g., over the last hour), in Go.
The number of unique strings is not limited in any way, so, to avoid DoS, the data structure probably has to have a bounded size (e.g., at most 10k elements and/or 1MB overall), and drop the least recently inserted elements if they don't have any duplicates yet (but never drop any newly incoming elements!).
My understanding is that this is exactly how ngx_http_limit_req_module.c is implemented, and the documentation refers to this method as a “leaky bucket”. However, the Wikipedia page appears to suggest that it's the new data that would be dropped from the queue, not the old, so I'm not sure the concept applies.
Regardless, I tried looking for a “leaky bucket” implementation in Golang, and by far the most popular result I found is uber-go/ratelimit, whose API doesn't appear to fit my problem statement at all: it implements an actual rate-limiting queue, not a top-X-over-the-last-Y count in real time.
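For context, typical use of uber-go/ratelimit looks roughly like the sketch below (the rate of 100/s is arbitrary): Take() just blocks the caller to enforce a pace, which is a different problem from counting duplicates over a window.

    package main

    import (
        "fmt"
        "time"

        "go.uber.org/ratelimit"
    )

    func main() {
        rl := ratelimit.New(100) // allow roughly 100 calls per second

        prev := time.Now()
        for i := 0; i < 5; i++ {
            now := rl.Take() // blocks until this call is allowed to proceed
            fmt.Println(i, now.Sub(prev))
            prev = now
        }
    }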
Can anyone suggest the proper name for what I'm looking for, and the best way to accomplish this, preferably in Go?
This is really two problems: counting occurrences per name over a sliding window, and keeping memory bounded when the set of names is unbounded.
For the first problem I'd recommend keeping track, per name and per minute, of how many occurrences there were. As each minute finishes, add its count to a running total, and add it to a queue of counts to subtract back out an hour later. This gives you at most 60 small objects per name, plus a running hash map from name to its current total.
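A minimal sketch of that bookkeeping in Go, assuming the caller invokes Tick once a minute; SlidingCounter, Observe, and Tick are illustrative names, not an existing library:

    package slidingcount

    import (
        "sync"
        "time"
    )

    // bucket records how many times a name was seen during one finished minute.
    type bucket struct {
        name   string
        minute time.Time
        count  int
    }

    // SlidingCounter keeps per-name totals over the last hour by accumulating
    // per-minute buckets and subtracting them back out once they are an hour old.
    type SlidingCounter struct {
        mu      sync.Mutex
        totals  map[string]int // name -> occurrences over the last hour
        current map[string]int // name -> occurrences in the current minute
        expiry  []bucket       // finished minutes, oldest first
    }

    func NewSlidingCounter() *SlidingCounter {
        return &SlidingCounter{
            totals:  make(map[string]int),
            current: make(map[string]int),
        }
    }

    // Observe records one occurrence of name.
    func (c *SlidingCounter) Observe(name string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.current[name]++
        c.totals[name]++
    }

    // Tick closes out the current minute and expires buckets older than an hour.
    // Call it once a minute.
    func (c *SlidingCounter) Tick(now time.Time) {
        c.mu.Lock()
        defer c.mu.Unlock()
        for name, n := range c.current {
            c.expiry = append(c.expiry, bucket{name: name, minute: now, count: n})
        }
        c.current = make(map[string]int)

        cutoff := now.Add(-time.Hour)
        for len(c.expiry) > 0 && c.expiry[0].minute.Before(cutoff) {
            b := c.expiry[0]
            c.expiry = c.expiry[1:]
            c.totals[b.name] -= b.count
            if c.totals[b.name] <= 0 {
                delete(c.totals, b.name)
            }
        }
    }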
The second problem is more challenging. For this I'd use a probabilistic approach. The idea is that each occurrence of a name is hashed together with a unique id, and you keep only the thousand smallest hash values (and associated names) that you see in a minute. (I'll give an algorithm for that in a minute.) Your hash values should be evenly distributed over the 2^64 possible values, independently of the name, so common names will eventually show up in this list. When they do, you start counting them! (You will lose the first few occurrences, but with more work you can estimate how many you missed. That optimization may be more work than it is worth, though.)
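A minimal sketch of that per-occurrence hash, assuming the unique id comes from a simple counter; hashOccurrence is an illustrative helper, not an existing API:

    package slidingcount

    import (
        "encoding/binary"
        "hash/fnv"
    )

    // hashOccurrence hashes one occurrence of name together with a unique
    // occurrence id, so repeated names still produce fresh, roughly uniform
    // 64-bit values.
    func hashOccurrence(name string, id uint64) uint64 {
        h := fnv.New64a()
        h.Write([]byte(name))
        var buf [8]byte
        binary.LittleEndian.PutUint64(buf[:], id)
        h.Write(buf[:])
        return h.Sum64()
    }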
Now how do we keep the thousand smallest hash values? Use a priority queue, usually implemented with a heap: an updatable data structure that makes it easy to find and remove the largest hash value currently kept. So you run the following pseudo-code:
    create your priority queue of (hash, name)
    for each name:
        hash = hash of (name, unique new id)
        entry = (hash, name)
        if queue size < 1000:
            insert entry
        else if hash is smaller than the current max in the queue:
            insert entry
            remove the largest entry
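In Go this might look like the sketch below, using container/heap as a max-heap so the largest kept hash sits at the root and is cheap to evict; BottomK and the limit of 1000 are illustrative, not an existing package:

    package slidingcount

    import "container/heap"

    // sample is one sampled occurrence: the name and its occurrence hash.
    type sample struct {
        hash uint64
        name string
    }

    // maxHeap orders samples by hash, largest first, so the largest hash in
    // the kept set is always at the root and easy to evict.
    type maxHeap []sample

    func (h maxHeap) Len() int           { return len(h) }
    func (h maxHeap) Less(i, j int) bool { return h[i].hash > h[j].hash }
    func (h maxHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
    func (h *maxHeap) Push(x any)        { *h = append(*h, x.(sample)) }
    func (h *maxHeap) Pop() any {
        old := *h
        n := len(old)
        x := old[n-1]
        *h = old[:n-1]
        return x
    }

    // BottomK keeps the k smallest occurrence hashes (and their names) seen so far.
    type BottomK struct {
        k int
        h maxHeap
    }

    func NewBottomK(k int) *BottomK { return &BottomK{k: k} }

    // Offer considers one hashed occurrence and keeps it if it is among the
    // k smallest hashes seen so far.
    func (b *BottomK) Offer(hash uint64, name string) {
        switch {
        case b.h.Len() < b.k:
            heap.Push(&b.h, sample{hash: hash, name: name})
        case hash < b.h[0].hash:
            heap.Push(&b.h, sample{hash: hash, name: name})
            heap.Pop(&b.h) // drop the largest so we stay at k entries
        }
    }

    // Names returns the names currently in the sample; common names tend to
    // show up here repeatedly, which is the signal to start counting them.
    func (b *BottomK) Names() []string {
        out := make([]string, 0, b.h.Len())
        for _, s := range b.h {
            out = append(out, s.name)
        }
        return out
    }

Feeding each incoming occurrence through something like hashOccurrence and then Offer keeps memory at a fixed k entries no matter how many unique names arrive, which addresses the bounded-size requirement in the question.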