Have a look at rendezvous hashing (https://en.wikipedia.org/wiki/Rendezvous_hashing). It's simpler, and more general than 'consistent hashing'. Eg you don't have to muck around with virtual nodes. Everything just works out, even for small numbers of targets.
It's also easier to come up with an exact weighted version of rendezvous hashing. See https://en.wikipedia.org/wiki/Rendezvous_hashing#Weighted_re... for the weighted variant.
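The core of it is only a few lines. A rough Python sketch of both the plain and the weighted version (my own toy code under made-up names, not a vetted library):

    import hashlib
    import math

    def _h64(s: str) -> int:
        # 64-bit hash of an arbitrary string
        return int.from_bytes(hashlib.blake2b(s.encode()).digest()[:8], "big")

    def hrw_pick(key: str, nodes: list[str]) -> str:
        # Highest random weight: every client with the same node list picks the
        # same node, and removing a node only reassigns the keys it was serving.
        return max(nodes, key=lambda n: _h64(f"{n}:{key}"))

    def weighted_hrw_pick(key: str, weights: dict[str, float]) -> str:
        # Weighted variant: node i wins a share of keys roughly proportional to
        # its weight, using score = -w_i / ln(u_i) with u_i uniform in (0, 1).
        def score(node: str) -> float:
            u = (_h64(f"{node}:{key}") + 0.5) / 2**64
            return -weights[node] / math.log(u)
        return max(weights, key=score)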
Faintly related: if you are into load balancing, you might also want to look into the 'power of 2 choices'. See eg https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.... or this HN discussion at https://news.ycombinator.com/item?id=37143376
The basic idea is that you can vastly improve on random assignment for load balancing by instead picking two servers at random, and assigning to the less loaded one.
It's an interesting topic in itself, but there's also ways to combine it with consistent hashing / rendezvous hashing.
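For a feel of how little code the 'two choices' trick needs, a toy sketch (the in_flight counter on each server object is an assumption of mine, not a real API):

    import random

    def pick_server(servers: list) -> object:
        # Sample two distinct servers uniformly at random and route to the one
        # with fewer in-flight requests; compared with a single random choice,
        # the worst-loaded server ends up carrying far less excess load.
        a, b = random.sample(servers, 2)
        return a if a.in_flight <= b.in_flight else b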
I'll also second that rendezvous hashing suggestion. The article mentions that it has O(n) lookup time, where n is the number of nodes. I made a library[1] which makes rendezvous hashing more practical for a larger number of nodes (or weight shares), bringing it down to O(1) amortized running time with a small tradeoff: distributed elements are pre-aggregated into clusters (slots) before being passed through HRW.
[1]: https://pkg.go.dev/github.com/SenseUnit/ahrw
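Not the ahrw API, just my sketch of the general slot trick: hash keys into a fixed number of slots, and only pay the O(n) HRW scan once per slot whenever the node set changes:

    import hashlib

    def _h64(s: str) -> int:
        return int.from_bytes(hashlib.blake2b(s.encode()).digest()[:8], "big")

    def build_slot_table(nodes: list[str], num_slots: int = 4096) -> list[str]:
        # Run the O(n) HRW scan once per slot; rebuild only when the node set
        # changes. Per-key lookups below then touch a single table entry.
        return [max(nodes, key=lambda n: _h64(f"{n}:slot-{s}"))
                for s in range(num_slots)]

    def node_for_key(key: str, table: list[str]) -> str:
        return table[_h64(key) % len(table)]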
Does it really matter? Here, n is a very small number, practically a constant. I'd assume iterating over the n nodes is negligible compared to the other parts of a request to a node.
Yes, different applications have different trade-offs.
> if you are into load balancing, you might also want to look into the 'power of 2 choices'.
You can do that better if you don't use a random number for the hash, but instead flip a coin (well, check a bit of a hash of the hash), so that hash expansion works out cleanly.
This trick means that when you go from N -> N+1 buckets, all the keys that move go to the new (N+1)th bucket instead of being rearranged across all of them.
I saw this two decades ago and, after seeing your comment, felt like getting Claude to recreate what I remembered from back then & write a fake paper [1] out of it.
See the MSB check in the implementation.
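My rough guess at the kind of scheme this is, a linear-hashing-style split decided by one bit of the hash (not the code from the linked repo):

    def bucket_of(h: int, n: int) -> int:
        # With n buckets, look at the low k+1 bits of the hash, where
        # 2**k <= n < 2**(k+1); if that points at a bucket that doesn't exist
        # yet, fall back to the low k bits. Going from n to n+1 buckets splits
        # exactly one old bucket, and every key that moves lands in the new
        # bucket n -- the move is decided by a single bit of the hash.
        k = n.bit_length() - 1
        b = h % (1 << (k + 1))
        return b if b < n else h % (1 << k)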
That said, consistent hashes can't split ranges by traffic or popularity, so back when I worked on this, the Membase protocol used capacity & traffic load to split the virtual buckets across real machines.
Hot partition rebalancing is hard with a fixed algorithm.
[1] - https://github.com/t3rmin4t0r/magic-partitioning/blob/main/M...
> This trick means that when you go from N -> N+1 buckets, all the keys that move go to the new (N+1)th bucket instead of being rearranged across all of them.
Isn't that how rendezvous hashing (and consistent hashing) already work?
The typo is really, really bothering me, because future generations would not be able to search for it.
You can get things like this fixed with the Contact link at the bottom of the page (I just emailed them about it).
It's so much better to copy and paste the title of articles.
They seem to have fixed the title. It looks wrong only here on HN now.
Nice, thanks Dang.
Can't mention this without mentioning Akamai co-founder Danny Lewin, who had a sad ending.
https://en.wikipedia.org/wiki/Daniel_Lewin
Wow, I didn't know this history about Akamai. Thanks for mentioning it; interesting as a former Linode guy and a fan of consistent hashing.
Ceph storage uses a hierarchical consistent hashing scheme called "CRUSH" to handle hierarchical data placement and replication across failure domains. Given an object ID, its location can be calculated, and the expected service queried.
As a side effect, it's possible to define a logical topology that reflects the physical layout, spreading data across hosts, racks, or by other arbitrary criteria. Things are exactly where you expect them to be, and there's very little searching involved. Combined with a consistent view of the cluster state, this avoids the need for centralized lookups.
The original paper is a surprisingly short read: https://ceph.com/assets/pdfs/weil-crush-sc06.pdf DOI: 10.1109/SC.2006.19
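Not Ceph's actual CRUSH (which has weighted bucket types, retry logic, and tunables), but the flavor of hierarchical, calculable placement fits in a toy sketch with a made-up topology:

    import hashlib

    # Toy topology: racks -> hosts (invented for illustration).
    TOPOLOGY = {
        "rack1": ["host1", "host2"],
        "rack2": ["host3", "host4"],
        "rack3": ["host5", "host6"],
    }

    def _score(bucket: str, obj_id: str) -> int:
        return int.from_bytes(
            hashlib.blake2b(f"{bucket}:{obj_id}".encode()).digest()[:8], "big")

    def place(obj_id: str, replicas: int = 2) -> list[str]:
        # Pick `replicas` distinct racks by hash score, then one host per rack:
        # replicas land in different failure domains, and any client holding
        # the same topology computes the same placement with no lookup service.
        racks = sorted(TOPOLOGY, key=lambda r: _score(r, obj_id), reverse=True)
        return [max(TOPOLOGY[r], key=lambda h: _score(h, obj_id))
                for r in racks[:replicas]]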
A final mention of the “simplifying” Lamping-Veach algorithm would have been great: https://arxiv.org/ftp/arxiv/papers/1406/1406.2294.pdf?ref=fr...
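For reference, the whole algorithm is tiny; a Python transcription of the C++ in the paper (from my reading of it, so double-check against the original):

    def jump_consistent_hash(key: int, num_buckets: int) -> int:
        # Lamping & Veach's jump consistent hash: no memory, O(ln n) expected
        # time, and going from n to n+1 buckets moves only ~1/(n+1) of keys.
        b, j = -1, 0
        while j < num_buckets:
            b = j
            key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
            j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
        return b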
https://www.metabrew.com/article/libketama-consistent-hashin...
The Ketama implementation of consistent hashing is really intuitive and battle-tested.
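The core lookup is just a sorted ring of hashed points per server plus a binary search. A rough sketch of that shape (not libketama itself, which slices its points out of md5 digests and weights servers by memory):

    import bisect
    import hashlib

    def _point(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

    class Ring:
        # Each server contributes many points on a ring; a key is served by the
        # first point clockwise of its own hash. Adding or removing a server
        # only moves the keys adjacent to that server's points.
        def __init__(self, servers: list[str], points_per_server: int = 160):
            self._ring = sorted((_point(f"{s}-{i}"), s)
                                for s in servers
                                for i in range(points_per_server))
            self._points = [p for p, _ in self._ring]

        def get(self, key: str) -> str:
            i = bisect.bisect(self._points, _point(key)) % len(self._ring)
            return self._ring[i][1]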
s/Constitent/Consistent/
Unless it's a clever play on "consistent", that is. In which case: carry on.
Shameless plug for my super simple consistent-hashing implementation in Clojure: https://github.com/ryuuseijin/consistent-hashing
Was the HN-post title also hashed? (It’s inconstitent with the actual title)
I've implemented a cache-line-aware version (from a paper) of a persistent, consistent hashing algorithm that gets pretty good performance on SSDs:
https://github.com/chiefnoah/mehdb
It's used as the index for a simple KV store I did as an interview problem a while back; it pretty handily does 500k inserts/s and 5M reads/s, and it's nothing special (basic write coalescing, append-only log):
https://git.sr.ht/~chiefnoah/keeeeyz/tree/meh
seems worth fixing the spelling mistake here - this is a consistent hashing post (currently "constitent hashing")
Another strategy to avoid redistribution is simply having a big enough number of partitions and assigning ranges instead of single partitions. It's a bit more complex on the coordination side but works well in other domains (distributed processing, for example).
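A rough sketch of that shape (partition count and names are invented for illustration): fix the partition count up front, hash keys to partitions forever, and move whole ranges of partitions between nodes when rebalancing:

    import hashlib

    NUM_PARTITIONS = 1024  # fixed up front, much larger than the node count

    def partition_of(key: str) -> int:
        h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:8], "big")
        return h % NUM_PARTITIONS  # never changes as nodes come and go

    def assign_ranges(nodes: list[str]) -> dict[range, str]:
        # Hand contiguous ranges of partitions to nodes. Rebalancing means
        # reassigning whole ranges (the coordination part); key placement
        # itself never has to be recomputed.
        per_node, extra = divmod(NUM_PARTITIONS, len(nodes))
        table, start = {}, 0
        for i, node in enumerate(nodes):
            end = start + per_node + (1 if i < extra else 0)
            table[range(start, end)] = node
            start = end
        return table

    def node_for_key(key: str, table: dict[range, str]) -> str:
        p = partition_of(key)
        return next(node for r, node in table.items() if p in r)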
Is it just me or can you describe the whole scheme in one sentence?
tl;dr: subdivide your hash space (say, [0, 2^64)) by the number of slots, then utilize the index of the slot your hash falls in.
Or, in another sense: rely on / rather than % for distribution.
Is this accurate or am I missing something?
You're missing that the hash space is not divided uniformly, which means one can vary the number of slots without recomputing the hash space division -- and without reassigning all of the existing entries.
I must've totally misunderstood what I read then. I'll give it another read, thanks!
That's the naive method which tends to redistribute most objects when the number of slots changes.
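A quick way to see the difference: count how many keys change slots under plain modulo when you go from 10 to 11 slots:

    import hashlib

    def h(key: str) -> int:
        return int.from_bytes(hashlib.blake2b(key.encode()).digest()[:8], "big")

    keys = [f"key-{i}" for i in range(100_000)]
    moved = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
    print(f"{moved / len(keys):.0%} of keys change slots going from 10 to 11")
    # roughly 91% move (about n/(n+1)); a consistent scheme moves only ~1/(n+1)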