From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

(news.future-shock.ai)

156 points | by future-shock-ai 5 days ago

16 comments

coppsilgold 3 days ago
There are also interesting approaches to more directly compress a large document or an entire codebase into a smaller set of tokens without getting the LLM to wing it. For example, Cartridges: <https://hazyresearch.stanford.edu/blog/2025-06-08-cartridges>
They basically get gradient descent to optimize the KV cache while freezing the network.
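A toy sketch of that idea, for illustration only: a single attention read whose key/value slots are the only trainable parameters, while the queries and targets (standing in for the frozen network and the behavior it should reproduce) stay fixed. All numbers and names here are made up, and this is not the training recipe from the Cartridges post, just the "optimize the cache, freeze everything else" shape of it.

```python
import math

# Frozen side: fixed queries and the outputs we want the cache to reproduce.
# These values are hypothetical and chosen only for illustration.
QUERIES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
TARGETS = [[0.5, -0.5], [-1.0, 1.0], [0.2, 0.3]]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head attention read: softmax(q . k_j) weighted sum of values."""
    weights = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return weights, out


def loss_and_grads(keys, values):
    """Mean squared error over all queries, with gradients for keys and values."""
    n = len(QUERIES)
    loss = 0.0
    gk = [[0.0] * len(k) for k in keys]
    gv = [[0.0] * len(v) for v in values]
    for q, t in zip(QUERIES, TARGETS):
        w, out = attend(q, keys, values)
        loss += sum((o - ti) ** 2 for o, ti in zip(out, t)) / n
        g = [2.0 * (o - ti) / n for o, ti in zip(out, t)]  # dLoss/d out
        g_dot_out = sum(gi * oi for gi, oi in zip(g, out))
        for j in range(len(keys)):
            g_dot_v = sum(gi * vi for gi, vi in zip(g, values[j]))
            d_score = w[j] * (g_dot_v - g_dot_out)  # dLoss/d score_j
            for d in range(len(q)):
                gk[j][d] += d_score * q[d]
            for d in range(len(g)):
                gv[j][d] += w[j] * g[d]
    return loss, gk, gv


def train_cache(steps=300, lr=0.1):
    """Plain gradient descent on the two cache slots; nothing else is updated."""
    keys = [[0.1, -0.1], [-0.1, 0.2]]
    values = [[0.0, 0.1], [0.1, 0.0]]
    initial, _, _ = loss_and_grads(keys, values)
    for _ in range(steps):
        _, gk, gv = loss_and_grads(keys, values)
        keys = [[k - lr * g for k, g in zip(kr, gr)] for kr, gr in zip(keys, gk)]
        values = [[v - lr * g for v, g in zip(vr, gr)] for vr, gr in zip(values, gv)]
    final, _, _ = loss_and_grads(keys, values)
    return initial, final
```

Calling `train_cache()` returns the loss before and after training, so you can check that the two learned slots move toward the target outputs.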
refulgentis 3 days ago
Good prose, but it keeps collapsing distinct layers of the stack into one poetic notion of “memory.” KV cache, prompt caching, product-level saved memory, transcript storage, retrieval, summarization, and long-context failure modes are different mechanisms with different failure modes. Once those boundaries disappear, you get lines like “API pricing is the price of remembering.” Evocative, sure. Explanatory, not really.
Same thing in the technical bits.
“Computation drops from quadratic to linear” is only narrowly true for incremental decoding after the prefix is already processed.
“When the KV cache gets too large, the standard solution is compaction” is worse: the standard responses are boring systems tricks like limits, eviction, paging/offload, compression, etc. Summarization is usually an application workaround where you throw away old text and replace it with a shorter prompt. The cache never became a summary; the prompt did.
So I wouldn’t call the piece wrong so much as aggressively smooth. It knows the vocabulary, but it keeps letting metaphor outrun mechanism.
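To put rough numbers on the "quadratic to linear" point, here is a back-of-the-envelope count of query-key score computations for one causal attention head, with and without a KV cache. It is a counting sketch under simplifying assumptions (one head, one layer, no batching), and the function names are mine.

```python
def causal_scores(seq_len: int) -> int:
    """Query-key scores for a full causal pass: token i attends to i tokens."""
    return seq_len * (seq_len + 1) // 2


def scores_without_cache(prefix_len: int, new_tokens: int) -> int:
    """Re-run the whole sequence for every generated token."""
    lengths = range(prefix_len, prefix_len + new_tokens)
    return sum(causal_scores(length) for length in lengths)


def scores_with_cache(prefix_len: int, new_tokens: int) -> int:
    """Process the prefix once, then one query per appended token."""
    prefill = causal_scores(prefix_len)
    # Each token appended after the prefix attends to every position so far,
    # which is linear in the current context length.
    decode = sum(range(prefix_len + 1, prefix_len + new_tokens))
    return prefill + decode
```

With the cache, each decode step is linear in the current context length, but the prefill is still quadratic and the total over a whole generation matches one causal pass over the final sequence, so it still grows quadratically with total length.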
- nstj 2 days ago
  concur - a lot of the article was useful but a lot of it was "sorta the right stuff in sorta the wrong place"
LuxBennu 3 days ago
good overview of the architecture side but worth mentioning there's another axis that stacks on top of all of this: you can quantize the kv cache itself at inference time. in llama.cpp you can run q8 for keys and q4 for values and it cuts cache memory roughly in half again on top of whatever gqa or mla already saves you. i run qwen 70b 4-bit on m2 max 96gb and the kv quant is what actually made longer contexts fit without running out of unified memory. keys need more precision because they drive attention scores but values are way more tolerant of lossy compression, so the asymmetry works out.
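A rough per-token calculation of what that asymmetry buys. The byte costs assume ggml's q8_0 and q4_0 block layouts (32 values plus one fp16 scale per block, so 34 and 18 bytes per 32 elements). The model shape (80 layers, 8 KV heads, head dim 128) is an assumption for a GQA model in the 70B class, not something stated in the comment above.

```python
from fractions import Fraction

# Bytes per cached element, assuming ggml block layouts:
# q8_0 = 32 int8 values + fp16 scale, q4_0 = 32 4-bit values + fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": Fraction(2),
    "q8_0": Fraction(34, 32),
    "q4_0": Fraction(18, 32),
}


def kv_bytes_per_token(k_type: str, v_type: str,
                       n_layers: int = 80,
                       n_kv_heads: int = 8,
                       head_dim: int = 128) -> int:
    """KV cache bytes for one token; the default shape is a hypothetical GQA model."""
    elements = n_layers * n_kv_heads * head_dim  # per tensor (K or V)
    total = elements * (BYTES_PER_ELEMENT[k_type] + BYTES_PER_ELEMENT[v_type])
    return int(total)


if __name__ == "__main__":
    full = kv_bytes_per_token("f16", "f16")
    mixed = kv_bytes_per_token("q8_0", "q4_0")
    print(f"f16/f16: {full} bytes, q8_0/q4_0: {mixed} bytes, ratio {mixed / full:.2f}")
```

Under these assumptions the mixed q8_0/q4_0 cache comes out at about 41% of the f16 size, so a bit better than half.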
- suprjami 3 days ago
  Some models really suffer badly from KV quantisation. You can also take a speed hit using dissimilar K and V types.
  TurboQuant seems to be the next big thing in context memory usage. Polar coordinates achieving ~5x reduction in memory usage with minimal/no quality loss, and even a slight speedup in some cases.
  - LuxBennu 2 days ago
    yeah fair point, it's definitely model dependent. i've had good results with qwen but tried it on a smaller mistral variant once and the output quality dropped noticeably even at q8 for both. the speed hit from mixed types hasn't been bad on apple silicon in my experience but i can see it mattering more on cuda.
  - hrmtst93837 2 days ago
    TurboQuant looks slick, but aggressive KV quantization can wreck attention-sensitive tasks, and summarization tends to degrade long before token-accuracy benchmarks expose the damage. Then you get ghost bugs.
az09mugen 3 days ago
Unrelated, but 69KB is how much RAM Voyager 1 has.
- gregman1 3 days ago
  Voyager as a token of curiosity
sachamorard 2 days ago
The compaction problem described here is worse than it looks because of the asymmetry between the compactor and the reader. The model doing the compaction has full access to everything: it can see all six rules in the policy, the exact budget figure, every constraint. The model reading the summary has no reference point to notice what's missing. There's no checksum on memory.
The article mentions the void between volatile KV cache and permanent weights. One thing that lives in that void: compression results. At Edgee we cache prompt compression outputs in a globally distributed KV store specifically to avoid recomputing them on every request. It maps naturally to the architecture: the cache is already the right abstraction, and you're just caching one layer higher.
The interesting property is that compression results for similar contexts are often reusable across sessions, which the KV cache itself never is. The Greg Egan framing is apt. The trajectory from MHA to GQA to MLA reads exactly like a series of decisions about what's worth remembering in full fidelity vs. what can be abstracted. The difference is Egan's citizens chose their own compression ratios.
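A minimal sketch of "caching one layer higher", assuming the simplest possible design: compression outputs keyed by a hash of the exact prompt text, with an in-memory dict standing in for the distributed store. `CompressionCache` and `toy_compress` are hypothetical names for illustration and do not describe Edgee's actual system. Exact-match keys also only cover identical contexts; reuse across merely similar contexts would need a different keying scheme.

```python
import hashlib
from typing import Callable, Dict


class CompressionCache:
    """Memoize prompt-compression results by content hash (illustrative only)."""

    def __init__(self, compress: Callable[[str], str]) -> None:
        self._compress = compress
        self._store: Dict[str, str] = {}  # stand-in for a distributed KV store
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self._store:
            # Only pay for compression the first time this exact text is seen.
            self.misses += 1
            self._store[key] = self._compress(prompt)
        return self._store[key]


def toy_compress(prompt: str) -> str:
    """Placeholder compressor: collapse whitespace and keep the first 8 words."""
    return " ".join(prompt.split()[:8])
```

Because the key depends only on the text, a second session sending the same context gets the stored result back without recomputing it.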
jasonjmcghee 2 days ago
> OpenAI applies it automatically and charges 50% less for cache hits
This is incorrect. It's 90% cheaper.
https://developers.openai.com/api/docs/pricing