NUMA can cause really crappy performance. We deployed a Go-based LLM gateway in Kubernetes on a server with hundreds of CPU cores. We didn't explicitly set GOMAXPROCS, so the Go runtime scheduled goroutines across different CPUs; it constantly used 200% CPU and GC was causing latency spikes. Then we set GOMAXPROCS=8 and all the performance issues went away. Until recently Kubernetes didn't work well with NUMA.
Heck, we saw crazy performance degradation with Redis when its memory usage exceeded a single NUMA node. Not much to be done about that at the k8s level when Redis is single-threaded. You have to be super conscious of the underlying hardware at that point.
> Kubernetes deployed on a server with hundreds of CPU cores
Was that a Power9 or some sort of IBM machine?
Not all NUMA is the same; ccNUMA from Intel is a different beast from the PPC version of the same.
Is this on AMD? I wonder if it's all to do with NUMA or their CCD architecture etc (well these days Intel and everyone also does it to some extent).
Intel suffers just as much when NUMA enters the picture, even prior to CCD-style architecture. That extra latency hop across to the other core to get at memory is absolutely crippling, especially in a hot loop. It requires very careful handling, while being this kind of invisible element (unless you know to look for it, nothing will draw your attention to it).
Hundreds of cores is likely two sockets and so you've got NUMA there.
Scaling to large core counts has a lot of gotchas.
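The GOMAXPROCS story above is an instance of a more general pitfall: a runtime sizing its worker pool from the machine's total core count. Here is a minimal sketch of the alternative in Python, sizing from the CPUs the process is actually allowed to run on and applying an explicit cap. The function names and the default cap of 8 (borrowed from the anecdote) are illustrative assumptions, and the affinity mask does not reflect cgroup CPU quotas, only pinning.

```python
import os


def usable_cpus() -> int:
    """CPUs this process may run on, which can be far fewer than the machine total."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # fallback: total logical CPUs


def worker_count(cap: int = 8) -> int:
    """Size a worker pool from usable CPUs, bounded by an explicit cap (hypothetical default)."""
    return max(1, min(usable_cpus(), cap))


if __name__ == "__main__":
    print(f"machine CPUs: {os.cpu_count()}, usable: {usable_cpus()}, workers: {worker_count()}")
```

The cap matters as much as the affinity check: on a box with hundreds of cores, an unpinned process still sees all of them.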
There is one instance where the NUMA performance never disappoints: https://www.youtube.com/watch?v=Cqd1Gvq-RBY
There are in fact two instances https://www.youtube.com/watch?v=ZBKm1MBsTbk
NUMA is one of those amazing things that trip you up in all sorts of ways at unexpected times. The amazing "invisible" performance killer (invisible because unless you're already aware of NUMA, or remember to check, you won't know it's there, potentially crippling you).
It has been a source of routine conversations with customers and engineers of all kinds, and often one of those things you don't know about until too late.
I don't know if the kernel has improved this behaviour in the several years since we last tested, but a coworker realised that the Linux page cache wasn't fully split by NUMA node. They were benchmarking MySQL, running it in each NUMA node, and noticed the second NUMA node was noticeably slower. Then they discovered that after a reboot the second node was fast and the first was slower. After a bit of thinking and tinkering they worked out that libmysql was ending up in the page cache on whichever NUMA node the benchmark client was first run in, so even though they were pinning the benchmark tool and the MySQL process to one NUMA node, the benchmark client was causing the OS to reach across to the other NUMA node to get at the page-cached library.
It's unfortunately a hard message to sell, but your program itself should never be backed by a file, for this exact reason. All code should be remapped into anonymous memory that is 1) right where you want it; 2) backed by hugepages; and 3) not shared. You are gaining nothing by trying to share objects with other processes, and you are losing a great deal of performance. The tradeoff made a little sense in the 1980s when they came up with it but in this century it simply doesn't.
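One way to check for the situation described above is `/proc/<pid>/numa_maps`, which on Linux lists per-mapping page counts by node as `N<node>=<pages>` fields. A rough sketch of summing those counts for one library follows; the sample text and the library name `libexample` are made up for illustration, and on a real system you would read the file for the process in question instead.

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text: str, path_fragment: str) -> Counter:
    """Sum pages per NUMA node over file-backed mappings whose path contains path_fragment."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        fields = line.split()
        file_field = next((f for f in fields if f.startswith("file=")), None)
        if file_field is None or path_fragment not in file_field:
            continue  # anonymous mapping or a different file
        for field in fields:
            match = NODE_FIELD.match(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return totals


# Made-up sample in the numa_maps layout; real input comes from /proc/<pid>/numa_maps.
SAMPLE = """\
7f1c00000000 default file=/usr/lib/libexample.so.1 mapped=40 N0=38 N1=2 kernelpagesize_kB=4
7f1c00400000 default anon=12 dirty=12 N1=12 kernelpagesize_kB=4
7f1c00800000 default file=/usr/lib/libexample.so.1 mapped=10 N0=10 kernelpagesize_kB=4
"""

if __name__ == "__main__":
    print(dict(pages_per_node(SAMPLE, "libexample")))
```

If a process pinned to node 1 shows nearly all of a hot library's pages on node 0, that matches the cross-node page cache effect in the anecdote.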
Something I didn’t see mentioned is that this unequal memory access time also affects PCIe I/O. If your thread on CPU A needs to get data in or out of a NIC attached to CPU B, your throughput and latency will be impacted.
We have to explain this to customers of our software all the time; it’s something that’s easy to miss.
Same. The drop in performance can be surprisingly bad. 10Gbps becomes 5Gbps. 100Gbps becomes 20Gbps.
When building Edera (product from article), I also had the added problem of the virtual networking gap where I was bridging a 10Gbit NIC over a virtual interface, and I had weird performance bouncing between 3Gbit and the full 10Gbit. Luckily I had built networking drivers before and knew the complexities of it, and managed to profile it down to the virtual interface getting worst-case NUMA occasionally.
The part 2 is going to cover how we actually solved it, which involves every part of the system having knowledge. It's so easy to ignore but it has a massive impact on perf.
(CTO of Edera here)
Great point! We also try to factor that in as well.
Steven (the author) will cover that in part 2!
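For the NIC locality point in this subthread, Linux exposes the node a PCI network device is attached to at `/sys/class/net/<iface>/device/numa_node` (it reads `-1` when the platform does not report one, and the file is absent for purely virtual interfaces). A small sketch of checking whether a thread's node matches the NIC's node is below; the helper names and the `sysfs_root` parameter are illustrative assumptions, the latter there only so the function can be pointed at a fake tree.

```python
from pathlib import Path
from typing import Optional


def nic_numa_node(iface: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node a NIC is attached to, or None if unknown."""
    path = Path(sysfs_root) / "class" / "net" / iface / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # no such interface, virtual device, or unreadable value
    return node if node >= 0 else None  # the kernel reports -1 for "no node"


def is_cross_node(iface: str, thread_node: int, sysfs_root: str = "/sys") -> Optional[bool]:
    """True if the thread's node differs from the NIC's node; None if the NIC node is unknown."""
    nic_node = nic_numa_node(iface, sysfs_root)
    if nic_node is None:
        return None
    return nic_node != thread_node


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    print(nic_numa_node("eth0"))
```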
Complete slop from start to finish. Some of the distinctions manufactured by the LLM author are nonsensical: “a thread may run on node 0 but access data on node 1. Conversely, it could also run on node 1 but access data on node 0.” These cases are different how? Or “there are 3 possible cases. A thread might run on the right node but access data on the wrong node. Or it might access data on the right node but run on the wrong node. Or both.” WTF could this possibly mean? This hallucinated nonsense logic wastes the reader's time, and whoever posted it, and whoever submitted it anyway, should be ashamed of themselves. The prevalence of this garbage just makes it harder to find accurate sources on topics of interest. They are certainly out there, but it’s getting harder and harder to find them.
I'm baffled by the fact that NUMA is still an issue in 2026. My impression is that this was all solved back in the dotcom era already on those big Suns. At least in HPC we solved this already in the mid 2000s. Why is the supposedly modern world still wasting time on this? The kernel these days exposes just about everything you would ever want to know about a system's topology, and every runtime should be making use of that information. If it does not, I cannot consider it ready for this century.
Because NUMA topology is an optimisation problem with a wide solution space, and its configuration and setup depend on the number of physical CPUs and cores, how the RAM is connected to which lanes, and on and on it goes.
> If it does not, I cannot consider it ready for this century.
Mhmm.
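As a concrete instance of the kernel exposing topology, Linux lists each node's CPUs in `/sys/devices/system/node/node<N>/cpulist` as comma-separated ranges such as `0-15,32-47`. Here is a minimal sketch of reading that into a node-to-CPUs map; the function names and the `sysfs_root` parameter are illustrative assumptions. Reading the map is the easy part, and deciding what to do with it is the optimisation problem the reply describes.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-3,8,10-11' into individual CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def node_cpus(sysfs_root: str = "/sys") -> dict[int, list[int]]:
    """Map each NUMA node number to its CPUs; empty if the node directory is absent."""
    base = Path(sysfs_root) / "devices" / "system" / "node"
    nodes: dict[int, list[int]] = {}
    for node_dir in base.glob("node[0-9]*"):
        nodes[int(node_dir.name[len("node"):])] = parse_cpulist(
            (node_dir / "cpulist").read_text()
        )
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(node_cpus().items()):
        print(f"node {node}: {len(cpus)} CPUs")
```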
Yeah, when you have tall servers this can be a really surprising factor. In some sense you could view this as an extension of processor caching behaviors, which also cause some memory accesses to be slower, just due to cache behavior rather than physical location. But in many cases, the same tools can be used to fight both "far" memory accesses and cache thrashing, by using a thread-isolated architecture.
I have been dealing with the topic for a few years now and it was surprisingly hard to track down the bottlenecks to actual numbers. Some time ago I managed to find a good example to demonstrate the effect in a tangible way and wrote up an article about it. If the topic sounds interesting, you might enjoy https://sander.saares.eu/2025/03/31/structural-changes-for-4... (Structural changes for +48-89% throughput in a Rust web service).
My question: why do mainstream users tolerate NUMA? 99% of you don't need to. Single-socket servers exist and they are not only tolerable but better in most ways. Dealing with NUMA in software consists of trying to logically partition the machine, but you can instead physically partition the machine. It's so much simpler!
Amazon gets this. Except for the 4th generation, their Graviton systems are not NUMA.
NUMA latencies across machines are way worse than across sockets or across core complexes. :p
Single socket doesn't necessarily get you away from NUMA anyway. AMD server sockets are 4-way NUMA (you can set it for interleaving, but you could do better with NUMA-aware software), and I think Intel is doing NUMA on server sockets as well.
A lot of people like to take one big machine and partition it into several smaller virtual machines. In that case, it shouldn't be too hard to partition VMs into NUMA zones? Only VMs that are too big to fit in one zone have to worry about it (or that need to be repacked into a different zone).
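The partitioning idea above can be sketched as a simple packing problem. This is a toy first-fit-decreasing placement, assuming only core counts matter; a real placer would also have to account for memory per node and device locality. The VM names and sizes are hypothetical.

```python
def place_vms(vm_cores: dict[str, int], node_cores: list[int]) -> tuple[dict[str, int], list[str]]:
    """Assign each VM to one NUMA node, largest first; return (placement, VMs that fit no single node)."""
    free = list(node_cores)  # remaining cores per node
    placement: dict[str, int] = {}
    spanning: list[str] = []
    for name, need in sorted(vm_cores.items(), key=lambda kv: kv[1], reverse=True):
        for node, available in enumerate(free):
            if need <= available:
                free[node] -= need
                placement[name] = node
                break
        else:
            spanning.append(name)  # too big for any one node: must span or be repacked
    return placement, spanning


if __name__ == "__main__":
    # Hypothetical host with two 16-core nodes and four VMs.
    placement, spanning = place_vms({"a": 12, "b": 8, "c": 6, "d": 24}, [16, 16])
    print(placement)  # {'a': 0, 'b': 1, 'c': 1}
    print(spanning)   # ['d']
```

Only `d`, which is larger than any single node, ends up needing NUMA awareness, which is the commenter's point.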
AMD's NPS4 mode isn't exactly user-friendly, I agree. But you can put it into NPS1 mode and relax. Graviton 5, as a counterpoint, doesn't give you the option. Physically there is a 2D mesh between the cores and the memory controller but the observable behavior is that every access gets the average mesh fabric latency. The efficiency you leave on the table isn't very large, whereas in multisocket NUMA you can't ignore the cost.
I think you can over-analyze this stuff and lose your sanity. On these multicore systems there are also hot cores in the center of the mesh and cold ones at the edges and theoretically you could be doing temperature-aware scheduling, gaining a bit more efficiency in doing so. But it's just easier to adopt the black box model of spherical frictionless CPUs.