I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.
Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.
One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.
On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.
Some people are perverse individuals and actually enjoy debugging very esoteric things. What might be frustrating to you might be the very thing that gets someone else very excited.
It feels like this comment was almost a purely additive anecdote of your own experience with a similar kind of issue, but you've spoiled it by deciding to tell the author that they're incorrect about how they felt during the process?
That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.
Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it
You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.
So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.
I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.
Still, I find great that Go got back the 1990's tradition that compiled languages have an assembler as part of their tooling, regardless of the syntax.
I've never heard of that rule (though tbh I'm not allocating > 64KB of stack when I'm in assembly) and it seems Google hasn't either. While I'm sure it makes sense, I don't think I've ever seen that be enforced. At least in C/C++. Maybe it makes more sense for these stack inspecting garbage collectors but I've also heard of ones that just scan the stack without unwinding anything. I did a test asking Google's AI to generate a complicated C function, put it in godbolt, and there's plenty of push push push push ..... Pop Pop Pop Pop going on
Did you compile with optimisations? I think GCC will do a bunch of activity on the stack with -O0, but it'll generally coalesce everything into one push/pop per function with optimisations (not because of any rule, but just because it's faster). alloca and other dynamic stack allocation may break this, but normal variables should in pretty much all just get turned into one block on the stack (with appropriate re-use of space if variable lifetimes don't overlap)
Yeah but we have codegen bugs in .NET as well. The biggest difference that stood out to me in this write up, is we would have gone straight for “coredump” instead of other investigation tools. Our default mode of investigating memory corruption issues is dumps.
Sure, I have experienced them, e.g. once in 2006 using IBM's JVM implementation with Websphere.
However it is probably not as problematic due to the way Go allows for Assembly being used directly.
While the JVM and CLR don't allow for direct access to Assembly code, Go does, thus I assume expecting safepoints everywhere is not an option, as any subroutine call can land on code that was manually written.
I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.
I guess those that wrote the preemption were on X86 where this doesn't happen thanks to variable length instructions being able to hold the constant and thus relied on the code-gen to do it atomically, then the ARM port had an automatic "split" from a higher level to make things "easy" thus giving us this bug.
Uh, the fault is entirely in writing an assembler _that is not an assembler_, but rather something that is _almost_ like one but then 1% like an IR instead. It's an unforced error.
Excellent article as always from the cloudflare blog - engineering without magic infrastructure and ml. One day I will apply !
Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.
Similar to the previous commenter, every time I read a blog post from Cloudflare I end up checking the careers page thinking "this is exactly the kind of work I'd like to be doing". Sadly no openings in my country. I'll keep checking!
Unfortunately, in 95% cases location IS a factor with bigger companies.
I'm in a similar position where I'd like to do something a lot more interesting, but intersection between where the interesting companies have offices and where I'd be willing to live do not really overlap enough justify rooting up my life.
(Unless we're talking about "too good to ignore", that's a different story.)
Seemed like a red herring. They were able to reproduce it without any libraries. Might have just been net link forcing the stacks to a certain size and that made the bug visible.
The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.
Neither of those are "crazy shit." It's just complex because the environment offers specific features like automatic GC with async preemption in a compiled language which pretty much requires it.
Complex engineering isn't something to be avoided by default.
Sorry, how exactly do you think compilers are supposed to work if not by 'writing [their] own assembler'? Someone has to write the assembler, and different compilers have different needs.
The general wisdom is that you shouldn't do this stuff yourself, and you should instead rely on tried and tested implementations. But sometimes you're the one who provides the tried and tested implementations. Implementing a compiled language is often one of those times.
Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.
Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
So it was. The article never mentions the frame pointer and I'm familiar with compilers that load the saved value from the stack in the epilog, rather than adjusting it arithmetically. But they do have an assembly listing showing the two-step arithmetic adjustment for both the stack pointer and frame pointer.
But I'm not sure that matters, because the unwind code they show uses the stack pointer rather than the frame pointer anyway.
> This was a very fun problem to debug.
I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.
Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.
One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.
On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.
Some people are perverse individuals and actually enjoy debugging very esoteric things. What might be frustrating to you might be the very thing that gets someone else very excited.
It feels like this comment was almost a purely additive anecdote of your own experience with a similar kind of issue, but you've spoiled it by deciding to tell the author that they're incorrect about how they felt during the process?
Maybe different people find different things fun.
Classic problem of non-atomic stack pointer modification.
Used to have a lot of fun with those 3 decades ago.
That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.
Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it
You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.
So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.
[1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...
I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.
> So another win for being able to read arm assembly.
Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!
The article could have mentioned the "stack moves once" rule.
It is due to the Plan 9 Assembly dialect most likely, because it wasn't enough that we already have differences between AT&T and Intel.
https://go.dev/doc/asm
Still, I find great that Go got back the 1990's tradition that compiled languages have an assembler as part of their tooling, regardless of the syntax.
I've never heard of that rule (though tbh I'm not allocating > 64KB of stack when I'm in assembly) and it seems Google hasn't either. While I'm sure it makes sense, I don't think I've ever seen that be enforced. At least in C/C++. Maybe it makes more sense for these stack inspecting garbage collectors but I've also heard of ones that just scan the stack without unwinding anything. I did a test asking Google's AI to generate a complicated C function, put it in godbolt, and there's plenty of push push push push ..... Pop Pop Pop Pop going on
You need to look at non-x86 architectures. It was common years ago on MIPS.
* https://jdebp.uk/FGA/function-perilogues.html#StandardMIPS
I wrote up the x86 equivalent of doing just two read-modify-write operations on the stack pointer over 16 years ago.
* https://jdebp.uk/FGA/function-perilogues.html#Standardx86
Did you compile with optimisations? I think GCC will do a bunch of activity on the stack with -O0, but it'll generally coalesce everything into one push/pop per function with optimisations (not because of any rule, but just because it's faster). alloca and other dynamic stack allocation may break this, but normal variables should in pretty much all just get turned into one block on the stack (with appropriate re-use of space if variable lifetimes don't overlap)
Yes
Usually in runtimes like Java and .NET there are safepoints exactly to avoid changing context in the middle of a set of instructions.
Yeah but we have codegen bugs in .NET as well. The biggest difference that stood out to me in this write up, is we would have gone straight for “coredump” instead of other investigation tools. Our default mode of investigating memory corruption issues is dumps.
Sure, I have experienced them, e.g. once in 2006 using IBM's JVM implementation with Websphere.
However it is probably not as problematic due to the way Go allows for Assembly being used directly.
While the JVM and CLR don't allow for direct access to Assembly code, Go does, thus I assume expecting safepoints everywhere is not an option, as any subroutine call can land on code that was manually written.
I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.
Always adjust your stack pointer atomically, kids.
I guess those that wrote the preemption were on X86 where this doesn't happen thanks to variable length instructions being able to hold the constant and thus relied on the code-gen to do it atomically, then the ARM port had an automatic "split" from a higher level to make things "easy" thus giving us this bug.
Nobodys fault really, but bad results ensued.
> Nobodys fault really, but bad results ensued.
Uh, the fault is entirely in writing an assembler _that is not an assembler_, but rather something that is _almost_ like one but then 1% like an IR instead. It's an unforced error.
Exactly what ran through my mind.
Excellent article as always from the cloudflare blog - engineering without magic infrastructure and ml. One day I will apply !
Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.
What's stopping you applying today?
Low compensation relative to many other companies. (It didn't stop me from applying, but I stopped me from accepting.)
Similar to the previous commenter, every time I read a blog post from Cloudflare I end up checking the careers page thinking "this is exactly the kind of work I'd like to be doing". Sadly no openings in my country. I'll keep checking!
Pretty sure location is not a factor for these companies. You should apply anyway. I’ve worked with people living in active war zones.
If you have the skills, they have the coin.
They won’t hire some react guy in X country but someone who can find compiler bugs and save them XX+ million dollars a year? Heck yeah.
Unfortunately, in 95% cases location IS a factor with bigger companies.
I'm in a similar position where I'd like to do something a lot more interesting, but intersection between where the interesting companies have offices and where I'd be willing to live do not really overlap enough justify rooting up my life.
(Unless we're talking about "too good to ignore", that's a different story.)
With seemingly the whole world rolling out new RTO mandates, location may not have been a factor recently, but may be lately.
Did they ever explain why netlink was involved? Or was that a red herring?
Seemed like a red herring. They were able to reproduce it without any libraries. Might have just been net link forcing the stacks to a certain size and that made the bug visible.
The stack in that specific function was big enough to trigger the bug.
The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.
Neither of those are "crazy shit." It's just complex because the environment offers specific features like automatic GC with async preemption in a compiled language which pretty much requires it.
Complex engineering isn't something to be avoided by default.
Sorry, how exactly do you think compilers are supposed to work if not by 'writing [their] own assembler'? Someone has to write the assembler, and different compilers have different needs.
The general wisdom is that you shouldn't do this stuff yourself, and you should instead rely on tried and tested implementations. But sometimes you're the one who provides the tried and tested implementations. Implementing a compiled language is often one of those times.
This^. Keith W on Dtrace blog said it a decade ago https://wesolows.dtrace.org/2014/12/29/golang-is-trash/
I like Go but I don't really like their NIH / replace everything with our stuff stance - esp on system tools like assemblers and linkers.
Really enjoyed reading this. Thanks for writing it!
For the impatient, here's the fix: https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c...
I noticed this when reviewing the linked issue: https://github.com/golang/go/issues/73259#issuecomment-31004...
Does the Go team have a natural language bot or is this just comment.contains(“backport”) type stuff?
The latter: https://github.com/golang/build/blob/master/cmd/gopherbot/go...
(found via https://go.dev/wiki/gopherbot)
Although also the former (gabyhelp): https://github.com/golang/oscar/tree/master/internal/gaby
I don't get it, how were the machine threads being stopped in thr middle of two instructions? This is baremetal, right?
go uses interrupts for GC notifications
Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.
Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.
The frame pointer was updated non-atomically in two asm ops. An async interruption between the two ops would lead to a corrupt frame pointer.
So it was. The article never mentions the frame pointer and I'm familiar with compilers that load the saved value from the stack in the epilog, rather than adjusting it arithmetically. But they do have an assembly listing showing the two-step arithmetic adjustment for both the stack pointer and frame pointer.
But I'm not sure that matters, because the unwind code they show uses the stack pointer rather than the frame pointer anyway.
solid research, looks like you'd make a great CRUD engineer.