We found a bug in Go's ARM64 compiler

(blog.cloudflare.com)

329 points | by jgrahamc 7 hours ago ago

49 comments

lordnacho 36 minutes ago ago
> This was a very fun problem to debug.
I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.
Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.
One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.
On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.
[-]
- dylan604 9 minutes ago ago
  Some people are perverse individuals and actually enjoy debugging very esoteric things. What might be frustrating to you might be the very thing that gets someone else very excited.
- akerl_ 16 minutes ago ago
  It feels like this comment was almost a purely additive anecdote of your own experience with a similar kind of issue, but you've spoiled it by deciding to tell the author that they're incorrect about how they felt during the process?
  Maybe different people find different things fun.
yalok a minute ago ago
Classic problem of non-atomic stack pointer modification.
Used to have a lot of fun with those 3 decades ago.
Neywiny 4 hours ago ago
That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.
Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it
[-]
- Veserv 3 hours ago ago
  You would normally use the “LDR Rd, =expr” pseudo-instruction form [1]. For immediates not directly constructible, it puts a copy of the immediate value in a PC-relative memory location, then does a PC-relative load into register.
  So that would turn the whole sequence of “add constant to SP” into 2 executable instructions, 1 for constructing immediate and 1 for adding for a total of 8 bytes, and a 4 byte data area for the 17-bit immediate for a total of 12 bytes of binary which is 3 executable instructions worth.
  [1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...
- pklausler 2 hours ago ago
  I'm a little surprised that this bug wasn't fixed in the assembler as a special case for immediate adds to RSP. If the patch was to the compiler only, other instances of the bug could be lurking out there in aarch64 assembly code.
- bloak 4 hours ago ago
  > So another win for being able to read arm assembly.
  Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!
  The article could have mentioned the "stack moves once" rule.
  [-]
  - pjmlp 4 hours ago ago
    It is due to the Plan 9 Assembly dialect most likely, because it wasn't enough that we already have differences between AT&T and Intel.
    https://go.dev/doc/asm
    Still, I find great that Go got back the 1990's tradition that compiled languages have an assembler as part of their tooling, regardless of the syntax.
  - Neywiny 3 hours ago ago
    I've never heard of that rule (though tbh I'm not allocating > 64KB of stack when I'm in assembly) and it seems Google hasn't either. While I'm sure it makes sense, I don't think I've ever seen that be enforced. At least in C/C++. Maybe it makes more sense for these stack inspecting garbage collectors but I've also heard of ones that just scan the stack without unwinding anything. I did a test asking Google's AI to generate a complicated C function, put it in godbolt, and there's plenty of push push push push ..... Pop Pop Pop Pop going on
    [-]
    - JdeBP 2 hours ago ago
      You need to look at non-x86 architectures. It was common years ago on MIPS.
      * https://jdebp.uk/FGA/function-perilogues.html#StandardMIPS
      I wrote up the x86 equivalent of doing just two read-modify-write operations on the stack pointer over 16 years ago.
      * https://jdebp.uk/FGA/function-perilogues.html#Standardx86
    - rcxdude 3 hours ago ago
      Did you compile with optimisations? I think GCC will do a bunch of activity on the stack with -O0, but it'll generally coalesce everything into one push/pop per function with optimisations (not because of any rule, but just because it's faster). alloca and other dynamic stack allocation may break this, but normal variables should in pretty much all just get turned into one block on the stack (with appropriate re-use of space if variable lifetimes don't overlap)
      [-]
      - Neywiny 27 minutes ago ago
        Yes
- pjmlp 4 hours ago ago
  Usually in runtimes like Java and .NET there are safepoints exactly to avoid changing context in the middle of a set of instructions.
  [-]
  - andygocke 3 hours ago ago
    Yeah but we have codegen bugs in .NET as well. The biggest difference that stood out to me in this write up, is we would have gone straight for “coredump” instead of other investigation tools. Our default mode of investigating memory corruption issues is dumps.
    [-]
    - pjmlp 2 hours ago ago
      Sure, I have experienced them, e.g. once in 2006 using IBM's JVM implementation with Websphere.
      However it is probably not as problematic due to the way Go allows for Assembly being used directly.
      While the JVM and CLR don't allow for direct access to Assembly code, Go does, thus I assume expecting safepoints everywhere is not an option, as any subroutine call can land on code that was manually written.
- titzer 4 hours ago ago
  I think the right fix is that the compiler should, e.g. load the constant into a register using two moves and then emit a single add. It's one more instruction, but then the adjustment is atomic (i.e. a single instruction). Another option is to do the arithmetic in a temp register and then move it back.
riobard 3 hours ago ago
What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.
dreamcompiler 4 hours ago ago
Always adjust your stack pointer atomically, kids.
[-]
- whizzter 3 hours ago ago
  I guess those that wrote the preemption were on X86 where this doesn't happen thanks to variable length instructions being able to hold the constant and thus relied on the code-gen to do it atomically, then the ARM port had an automatic "split" from a higher level to make things "easy" thus giving us this bug.
  Nobodys fault really, but bad results ensued.
  [-]
  - Sesse__ an hour ago ago
    > Nobodys fault really, but bad results ensued.
    Uh, the fault is entirely in writing an assembler _that is not an assembler_, but rather something that is _almost_ like one but then 1% like an IR instead. It's an unforced error.
- drob518 an hour ago ago
  Exactly what ran through my mind.
Agingcoder 4 hours ago ago
Excellent article as always from the cloudflare blog - engineering without magic infrastructure and ml. One day I will apply !
Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.
[-]
- jgrahamc 4 hours ago ago
  What's stopping you applying today?
  [-]
  - kccqzy 14 minutes ago ago
    Low compensation relative to many other companies. (It didn't stop me from applying, but I stopped me from accepting.)
  - nevon 4 hours ago ago
    Similar to the previous commenter, every time I read a blog post from Cloudflare I end up checking the careers page thinking "this is exactly the kind of work I'd like to be doing". Sadly no openings in my country. I'll keep checking!
    [-]
    - moomoo11 3 hours ago ago
      Pretty sure location is not a factor for these companies. You should apply anyway. I’ve worked with people living in active war zones.
      If you have the skills, they have the coin.
      They won’t hire some react guy in X country but someone who can find compiler bugs and save them XX+ million dollars a year? Heck yeah.
      [-]
      - Degorath 2 hours ago ago
        Unfortunately, in 95% cases location IS a factor with bigger companies.
        I'm in a similar position where I'd like to do something a lot more interesting, but intersection between where the interesting companies have offices and where I'd be willing to live do not really overlap enough justify rooting up my life.
        (Unless we're talking about "too good to ignore", that's a different story.)
      - stronglikedan 2 hours ago ago
        With seemingly the whole world rolling out new RTO mandates, location may not have been a factor recently, but may be lately.
mperham 2 hours ago ago
Did they ever explain why netlink was involved? Or was that a red herring?
[-]
- drob518 an hour ago ago
  Seemed like a red herring. They were able to reproduce it without any libraries. Might have just been net link forcing the stacks to a certain size and that made the bug visible.
- Sesse__ an hour ago ago
  The stack in that specific function was big enough to trigger the bug.
gok 3 hours ago ago
The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.
[-]
- themafia 2 hours ago ago
  Neither of those are "crazy shit." It's just complex because the environment offers specific features like automatic GC with async preemption in a compiled language which pretty much requires it.
  Complex engineering isn't something to be avoided by default.
- achierius an hour ago ago
  Sorry, how exactly do you think compilers are supposed to work if not by 'writing [their] own assembler'? Someone has to write the assembler, and different compilers have different needs.
- wat10000 2 hours ago ago
  The general wisdom is that you shouldn't do this stuff yourself, and you should instead rely on tried and tested implementations. But sometimes you're the one who provides the tried and tested implementations. Implementing a compiled language is often one of those times.
- blinkingled 3 hours ago ago
  This^. Keith W on Dtrace blog said it a decade ago https://wesolows.dtrace.org/2014/12/29/golang-is-trash/
  I like Go but I don't really like their NIH / replace everything with our stuff stance - esp on system tools like assemblers and linkers.
javierhonduco 5 hours ago ago
Really enjoyed reading this. Thanks for writing it!
pengaru 4 hours ago ago
For the impatient, here's the fix: https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c...
[-]
- cmckn 3 hours ago ago
  I noticed this when reviewing the linked issue: https://github.com/golang/go/issues/73259#issuecomment-31004...
  Does the Go team have a natural language bot or is this just comment.contains(“backport”) type stuff?
  [-]
  - kbolino 3 hours ago ago
    The latter: https://github.com/golang/build/blob/master/cmd/gopherbot/go...
    (found via https://go.dev/wiki/gopherbot)
    [-]
    - 9rx 2 hours ago ago
      Although also the former (gabyhelp): https://github.com/golang/oscar/tree/master/internal/gaby
brcmthrowaway an hour ago ago
I don't get it, how were the machine threads being stopped in thr middle of two instructions? This is baremetal, right?
[-]
- adgjlsfhk1 an hour ago ago
  go uses interrupts for GC notifications
renewiltord 3 hours ago ago
Great technical blog. Good pathway for narrative, tight examples, description so clear it makes me feel smarter than I am because so easy to follow though the last time I even read assembly seriously was x86 years ago.
Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
wat10000 2 hours ago ago
I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.
[-]
- mperham 2 hours ago ago
  The frame pointer was updated non-atomically in two asm ops. An async interruption between the two ops would lead to a corrupt frame pointer.
  [-]
  - wat10000 an hour ago ago
    So it was. The article never mentions the frame pointer and I'm familiar with compilers that load the saved value from the stack in the epilog, rather than adjusting it arithmetically. But they do have an assembly listing showing the two-step arithmetic adjustment for both the stack pointer and frame pointer.
    But I'm not sure that matters, because the unwind code they show uses the stack pointer rather than the frame pointer anyway.
berz01 2 hours ago ago
solid research, looks like you'd make a great CRUD engineer.