I work on a heavy azure shop. Talking about thousands of subscriptions and over 70000 compute resources. I gave up on Azure Portal. It goes crazy after I activate my access to the big management groups. It just doesn't work so I started writing my own scripts using the python cli and api.
We also face lots of weird errors and less than stellar support from Microsoft. I'm not going deep into these because I don't want to identify my employer but one of my biggest sources of headaches nowadays is AMPLS. I have no idea who came up with that overcomplicated and prone to failure design.
"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without the full support from the management who do not fully understand the problem nor are incentivized to understand it.
Noticed how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch. Only big launches matter.
The other corporate problem is that it takes time before the cleanup produces measurable benefits and you may as well get reorged before this happens.
This is the root of the issue. For something like Azure, people are nor fungible. You need to retain them for decades, and carefully grow the team, training new members over a long period until they can take on serious responsibilities.
But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).
> For something like Azure, people are nor fungible
What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.
A geographic area where there's not abundant opportunity for software developers. Usually everywhere outside the major metro areas. It was primarily meant to discount experiences from SF or Seattle where I'm sure finding talent is easy enough, assuming you are willing to pay.
Its a cool talent filter though, if you higher people the set of people that quit on doomed projects and how fast they quit is a real great indicator of technological evaluation skills.
No joke, I worked at a place where in our copy of system headers we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files that were considered too risky to make changes in that still had dos style near and far pointers that we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...
Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with template magic compared with, you know just making the actual change, reviewing it, and checking it in.
unfortunately, what you will find is that unless you get lucky, the next ship is more of the same.
The system/management style is ingrained in corporate culture of large-ish companies (i would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large").
It stems from the fact that when an executive is bestowed the responsibility of managing a company from the shareholders, the responsibility is diluted, and the agent-principle problem rears their ugly head. When several more layers of this starts growing in a large company, the divergence and the path of least resistance is to have zero trust in the "subordinates", lest they make a choice that is contrary to what their managers want.
The only way to make good software is to have a small, nimble organization, where the craftsman (doing the work) makes the call, gets the rewards, and suffers the consequences (if any). That aligns the agent-principle together.
Hierachy is the enemy of succeding projects and information flow. The more important and complex hierarchy in a culture the less likely it is to have a working software industry. Germanys and japanese endless :"old vs young, seniority vs new, internal vs external, company wide management vs project local management come to mind. Its guerilla vs army, startup vs company allover..
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs
The exact same approach is recommended in the book "Working effectively with legacy code" by Michael Feathers, with several techniques on how to do it. He describes legacy code as 'code with no tests'.
Hence the rewrite-it-in-Rust initiative, presumably. Management were aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or severe shortages of systems programming talent.
If 90% of the code I run is in safe rust (including the part that's new and written by me, therefore most likely to introduce bugs) and 10% is in C or unsafe rust, are you saying that has no value?
Il meglio è l'inimico del bene.
Le mieux est l'ennemi du bien.
Perfect is the enemy of good.
If you're sufficiently stubborn, it's certainly possible to call directly into code written in Verilog, held together with inscrutable Perl incantations.
High-level languages like C certainly have their place, but the space seems competitive these days. Who knows where the future will lead.
If you want something extra spicy, there are devices out there that implement CORBA in silicon (or at least FPGA), exposing a remote object accessible using CORBA
Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.
I worked at a startup that was using Azure. The reason was simple enough - it had been founded by finance people who were used to Excel, so Windows+Office was the non-negotiable first bit of IT they purchased. That created a sales channel Microsoft used to offer generous startup credits. The free money created a structural lack of discipline around spending. Once the startup credits ran out, the company became faced with a huge bill and difficulty motivating people to conserve funds.
At the start I didn't have any strong opinion on what cloud provider to use. I did want to do IT the "old fashioned way" - rent a big ass bare metal or cloud VM, issue UNIX user accounts on it and let people do dev/test/ad hoc servers on that. Very easy to control spending that way, very easy to quickly see what's using the resources and impose limits, link programs to people, etc. I was overruled as obviously old fashioned and not getting with the cloud programme. They ended up bleeding a million dollars a month and the company wasn't even running a SaaS!
I ended up having a very low opinion of Azure. Basic things like TCP connections between VMs would mysteriously hang. We got MS to investigate, they made a token effort and basically just admitted defeat. I raged that this was absurd as working TCP is table stakes for literally any datacenter since the 1980s, but - sad to say - at this time Azure's bad behavior was enabled by a widespread culture of CV farming in which "enterprise" devs were all obsessed with getting cloud tech onto their LinkedIn. Any time we hit bugs or stupidities in the way Azure worked I was told the problem was clearly with the software I'd written, which couldn't be "cloud native", as if it was it'd obviously work fine in Azure!
With attitudes like that completely endemic outside of the tech sector, of course Microsoft learned not to prioritize quality.
Because Azure customers are companies that still, in 2026 only use Windows. Anyone else uses something else. Turns out, companies like that don't tend to have the best engineering teams. So moving an entire cloud infrastructure from Azure to say AWS, probably is either really expensive, really risky or too disruptive to do for the type of engineering team that Azure customers have. I would expect MS to bleed from this slowly for a long time until they actually fix it. I seriously doubt they ever will but stranger things have happened.
Turns out outside companies shipping software products aspiring to be the next Google or Apple, most companies that work outside software industry also need software to run their business and they couldn't care less about HN technology cool factor.
They use whatever they can to ship their products into trucks, outsourcing their IT and development costs , and that is about it.
most the upper management of companies who use them have dont have the technical competence to see it. (eg: banks, supermarket chains, manufacturing companies)
once they are in, no one likes to admit they made a mistake.
Because the alternatives are also in similar state.
AWS or GCP are all pretty crap. You use any of them, any you'll hit just enough rough edges. The whole industry is just grinding out slop, quality is not important anywhere.
I work with AWS on a daily basis, and I'm not really impressed. (Also nor did GCP impress me on the short encounter I had with it)
This reads like it was written by the cleverest person in the room. I have to use Azure Devops at work, and some of the critique of Azure rings true, but the author-centric presentation was quite off-putting.
I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely being held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services it's nearly impossible to figure out what service you actually want/need without consultants, and when you finally get the services up who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
I’ve created a bunch of fresh Azure accounts over the past few years and each time I’ve found myself sitting there dumbfounded anew at how garbage the experience is.
There has been weird broken jank at just about every step of the process at one point or another. Like, I’m a serious person trying to set something up for a production workload, and multiple times along the way to just having a working account that I can log into with billing configured, I’ll get baffling error messages like [ServiceKeyDepartureException: Insufficient validation expectancy. Sfhtjitgfxswinbvgtt-33-664322888], and the whole thing will simply not work until several hours later. Who knows why!?
I evaluated some Azure + Copilot Studio functionality for a project recently, which required more engagement with their whole 365 ecosystem than I’d had in a long time and it had many of the same problems but worse. Just unbelievably low quality software for the price and how popular it is. Every step of the way I hit some stupid issue. The people using this stuff are clearly not the people buying it.
I've joked that on some services, when you're clicking buttons, you're actually opening tickets that a human needs to action.
That scenario is an example. You complete an action on a web page and nothing works. You make no further changes and hours later it works perfectly. Your human wasn't fast enough that day.
That's the "digital escort" process mentioned in the very long OP. Understandably, the US government got mad when they found out that cheap Chinese tech support staff were being used for direct intervention on "secure" VMs.
That's not what the "problem" was. It's that cheap American support people were "escorting" foreign Microsoft SWEs, so they could manage and fix services they wrote and were the subject matter experts for in the sovereign cloud instances which they otherwise would have no access to.
And this was NOT for the government clouds we have that hold classified data. Those are air-gapped clouds that physically cannot be accessed by anyone who doesnt have a TS clearance and physically go into a SCIF.
source: I work in a team very closely related the team who designed digital escort.
Yes but this misses the underlying point: this is the same software. It suffers from the same defects. If your management stack keeps crashing and leaking VMs you are seeing a reduction in the operational capacity of the fleet. If you are still there just tour Azure Watson and tell me if you’d want the military to rely on that system in wartime? Don’t forget things like IVAS and God knows what else that are used during operations while Azure node agents happily crash and restart on the hosts. The system should be no-touch and run like an appliance, which is predicated on zero crashes or 100% crash resiliency. In Windows Core we pursued a single Watson bucket with a single hit until it was fixed. Different standards.
I'm only commenting on parent comment's understanding of what digital escort process is specifically. Escort is used by all kinds of teams that are just doing day-to-day crap for various resource providers across azure. I've never worked anywhere close to Azure Core so I don't know about these more low-level concerns. Overall I agree and sympathize with your assessment of the engineering culture.
You also make it sound like getting a JIT approved is getting keys to the kingdom. It's not -- every team has it's own JIT policies for their resources. Should there be far less manual touches? Ideally. But JIT is better than persistent access at least, and JIT policies should be scoped according to principle of least privilege. If that is not happening, it's a failure at the level of that specific org.
> when you're clicking buttons, you're actually opening tickets that a human needs to action
I had one public cloud vendor sales literally admit this was the case with their platform. But they were now selling "the new one" which is supposed to be better.
I remember being impressed with the Azure docs... until I spend a week implementing something, only to have it completely fail when deployed to the test environment because GraphAPI did not work as documented. The beautiful docs were a complete lie.
These days I don't even bother looking at the docs when doing stuff with Azure.
And they were actually like that pre-LLM, in 2019, when I was implementing stuff for a car company on azure. They spent _hundreds of thousands_ on cosmosDB, for less performance than a raspberry pi running Postgres.
Pretty surprised to hear this. I would think (assuming they are LLM written as parent suggests), that MS could throw a large context "pro" LLM at the code base and you should get perfect docs, updated every release?
More perfect than a person where I might mistakenly copy/paste or write "Returns 404" but the LLM can probably see actually return a 401.
I'm not a stranger to LLMs hallucinating things in responses but I'd always assumed that disappeared when you actually pointed it at the source vs some nebulous collection of "knowledge" in a general LLM.
Is it your first time using an LLM? No, they generate plausible-sounding bullshit no matter the input. Sometimes that bullshit is useful. Other times it isn't.
Azure container apps are a great (idea) and work mostly fine as long as you don’t need to touch them. But they’re just like GCR or what fargate should be - container + resources and off you go.
We ran many internal workloads on ACA, but we had _so may issues_ with everything else around ACA…
The only good thing Microsoft azure ever did for me was provide a very easy way to exploit their free trial program in the early 2010s to crypto mine for free. It couldn’t do much, but it was straight up free real estate for CPU mining. $200 or 2 weeks per credit/debit card.
The part about prioritizing "aggressive feature velocity" over "core fundamentals" is true.
The push is as insane as push to AI.
At the same time fundamental improvements like migrating to .net core, or reducing logs is actively deprioritised. If it were not for compliance, we would not have any core engineering improvement at all
Honestly, I was not even aware of rust push, probably cause no one in my org could do rust. I am glad we did not move to AKS though
We migrated some services to AKS because the upper management thought it was a good deal to get so many credits, and now pods are randomly crashing and database nodes have random spikes in disk latency. What ran reliably on GCP became quite unpredictable.
Exact same story at my place. Upper management decided it's a good idea to build on Azure because Microsoft promised some benefits. Things that ran reliable on GCP now need active firefighting on Azure
Interesting!
We're using AKS with huge success so far, but lately our Pods are unresponsive and we get 503 Gateway Timeouts that we really can't trace down.
And don't get me started on Azure Blob Tables...
In our case, we spent to much time of engineer time just to put up with Azure but there’s no good ROI. It took sometime for the upper management to realize Azure is shit and cut the cost
> I've never seen an SLA which is clear cut enough to be worth pursuing if you want more than a free t-shirt.
I have, regularly. I am not sure what kind of business you are running but parties that rely on service providers for critical (primary business process driving) components routinely agree to SLAs with large penalties and the ability to open up an existing contract in case of non-performance. Obviously you would have to be willing to pay for such a service in the first place otherwise there is no point in setting up an SLA, this won't be cheap. But we're definitely not talking about 'free t-shirts' here, more about direct liability, per hour penalties and so on.
By the time SLA thresholds are being breached you've been through months (or years) of pain. They're not strong enough or specific enough to save you from a bad provider. ymmv
Colo and cloud providers that provide real SLAs exist. But they're pricey because they tend to insure against breach of that that SLA and they pass on the cost of that insurance. If you're a run-of-the-mill e-commerce company then it probably doesn't make much sense. But if you yourself are providing critical services to others and they have you by the short hairs in case you don't perform you better make sure that you're not going to end up holding the bag.
One simple example: energy market services, 15 minute ahead and day ahead markets require participants to have the ability to perform or they will be penalized severely, to the point where they can lose that access, the damage of which could easily be in the 10's of millions to 100's of millions depending on their size. Asset owners and utilities both would be able to hit them hard if they do not perform, the asset owners for lost income and the utilities for both government penalties and possibly for outages and all associated costs. These are not the kind of contracts you enter into lightly.
Exactly what I was thinking. But then again, from what I've seen, the persons responsible for monitoring uptimes are often much further removed from the C suite in these "committed-spend" companies.
A business man at a prior employer sympathetic with my younger, naive "Microsoft sucks" attitude told me something I remember to this day:
Microsoft is not a software company, they have never been experts at software. They are experts at contracts. They lead because their business machine exceeds at understanding how to tick the boxes necessary to win contract bids. The people who make purchasing decisions at companies aren't technical and possibly don't even know a world outside Microsoft, Office, and Windows, after all.
This is how the sausage is made in the business world, and it changed how I perceived the tech industry. Good software (sadly) doesn't matter. Sales does.
This is why most of Norway currently runs on Azure, even though it is garbage, and even though every engineer I know who uses it says it is garbage. Because the people in the know don't get to make the decision.
My lesson was when European companies followed US tech into offshoring, and how quality doesn't play any role as long as the software delivers, from business point of view.
Especially relevant when shipping software isn't the product the company sells.
The biggest expense in software is maintenance. Better software means cheaper maintenance. If you actually want to have a significant cost advantage, software is the way to go. Sadly most business is about sales and marketing and has little to do with the cost or quality of items being sold.
It will depend on each case and what makes the marketed solution inferior. If it's overly complex and you will save development time. If it's unstable you'll save debugging time. If it's bloated you will save on hardware costs. Etc...
Most customers don't really have the knowledge needed to make choices based on technical merits, and that's why the market works as it does. I'm willing to say 95% of people on HN have this knowledge and are therefore biased to assume others are the same way. It's classic XKCD 2501.
What are we reading here? These are extraordinary statements. Also with apparent credibility. They sound reasonable. Is this a whistleblower or an ex employee with a grudge? The appearance is the first. Is it? They’ve put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I previously had little experience with US corporate norms. What is normal here for such a level of concern?
More, why is this public not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
>Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
IME, yes.
I'm currently working as an SRE supporting a large environment across AWS, Azure, and GCP. In terms of issues or incidents we deal with that are directly caused by cloud provider problems, I'd estimate that 80-90% come from Azure. And we're _really_ not doing anything that complicated in terms of cloud infrastructure; just VMs, load balancers, some blob storage, some k8s clusters.
Stuff on Azure just breaks constantly, and when it does break it's very obvious that Azure:
1. Does not know when they're having problems (it can take weeks/months for Azure to admit they had an outage that impacted us)
2. Does not know why they had problems (RCAs we're given are basically just "something broke")
3. Does not care that they had problems
Everyone I work with who interacts with Azure at all absolutely loathes it.
As a former MSFTy it does sound weird to me too. I didn’t see what Axels level was but a lot of people work for Microsoft and not many of them can expect to email the CEO and get a response. It seems a bit like a crash out, not the first I’ve seen levied at Azure, won’t be the last. They probably think it’s a mental health episode, if you’re an important CEO crazy people will email you all the time and the staff probably filter them out before they see it. Also this is a lot of internal gossip, I would be worried that airing this publicly would impinge on future career opportunities, even healthy orgs would appreciate some discretion.
I’m sure everything he said is completely true, Azure is one of the few tech stacks I refuse to work with and the predominant reason I left.
If you’ve joined an org and nothing works the reason is usually that the org is dysfunctional and there is often very little you can do about it, and you’re probably not the first person who’s tried and failed at it.
Never worked at a FAANG, but from what I read from their cultures I don't think a letter to the CEO from a senior engineer would go entirely unnoticed there. CEO's might receive crazy letters, but hopefully not regularly from their senior engineering staff..
In my experience Azure is full of consistency issues and race conditions. It's enough of an issue that I was talking about new OpenAI models becoming available via Bedrock on AWS and how convenient that was since I wouldn't have to deal with Azure and my colleague in enterprise architecture went on an unprompted rant about these exact issues. It's not the first time something like this has happened and I've experienced these issues first hand, so yes. I'd say reliability is a critical issue for Azure and it hasn't gotten better each time I've gone back to check.
I recall seeing some pretty damning reports from a security pentester that was able to escape from a container on Azure and found the management controller for the service was years old with known critical unpatched vulnerabilities. Always been a bit sceptical of them since then
Large orgs make decisions that prioritize short-term metrics over long-term quality all the time and nobody tracks whether those tradeoffs actually paid off. The decision to ship fast and fix later sounds reasonable in a meeting setting until articles like this surface and the reality comes through clearly.
Wild guess, touching this with a 10-foot pole risks validating his claims. If they sue for breach of NDA, it means his claims are factually correct, and if they sue for libel and it goes to court, they may be forced to submit documents they don't want to.
> What are we reading here? These are extraordinary statements. Also with apparent credibility.
I left Microsoft in 2014. Already back then I could see this sort of stuff starting to happen.
The Office Org was mostly immune from it because they had a lot of lifers, people who had been working on the same code for decades and who thought through changes slowly.
But even by 2014 there were problems hiring developers who knew C++, or who wanted to learn it. COM? No way. One one team we literally had to draw straws once to determine who was going to learn how to write native code for Windows.
It wasn't even a talent thing, Windows development skills are a career dead end outside of Microsoft. They used to be a hot commodity, and Microsoft was able to hire the best of the best from industry. Now they have to train people up, and Microsoft doesn't offer any of the employment perks that they used to use to attract top talent (Seattle used to be a low CoL area, everyone had private offices, job stability).
When I started at Microsoft in 2007, the interview bar included deep knowledge of how computers worked. It wasn't unusual to have meetings drop down to talking about assembly code. Your first day after orientation was a bunch of computer parts and you were told to "figure out how to setup your box".
Antivirus wasn't mandatory. The logic was if you got a virus, they made a mistake hiring you and you deserved to be fired.
When your average developer can go that deep on any topic, you can generally leave engineers well enough alone and get good software.
On the other hand there was e.g. CVE-2021-1647 where Microsoft's antivirus would compromise the PC with no user action.
(At least I think that's the one I'm thinking of. It's marked as a high-severity RCE with no user interaction but they don't give any details. There was definitely at least one CVE where Windows Defender compromised the system by unsafely scanning files with excessive privileges.)
Yeah I thought that was extreme. An engineer going to the board of any corporation let alone Microsoft is not normal or customary IME. That could explain why they got no response.
The CEO is accountable to the board. If they are derelict in their obligations to the company, that's where you need to raise a stink so they can fix it.
Well, yeah, that’s what a board does, but I think the issue is whether it is customary to go to the board directly in this situation. The answer is a resounding NO. Very odd, but cool idea and approach.
Maybe naive, but why not? If it's a serious enough issue, and you're not getting anywhere through your management chain all the way up to the CEO, why is it novel to contact the people the CEO reports to? They're not royalty, they're other human beings who also eat, piss and fart like everyone else.
Before 6 years of Google I’d co-sign what you said, but it never ever plays out that way.
The law of the jungle is an iron law, make people around you feel bad, be a tattletale, and you’re choosing to be ostracized.
That said yr interlocutor disturbs me a bit because yes, they certainly will make it out to be a mental health episode. But the implicit deal there is “STFU. You can even take paid health leave.” It’s not healthy either. BigCo is insane I’ll never work for one again without outrageous comp.
You’d be stunned by even the simplest story. Ex. a year in some crazy shit was going down and my manager asked for my thoughts on a topic, I was honest and basically said “I don’t think it’s a good idea, but in my experience, raising issues involving people only raises more issues.” He swore up and down it wouldn’t be a problem, eventually made a deal I could email it to him privately. Next 1:1 with my area lead was horrible, them seeing red, hearing a mistranslated version of what I said, and I had 0 warning.
The post is so dramatized and clearly written by someone with a grudge such that it really detracts from any point that is trying to be made, if there is any.
From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it rather than trying to lay out all the dirty laundry to an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really? Azure is still chugging along apparently despite everything being mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
It wasn’t specifically about the escort sessions from any particular country, though, but about the list of underlying reasons why direct node access was necessary.
> Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really?
That struck me too. Maybe i've never worked high enough in an org (im unclear how highly ranked the author of the piece is) but i've never been in an org where going over your boss's boss's boss's boss's head and writing a letter to the board was likely to go well.
That said, i could easily believe that both Azure is an absolute mess and that the author of the piece was fired because of how he went about things.
I work in azure. It’s a mess, but what large system isn’t. Now extrapolate the to one of the biggest systems in existence.
The only reason a low level employee like OP is emailing satya is because they have a personality disorder or are having a psychotic break, which is pretty clear from OP’s manifesto
It is true that writing to the board will get you noticed, and that you might not like the consequences. If you value having the job then don’t write to the board. Even if you are right, being noticed like that isn’t going to endear you to your boss.
But if you care more about doing the right thing then writing to the board is the right thing to do. And after a few years of working at Microsoft you might not value your job very much either and you too might decide to go out in style.
Windows is ~500 times bigger than Azure, give or take, by machine count, and still many times larger by loc, modules, users, whatever else you want to measure. The heavy lifting (VM/containers, I/O, the things that cannot not be done just like that) is handled by the Windows folks anyway. The only hard part is the VM placement, everything else is mostly regular software engineering, some of medium-hard complexity but nothing that can excuse the need for constant human intervention.
It is, but “Microsoft runs on trust” they say. They also say the CEO’s inbox is always open, actually the CEO himself says it in the yearly mandatory training video on business conduct. So it should be safe, in theory, to openly speak out in the best interest of the customers, no? Rhetorical question :)
AWS and Google Cloud are both huge and are significantly better in UX/DX. My only experience with Azure was that it barely worked, provided very little in the way of information about why it didn't. I only have negative impressions of Azure whereas at least GC and AWS I can say my experiences are mixed.
> From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra.
Before the days of title inflation across the industry, a a Principal at Microsoft was a rare thing. When I was there, the ratio was maybe 1 principal for every 30 developers. Principals were looked up to, had decades of experience, and knew their shit really well. They were the big guns you called in to fix things when the shit really hit the fan, or when no one else could figure out what was going on.
Thanks. That reference is correct. The point is why those sessions were necessary because there is no reason, a-priori, to do manual touches on production systems, DoD or not.
Yes it's easy to critique any large system or organisation, to then go over everyone's head and cry to the CEO and Board is snake like behaviour especially offering you self as the answer to fix it. OP will be marked as a troublemaker and bad team member.
Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
This I'm more sympathetic to. I really don't think his approach of "here's what a rewrite would look like" was ever going to work and it makes me think that there's another side to this story. Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
At no point during the reading I got sense that he's suggesting something radical. Where specifically is he pointing out rewrite?
"The practical strategy I suggested was incremental improvement... This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale." [1]
> The current plans are likely to fail — history has proven that hunch correct — so I began creating new ones to rebuild the Azure node stack from first principles.
> A simple cross-platform component model to create portable modules that could be built for both Windows and Linux, and a new message bus communication system spanning the entire node, where agents could freely communicate across guest, host, and SoC boundaries, were the foundational elements of a new node platform
Yes, I read that part as well and found it a bit confusing to reconcile with this one.
The vibe from my quotes is very much "I had a simple from-scratch solution". They mention then slowly adopting it, but it's very hard to really assess this based on just the perspective of the author.
He also was making suggestions about significantly slowing down development and not pursuing major deals, which I think again is not necessarily wrong but was likely to fall on deaf ears.
Interesting point. The two stances are not contradictory. The end result is a new stack, so you are right saying that was the intent. However how you get there on a running system is through stepwise improvements based on componentization and gradual replacement until everything is new. Each new component clears more ground. I never imagined an A/B switch to a brand new system rewritten from scratch.
> Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
I've been involved with FEDRAMP initiatives in the past. That doesn't mean as much as you'd think. Some really atrocious systems have been FEDRAMP certified. Maybe when you go all the way to FEDRAMP High there could be some better guardrails; I doubt it.
Microsoft has just been entrenched in the government, that's all. They have the necessary contacts and consultants to make it happen.
> Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
The author does mention rewriting subsystem by subsystem while keeping the functionality intact, adding a proper messaging layer, until the remaining systems are just a shell of what they once were. That sounds reasonable.
> I've been involved with FEDRAMP initiatives in the past. That doesn't mean as much as you'd think. Some really atrocious systems have been FEDRAMP certified. Maybe when you go all the way to FEDRAMP High there could be some better guardrails; I doubt it.
I never said otherwise. I said that Microsoft services are the defacto tools for FEDRAMP. I never implied that those environments are some super high standard of safety.
> Microsoft has just been entrenched in the government, that's all.
Yes, this is what I was saying.
> The author does mention rewriting subsystem by subsystem while keeping the functionality intact, adding a proper messaging layer, until the remaining systems are just a shell of what they once were. That sounds reasonable.
It sounds reasonable, it's just hard to say without more insight. We're getting one side of things.
Thanks. That was exactly the plan. Full rewrites are extremely risky (see the 2nd System syndrome) as people wrongly assume they will redo everything and also add everything everyone always wanted, and fix all dept, and do it in a fraction of the time, which is delusional and almost always fail. Stepwise modernization is a proven technique.
As someone who had worked adjacent to the functionally-same components (and much more) at your biggest competitor, you have my sympathy.
Running 167 agents in the accelerator? My gawd that would never fly at my previous company. I'd get dragged out in front of a bunch of senior principals/distinguished and drawn and quartered.
And 300k manual interventions per year? If that happened on the monitoring side , many people (including me) would have gotten fired. Our deployment process might be hack-ish, but none of it involved a dedicated 'digital escort' team.
I too have gotten laid off recently from said company after similar situation. Just take a breath, relax, and realize that there's life outside. Go learn some new LLM/AI stuff. The stuff from the last few months are incredible.
We are all going to lose our jobs to LLM soon anyway.
Meaning Msft Principal is below L5? I got the same feedback from one of my friends who works at Google. She said quality of former MSFT engineers now working at Google was noticeably lower.
In fairness the SECWAR is hardly a computing expert.
But in this case the SECWAR has been properly advised. If anything it's astonishing that a program whereby China-based Microsoft engineers telling U.S.-based Microsoft engineers specific commands to type in ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
Some ideas "sell themselves", ideas like these do the opposite.
> If anything it's astonishing that a program whereby China-based Microsoft engineers telling U.S.-based Microsoft engineers specific commands to type in ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
> It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
Holy fuck. Ok, this will change things considerably for some companies I'm working with that had moved their stuff to Azure. Thanks. More than I can express on here.
I'm sympathetic to the viewpoint but I'm not in the habit of policing the names people use for themselves.
I've certainly done more than my fair share of jobs in the Navy where the office I was formally billeted to had long since ceased to actually exist as described due to office renamings. Often things as simple as a department section being elevated into a department branch and people using the new name even while they wait 1-2 years for the manpower records to be fixed and the POM process to cycle through for program resourcing. But still, seems hard to treat it as a crime at one level when no one blinked an eye at the lower level.
Maybe Congress will eventually step in, but in the meantime the American voters made their choice about who they want to run these agencies, so...
The main title of the office is still “secretary of defense”, the executive order added a secondary title of the department and the office, it didn't replace the primary titles.
This was such a genuinely weird moment for me when reading the article.
"yadda yadda and then also the secretary of defence agreed it was bad"
I'm just reading along and going, "yeah that sounds really bad if a secretary level position is being cited... wait a second, isn't that actually the guy who is literally famous for being stupid??"
I never expected to be living through a real life version of "the emperor's new clothes", like, how is anyone quoting this guy about anything?
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
Or… you’ve just normalised the deviation.
One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
After about three or four weeks everyone adapts, learns what they can and can’t criticise without fallout, and settles into the mud to wallow with everyone else that has become accustomed to the filth.
As an Azure user I can tell you that it’s blindingly obvious even from the outside that the engineering quality is rock bottom. Throwing features over the fence as fast as possible to catch up to AWS was clearly the only priority for over a decade and has resulted in a giant ball of mud that now they can’t change because published APIs and offered products must continue to have support for years. Those rushed decisions have painted Azure into a corner.
You may puff your chest out, and even take legitimate pride in building the second largest public cloud in the world, but please don’t fool yourself that the quality of this edifice is anything other than rickety and falling apart at the seams.
Remind me: can I use IPv6 safely yet? Does it still break Postgres in other networks? Can azcopy actually move files yet, like every other bulk copy tool ever made by man? Can I upgrade a VM in-place to a new SKU without deleting and recreating it to work around your internal Hyper-V cluster API limitations? Premium SSDv2 disks for boot disks… when? Etc…
You may list excuses for these quality gaps, but these kinds of things just weren’t an issue anywhere else I’ve worked as far back as twenty years ago! Heck, I built a natively “all IPv6” VMware ESXi cluster over a decade ago!
> One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
Eh, I don't think this is exactly as reliable as you'd expect.
My previous job had a fairly straight forward code base but had fairly poor reliability for the few customers we had, and the WTF portions usually weren't the ones that caused downtime.
On the other hand, I'm currently working on a legacy system with daily WTFs from pretty much everyone, with a greater degree of complexity in a number of places, and yet we get fewer bug reports and at least an order of magnitude if not two more daily users.
With all of that said... I don't think I've used any of Microsoft's new software in years and thought to myself "this feels like it was well made."
This is insane, when you say azure OpenAI, do you mean like github copilot, microsoft copilot, hitting openai’s api, or some openai llm hosted on azure offering that you hit through azure? This is some real wild west crap!
I have noticied a similar bug on Copilot. I noticed a chat session with questions that I had no recollection of asking. I wonder if it's related. I brushed it off as the question was generic.
Fun ones include people trying to get GPT to write malware.
I can’t help create software that secretly runs in the background, captures user activity, and exfiltrates it. That would meaningfully facilitate malware/spyware behavior.
If your goal is legitimate monitoring, security testing, or administration on systems you own and where users have given informed consent, I can help with safe alternatives, for example:
- Build a visible Windows tray app that:
- clearly indicates it is running
- requires explicit opt-in
- stores logs locally
- uploads only to an approved internal server over TLS
- Create an endpoint telemetry agent for:
- process inventory
- service health
- crash reporting
- device posture/compliance
- Implement parental-control or employee-monitoring software with:
- consent banners
- audit logs
- uninstall instructions
- privacy controls and data retention settings
I can also help with defensive or benign pieces individually, such as:
- C# Windows Service or tray application structure
- Secure HTTPS communication with certificate validation
- Code signing and MSI installer creation
- Local encrypted logging
- Consent UI and settings screens
- Safe process auditing using official Windows APIs
- How to send authorized telemetry to your own server
If you want, I can provide a safe template for a visible C# tray app that periodically sends approved system-health telemetry to your server
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025 —most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
I honestly thought this was one of the weaker points of the article.
The OpenAI deal almost certainly related purely to GPU capacity, which had little to do with the article. The layoffs would have happened regardless.
IMO - churn, and generalization is the root cause. Engineers are thrown on projects for a year with little prior experience, leave others to pickup the pieces, etc. There's no longer a sense of ownership, and I'm sure the recent wave of layoffs isn't helping with this.
Well "Outlook (new)" finally stopped OOM-ing on my very normal-sized inbox, so I went back to using it over Outlook Classic... Can't say I notice a difference much these days.
(Not a residential inbox, the "I work in IT" sized inbox with all the email alerts about jobs failing...)
Some previous colleague of mine has to work with Azure on their day to day, and everything explained in this article makes a lot of sense when I get to hear about their massive rantings of the platform.
12 years ago I had to choose whether to specialize myself in AWS, GCP or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow and click-ops methodology. This article confirms my suspicions at that time, and my colleague experience.
What makes anyone start a new project and think “I know, I’ll use Azure!”? I really don’t get it. Do they have a great sales org? Is it because a phb thinks “well they made Office so it must be good”?
I interviewed with a Dutch energy company migrating infra from AWS -to- Azure and I have no idea what would make them do that (aside from inertia, but then why use Azure in the first place?)
And for some reason Azure usage is rampant in Europe.
In some places the purchasing decisions are not made by technical people. The infrastructure team gets azure budget and that's what they have to work with.
At my work the sales people regularly come to us with some azure discount they got offered on linkedin or some event. Luckily I have the power to tell them to fuck off.
A lot of enterprise orgs are completely helpless without Microsofts' identity solutions. That's what makes it easy to just adopt more and more Microsoft products.
At the startup I worked at in 2023, Azure was considered the only “safe” way to use OpenAI APIs in prod (eg agreements that the data couldn’t be used for training).
Working with Azure was one of the worst parts of that job.
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
It is kind of a fundamental risk of IMDS, the guest vms often need some metadata about themselves, the host has it. A hardened, network gapped service running host side is acceptable, possibly the best solution. I think the issue is if your IMDS is fat and vulnerable, which this article kind of alludes to.
There’s also the fact that azure’s implementation doesn’t require auth so it’s very vulnerable to SSRF
You could imagine hosting the metadata service somewhere else. After all there is nothing a node knows about a VM that the fabric doesn’t. And things like certificates comes from somewhere anyway, they are not on the node so that service is just cache.
Hosting IMDS on the host side is pretty much the only reasonable way to provide stability guarantees. It should still work even if the network is having issues.
That being said, IMDS on AWS is a dead simple key-value storage. A competent developer should be able to write it in a memory-safe language in a way that can't be easily exploited.
Why would an Azure customer need to query this service at all? I was not aware this service even exists- because I never needed anything like it. AFAI can tell, this service tells services running on the VM what SKU the VM is. But how is this useful to the service? Any Azure users could tell how they use IMDS? Thanks!
> Why would an Azure customer need to query this service at all? I was not aware this service even exists- because I never needed anything like it.
The "metadata service" is hardly unique to Azure (both GCP & AWS have an equivalent), and it is what you would query to get API credentials to Azure (/GCP/AWS) service APIs. You can assign a service account² to the VM¹, and the code running there can just auto-obtain short-lived credentials, without you ever having to manage any sort of key material (i.e., there is no bearer token / secret access key / RSA key / etc. that you manage).
I.e., easy, automatic access to whatever other Azure services the workload running on that VM requires.
¹and in the case of GCP, even to a Pod in GKE, and the metadata service is aware of that; for all I know AKS/EKS support this too
²I am using this term generically; each cloud provider calls service accounts something different.
I use GCP, but it also has the idea of a metadata server. When you use a Google Cloud library in your server code like PubSub or Firestore or GCS or BigQuery, it is automatically authenticated as the service account you assigned to that VM (or K8S deployment).
This is because the metadata server provides an access token for the service account you assigned. Internally, those client libraries automatically retrieve the access token and therefore auth to those services.
We run a significant amount of stuff on spot-instances (AKS nodes) and use the service detect, monitor and gracefully handle the imminent shutdown on the Kubernetes side.
Mainly for getting managed-identity access tokens for Azure APIs. In AWS you can call it to get temporary credentials for the EC2’s attached IAM role. In both cases - you use IMDS to get tokens/creds for identity/access management.
Client libraries usually abstract away the need to call IMDS directly by calling it for you.
Thank you, and everyone else who responded. So then this type of service seems to be used by other cloud providers (AWS). What makes this Azure service so much more insecure than its AWS equivalent?
Having it running on host (!), and the metadata for all guest VMs stored and managed by the same memory/service (!!), with no clear security boundary (!!!).
It's like storing all your nuke launch codes in the same vault, right in the middle of Washington DC national mall. Things are okay, until they are not okay.
This reads pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he described. It took constant force of will, fighting, escalation, etc to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
The personal account makes a lot of sense, although I could easily see why the OP was not successful. Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
The second thing is that this series of blog posts (whether true or not, but still believable) provides a good introduction to vibe coders. These are people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" due to LLMs.
Writing the actual code itself (fully or partially) maybe yes. But understanding the complexity of the system and working with organisational structures that support it is a completely different ball game.
I've worked on honing my communication skills for 20 years in this industry. Every time I have failed to get the desired result, I have gone back to the drawing board to understand how I can change how I'm communicating to better convey meaning, urgency, and all that.
After all that I've finally had an epiphany. They simply don't care. They don't care about quality, about efficiency, about security. They don't care about their users, their employees, they don't care about the long term health of the company. None of it. Engineers who do care will burn out trying to "do their job" in the face of management that doesn't care.
It's getting worse in the tech industry. We've reached the stage where leaders are in it only for themselves. The company is just the vehicle. Calls for quality fall on deaf ears these days.
> Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
I was thinking like this for a while but, now, I think this expectation is majorly false for a senior individual contributor. Especially when someone who can push out a detailed series of blogposts and has tried step-wise escalation.
Communication is a two-way street. Unlike the individual contributors, the management is responsible for listening and responding to risk assesments by the senior members and also ensuring that the technical competence and experienced people are retained in a tech company. If a leader doesn't want to keep an open ear, they do not belong there. If there is a huge attrition of highly senior people from non-finalized projects, you do not belong leadership either. Both cases are mentioned in the article.
Unfortunately our socioeconomic and political culture in the West has increasingly removed responsibilities and liabilities from the leadership of the companies. This causes people with lackluster technical, communication and risk assesment mentality being promoted into leadership positions.
So outside of a couple completely privately owned companies or exceptionally well organized NGOs, it will be increasingly difficult to find good leaders.
OP was not successful because they didn't want to fix the problems he discussed. I have been in the same exact situation, and no level of communication skills would have been successful in changing their minds.
The truth is, only small companies build good stuff. Once a company becomes big enough, the main product that it originally started on is the only good thing that is worth buying from them - all new ventures are bound to be shit, because you are never going to convince people to break out of status quo work patterns that work for the rest of the company.
The only exception to this has been Google, which seems to isolate the individual sectors a lot more and let them have more autonomy, with less focus on revenue.
I did not get that impression at all. He mentioned quite a few conversations with partner level employees, technical fellow, principal managers.
The impression I got is he tried to fix things, but the mess is so widespread and decision makers are so comfortable in this mess that nobody wants to stick their necks out and fix things. I got strong NASA Challenger vibes when reading this story…
Axel's engagement with the issue and refusal to give up is admirable. It also demonstrates that code and architecture remain important even in an era when managers believe these subjects can now be handled by LLMs. Imagine if LLMs were mandated for use in such an environment, further distancing SWEs from the code and overarching architectural choices. I am not saying that it can't work. But friction and maturity through experience really matters.
Also explains perfectly why I never met an engineer who was eager to run workloads on Azure. In orgs I worked, either the use of Azure was mandated by management (probably good $$ incentives) or through Microsoft leaning into the "Multi-Cloud for resilience" selling point, to get Orgs shift workloads from competitors.
On a leadership level it seems problematic that they ghosted the feedback. Direcly this leads to people like Axel who feel ownership of the problem to break NDAs and create company harming posts.
In my experience they at least respond with corp speak platitudes meaning that they got the feedback and don't understand it or ignore it, but have been taught to always ask for feedback and answer it (but incentives are to ask for feedback, then ignore it).
To be honest, I don’t think this is “company harming”—what would be harming is Azure being pwned if they didn’t know and did nothing, or failing SLA at the wrong time. Now they know.
I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality, slowness. I was in GitHub forums, helping other customers suffering from lack of basic functionality, incredible prices with abysmal performance. This article explains a lot honestly.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.
Unless you work in Alphabet's marketing department, then no GCP isn't the best one. The most reliable cloud has always been AWS by a wide margin. The exec in charge of GCP has had to apologize in public on multiple occasions for GCP's reliability problems. Sounds like they have fixed them by now (years later) but that doesn't make up the disaster that was BigQuery.
Also, GCP is more focused on smaller customers so perhaps that's the part that works for you. AWS can be a bit daunting. But AWS actually versions their APIs and publishes roadmaps and timelines for when APIs get added and retired and what you should use instead. GCP will just cancel things on short notice with no replacement.
I thought that about GCP until I used it more seriously and kept running into issues where they didn’t have some feature AWS had had for ages, and our Google engineers kept saying the answer was to run your own service in Kubernetes rather than use a platform service which did not give me confidence that they understood what the business proposition was.
GCP's support is abysmal. Our assigned customer support agent has changed 3 times in as many months. it's really a dice roll if our quota increase requests are even acknowledged or we can get clarification on undocumented system limitations.
I was a career Microsoft stack developer until Azure. Comparing it to AWS immediately forced me to make a decision to move away from their stack and towards AWS.
Just the networking and security infrastructure was complete trash compared to how those things worked in AWS.
The only time I used Azure was for setting up Microsoft as a provider for authentication. Put me through a never-ending loop of asking for a Government of India issued document that was already submitted. Human support was non-existent. Decided never to use Azure in any product after that horrible experience.
If you cannot even get auth right I shudder to think what the rest of the product will be like to deal with should issues arise.
> Cutler’s intent was to produce a system with the same level of quality, unshakable reliability, and attention to detail he was famous for in his work on VMS and NT.
> was maintaining in-memory caches containing unencrypted tenant data, all mixed in the same memory areas, in violation of all hostile multi-tenancy security guidelines
Splitting caches to different isolated memory areas will not make shareholders happy, will not lead to promotion and will not even move the project forward.
Simply put, designing secure software is detrimental in that environment.
I tried to use Azure once (more than 5 years ago), and the signing up kept crashing on me for hours. Never used it again since then. Some things are obvious.
We run 1000s of machines in Azure. It's garbage. Very few features work. Nodes are always having strange issues, especially on the networking side. And the worst part is that Azure support has 0 interest in actually debugging things. We just got out of an outage today caused by the insanely slow SSDs that they attach to their postgres dbs by default.
> Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
Rust really going for the node ecosystem's crown in package number bloat
Rust is nowhere close to Node in terms of package number bloat. Most Rust libraries are actually useful and nontrivial and the supply chain risk is not necessarily as high for the simple reason that many crates are split up into sub-crates.
For example, instead of having one library like "hashlib" that handles all different kinds of hashing algorithms, the most "official" Rust libraries are broken up into one for sha1, one for sha2, one for sha3, one for md5, one for the generic interfaces shared by all of them, etc... but all maintained by the same organization: https://github.com/rustcrypto/
Most crypto libraries do the same. Ripgrep split off aho-corastick and memchr, the regex crate has a separate pcre library, etc.
Maybe that bumps the numbers up if you need more than one algorithm, but predominantly it is still anti-bloat and has a purpose...
While i agree the exact line “rust libraries are useful and non-trivial” i have heard from all over the place as if the value of a library is how complex it is. The rust community has an elitist bent to it or a minority is very vocal.
Supply chain attacks are real for all package registries. The js ones had more todo with registry accounts getting hacked than the compromised libraries being bad or useless.
There is a difference between individual packages coming out of a single project (or even a single Cargo workspace) vs them coming out of completely different people.
The former isn't a problem, it is actually desirable to have good granularity for projects. The latter is a huge liability and the actual supply chain risk.
For example, Tokio project maintains another popular library called Prost for Protobufs. I don't think having those as two separate libraries with their own set of dependencies is a problem. As long as Tokio developers' expertise and testing culture go into Prost, it is not a big deal to have multiple packages. Similarly different components of the Tokio itself can be different crates, as long as they are built and tested together, them being separate dependencies is GOOD.
Now to use Prost with a gRPC server, I need a different project: tonic which comes from a different vendor: Hyperium. This is an increased supply chain risk that we need to vet. They use Prost. They also use the "h2" crate. Now, I need to vet the code quality and the testing cultute of multiple different organizations.
I have a firm belief that the actual People >>> code, tooling, companies and even licensing. If a project doesn't have (or retain) visionary and experienced developers who can instill good culture, it will ship shit code. So vetting organizations >> vetting indiviual libraries.
Power Platform is of the same quality, I’d avoid it if possible.
I was a principal engineer in the Power Platform org and it always felt like a disorganized mess. Multiple reorganizations per year, changing priorities and service ownership.
These days, at work, I need to support applications build on Azure and Power Platform. Both are a hot mess.
We get notifications that our APIM is down for at least 15min every weeks at random times.
Power Platform is just a "preview" mess, things break and are not functional.
I complained about it and basically was told to shut up, the industry is using them, so they must be right.
I was always very curious why people are using azure. Clunky difficult to setup and crazy prices. I know a person being very happy with them because of the credits they gave it to him. I felt I probably don't have a model that explains what is going on there and that would be cool to know why people pay them vs the competion
Well, part 3 at least explains something I've observed; the platform is incredibly unstable. The same calls, with the same parameters, will often randomly fail with HTTP 400 errors, only to succeed later(hopefully without involving support). That made provisioning with terraform a nightmare.
I won't even dive too much into all the braindead decisions. Mixing SKUs often isn't allowed if some components are 'premium' and others are not, and not everything is compatible with all instances. In AWS, if I have any EBS volume I can attach it to any instance, even if it is not optimal. There's no faffing about "premium SKUs". You won't lose internet connectivity because you attached a private load balancer to an instance. Etc...
At my company, I've told folks that are trying to estimate projects on Azure to take whatever time they spent on AWS or GCP and multiply by 5, and that's the Azure estimate. A POC may take a similar amount of time as any other cloud, but not all of the Azure footguns will show themselves until you scale up.
I’ve been working with Azure and Azure Germanyfor the past years and have a strong history with AWS.
I cannot count how many times disks were not attaching during AKS rescheduling. We build polling where we polled Entra Id for minutes until it became “eventually” consistent - not trusting a service principal until it was fetched at least one minute consistently. The slowness of Azure Functions was unbearable. On Azure germany IoT Hubs had to be “rebooted” by support constantly - which was a shocking statement in itself. The docs always lying or leaving out critical parts. The whole Premium vs Standard stuff is like selling windows licenses. The role model and UI is absolutely inconsistent.
The stability, consistency of IAM, and speed of AWS in comparison makes me truly wonder how anyone stays with Azure. One reason might be that the Windows instances are significantly cheaper though..
Thanks for that, now I have a rock-solid argument when people say "oh we're already Microsoft customers, we'll just use Azure, it's easier, and they have Active Directory!!"
My most memorable anecdote from working in Azure is that they had two products named Purview and the internal MS people I talked to never figured out which one I was trying to use.
I have been in a Microsoft adjacent company (meaning lots of people bounced to and from Microsoft to it) and all this makes a lot of sense. The almost ideological “everything in house” and politically oriented philosophy they had fits like a glove. Some of the ex Microsoft people hated it, some of them missed it. But the picture they made was pretty bleak.
Given how windows is going what’s described in the article doesn’t seem so shocking either. Even though they need not be correlated products, I can’t help but seeing a similar shortsightedness in the playbooks they are adopting.
As an investor, this is exact how I feel. Everything was skyrocketing until OpenAI “diversified” mid-2025. The company’s market value has dropped more than 1 trillion since late October 1025, so the title is factual. You can rightfully argue and be skeptical about the link I make, but not about the numbers :)
This read was a blast from the past. I'm not going to comment on much from OP and instead give a little of my experience there.
Straight out of college in 2017 I joined the Compute Fabric Controller (FC) org as a SWE on an absolutely wonderful team that dealt with mostly container management, VM and Host fault handling & repair policies, and Fabric to Host communication with most of our code in the FC. I drove our team's efforts on the never ending "Unhealthy" node workstream, the final catch-all bucket in the Host fault handler mentioned in OP. I also did heavy work in optimizing repair policies, reactive diagnostics for improved repairs and offline analysis, OS and HW telemetry ingestion from the Host like SEL events into the repair manager in real time, wrote the core repair manager state machine in the new AZ level service that we decoupled from the Fabric, drove Kernel Soft Reboot (KSR)/Tardigrade as a repair action for minimal VM impact for some host repairs, and helped stand up into eventually owning a new warm path RCA attribution service to help drive the root underlying causes of reliability issues and feed some offline analysis back into the live repair manager.
The work was difficult but also really really interesting. For example, Balancing repair policies around reliability is tricky. There's a constant fight in repair policies in grey situations between minimizing total VM downtime vs any VM interruptions/reboots/heals at all, because the repair controller doesn't have perfect information. If telemetry is pointing to VMs being degraded or down on the host, yet in reality they're not, we are the ones inducing the VM downtime by performing an impactful repair. If we wait a little while before taking an impactful repair action, it may be a transient issue that will resolve itself in the moment, at which point we can do much less impactful repairs after like Live Migration if the host is healthy enough. On the flipside, if some telemetry is saying the VMs are up yet they're down in reality and we just don't know it yet, taking time to collect diagnostics and then take a repair action(s) leads to only more overall total downtime.
When I joined in 2017 our team was 7 or 8 people including myself, yet had enough work for at least double that amount of people. On-call was a nightmare the first 2 years. Building Azure back then was like trying to build a car while already sitting behind the steering wheel of that car as it was already barreling down the highway. Everyone on my immediate team the first couple years were a joy to work with, highly competent, hard working, and all of us working absurd hours. For me 60hrs/wk was avg, with many weeks ~80 and a few weeks ~100. Other than the hours though, it was a splendid team environment and I'd like to think we had good engineering culture within our team, though maybe I'm biased. Engineering culture and quality did however vary substantially between orgs and teams. We were heavily under resourced and always needed more headcount, as did nearly every other team in Azure Compute. That never changed during my tenure even though my team's size ballooned to ~20 by 2020, and eventually big enough to where we had to split the team. There was high turnover from the lack of headcount and overwork which was somewhat alleviated by lowering the hiring bar... which obviously opened up another can of worms. This resourcing issue might explain, in part, why Azure is the way it is. We were always playing catchup as a result of the woes of chronic understaffing for years. I eventually burnt out which turned into spiraling mental health, physical health issues, constant panic attacks, and then a full blown mental health crisis after which I took LOA and eventually left the company. I came back briefly for a bit during LOA, and learned that the RCA service I'd built with the help of a coworker (who also left Azure) and was only a small part of our overall workload, had turned into a full fledged team of 9 people dedicated to working on that service in my absence. I know that stating some of this might affect my employment in the future but I don't really care. I know I'm not alone in experiencing burnout working in Azure. It wasn't my manager's fault either, he was amazing. He'd often ask and I would incorrectly yet confidently reassure him that I wasn't burning out but I simply didn't notice the signs. Things are better now though and I'm just happy to be here.
Kudos to the many brilliant people I worked alongside there, I hope you're all doing great.
The first and most important lesson, that I try to each every young developer starting in the industry: Go home after clocking in your hours negotiated in your contract. Drop your pen. Go home. Sleep well.
And I hope, that every sensible senior developer in here does the same. Lead by example. Maybe it would prevent a few burnouts in this industry.
And if you are a manager, then send your people home after they have clocked in their negotiated hours. For their own well-being. It’s your responsibility. And if it’s not working, then force them to go home.
I hope you are better by now and got through the tough time. All the best for you!
I just do not understand how Azure has the scale it does. You only need to login and click around for a bit to see this is not a coherent system designed by competent people. Let alone try and actually build something on it.
From my old experience in IT - people just default to Microsoft for everything. They don't want to hassle with learning anything else and assume better the devil you know. Glad I'm out of that world but its wild what people will put up with.
People and organizations that built things on top of Microsoft tech. Especially with a long history going back to NT times.
HN, YC, startup environment or academia is a Unix bubble. They all feed into each other. Especially because Linux is gratis which helped all of those to deploy projects/products/papers cheaply. Unix systems traditionally lack much of the upper layers, so it is the responsibility of the company, persons, developers to deal with the OS minutea. You need sysadmins, devops, SREs. Those are common roles again in this Unix bubble. The dependency chains here are usually flatter since it keeps mid-term costs lower.
Other organizations like governments and bigger orgs like banks prioritize having somebody else liable (i.e. they can blame) and they prefer to not hire technical competence in their orgs but rely on other companies. This is where Microsoft gets a lot of clients. You buy a bunch of server licenses. Your Microsoft support person installs them and installs IIS via GUI. And then you just upload your code every now and then. The OS updates, IIS server etc are all the responsibility of Microsoft and the middlemen companies. Minimal competence from the orginal org is required. There are multiple middlemen businesses who all give zero fucks about anything but whatever the immediate downstream from them. This is more usual in already publicly traded huge businesses. Moreover the investors actually mandate certain things that only this kind of layers of irresponsibility can deliver :) So you see this kind of switch happening towards IPOs.
Azure is the cloud labeling and forcing the first paradigm over the second paradigm for Microsoft products. It got lots of support because shareholders liked it. I don't think the original NT design and Microsoft's business model was bad, it actually worked very well. However, shareholders gonna shareholder. So they pushed hard for Microsoft and their clients to move to the "cloud". Microsoft executives saw the huge profit and share value potential of pushing Azure the brand too. It was the AI of 2010s afterall.
Because for some it works. At least I haven't heard the stories I see here yet at my workplace. Also I use some Azure, but apart from some weird UI bugs never had real big issues.
When things must be shipped quickly, shit breaks and corners are cut; large orgs are full of disfunction. Not sure if such insight was worth of setting your own career on fire.
> That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
> My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
> I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node
This is most corporates. I'm sure this was celebrated as as a successful project and congratulations to everyone, along with big bonuses, RSU, raises, and promotions, mostly to other orgs to bring this kind of 'success' to other projects (or other companies). These people mostly are gone in less than 2 years. They continue to take 'wins'.
The VPs are dumb as shit, but they need 'successful' projects that have fancy names that they can present to their exec team.
The 173 agents are to give wins to a large number of people and teams, all these people contributed to this successful project.
If it continues, there will be a lessons learned powerpoint, followed by 10x growth in headcount, promotions to everyone and double down. 270 people can deliver a baby in 1 day and all that.
> This group was now tasked with moving their inherited stack to the new Azure Boost accelerator environment, an effort Microsoft had publicly implied was well underway at Ignite conferences since 2023.
The goal is to attach your projects to something announced by the CEO and ride the career rocketship!
> Few engineers could reliably build the software locally; debugger usage was rare (I ended up writing the team's first how-to guide in 2024); and automated test coverage sat below 40%.
A key clue and explains why so much of what Microsoft puts out is garbage. Wow.
Microsoft Azure has always been a clown show. I've found so many obvious bugs. The quality is not there and never will be. No serious companies rely on it. Use virtually any other vendor or host it yourself.
i run fastapi APIs on linode with cloudflare in front and honestly the simplicity is underrated. predictable billing, docs that match reality, no surprise platform regressions. for a straightforward API workload the hyperscaler tax doesn't make sense unless you genuinely need their scale
At this point, it’s very clear that people nowadays choose Rust mostly to be part of the cult rather than clearly understanding its shortcomings and advantages over languages such as C, C++. It has gotten to the point that some devs after watching a YouTube video criticizing C++ for two hours, announce C++ the worst programming language. Unfortunately, such people become decision makers at giant tech companies too.
Great but then you tie your growth to the support people headcount. Normally you would see enormous costs upfront for R&D and bringing the thing up, then marginal costs when adding capacity (the hardware, mostly)—if capacity is proportional to the number of humans looking after the system, you will soon hit a limit, and the cost won’t look good either.
I've said it before and I'll say it again. I'm glad rust has good package management I really am. However given that aspect, it ends up forming a dependency heavy culture. In situations like this it's hard to use dependencies due to the amount of transitive dependencies some items pull in. I really which this would change. Of course this is a social problem so I don't expect a good answer to come of this....
Any complex system - and these cloud systems must be immensely complex - accumulate cruft and bloat and bugs until the entire thing starts to look like an old hotel that hasn’t been renovated in 30 years.
It’s not inevitable. Absolutely this is true without significant effort, but if you’ve been around the traps for long enough (in enough organisations), you get to see that the level of quality can vary widely. Avoiding the mud-pit does require a whole org commitment, starting from senior leadership.
This story is more interesting, in my opinion, in how quickly things devolved and also how unwilling the more senior layers of the org were to address it. At a whole company level, the rot really sets in when you start to lose the key people that built and know the system. That seems to be what’s happening here, and it does not bode well for MS in the medium term.
The comment comes from the input field on the post form. Not clear it would show up as a comment. The old thread you refer to had little to do with Microsoft per se. Let me known if I can help with the inconsistencies you mention?
> Why do you speak about yourself in the third person?
When you submit a link to HN, there is an entry field for text in addition to the url.
It does not really describe what the text is used for. For links, the content of that field is simply added as the first comment.
Someone who is unfamiliar with the submission process may assume this field should describe what they are submitting, and not format it like a comment.
Then that text gets posted as the first comment and tons of people downvote it, jumping to the conclusion that the weird summary comment is from an AI, and not the submitter describing their own submission.
(I also assumed these comments were AI until someone else pointed this out)
I downvoted this comment for sounding like a summarizing LLM, not adding anything substantial beyond the title of the post, before realizing you were the poster and author.
What's your assessment of AWS and GCP? Do you think it's likely they suffer from some of the same issues (eg the manual access of what should be highly secure, private systems, the instability, the lack of security)?
As a former GCP engineer, no, the systems are not generally unstable or insecure.
There is definitely manual access of data - it requires what was termed “break glass” similar to the JIT mechanism described by the author. However, it wasn’t quite so loose; there were eventually a lot of restrictions on who could approve what, what access you got after approval, and how that was audited.
It was difficult to get into the highest sensitivity data; humans reviewed your request and would reject it without a clear reason. And you could be 100% sure humans would review your session afterwards to look for bad behavior.
I once had to compile a large list of IP addresses that accessed a particular piece of data to fulfill a court order. It took me days of effort to get and maintain the elevated access necessary to do this.
I have a lot of respect for GCP as an engineering artifact, but a significantly less rosy opinion of GCP as an organization and bureaucratic entity. The amount of wasted effort expended on engaging with and navigating the bureaucracy is truly mind-boggling, and is the reason why a tiny feature that took a day to code could take months to release.
The answer to your question is in the public releases. MS went from primary partner (under ROFR) to one of the options. They retain IP rights and API hosting, although in recent weeks we learned that OpenAI was planning a workaround with AWS and Microsoft said they might sue them for that. So the happy marriage is over, it’s more like a custody battle now: https://www.reuters.com/technology/microsoft-weighs-legal-ac...
His writing style is fairly over the top (he is Swiss, and I have seen this before, but not most of the time), but most of the technical content seems true to me.
I work on a heavy azure shop. Talking about thousands of subscriptions and over 70000 compute resources. I gave up on Azure Portal. It goes crazy after I activate my access to the big management groups. It just doesn't work so I started writing my own scripts using the python cli and api.
We also face lots of weird errors and less than stellar support from Microsoft. I'm not going deep into these because I don't want to identify my employer but one of my biggest sources of headaches nowadays is AMPLS. I have no idea who came up with that overcomplicated and prone to failure design.
I think this is especially problematic (from Part 4 at https://isolveproblems.substack.com/p/how-microsoft-vaporize...):
"The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without the full support from the management who do not fully understand the problem nor are incentivized to understand it.
This isn't incentivized in corporate environment.
Noticed how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch. Only big launches matter.
The other corporate problem is that it takes time before the cleanup produces measurable benefits and you may as well get reorged before this happens.
This is the root of the issue. For something like Azure, people are nor fungible. You need to retain them for decades, and carefully grow the team, training new members over a long period until they can take on serious responsibilities.
But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).
> For something like Azure, people are nor fungible
What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.
I would say "systems design" rather than low-demand.
People who can "reduce" a big system to build on a few simple concepts are few and far between. Most people just add more stuff instead.
What is a low-demand area?
A geographic area where there's not abundant opportunity for software developers. Usually everywhere outside the major metro areas. It was primarily meant to discount experiences from SF or Seattle where I'm sure finding talent is easy enough, assuming you are willing to pay.
Its a cool talent filter though, if you higher people the set of people that quit on doomed projects and how fast they quit is a real great indicator of technological evaluation skills.
No joke, I worked at a place where in our copy of system headers we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files that were considered too risky to make changes in that still had dos style near and far pointers that we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...
Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with template magic compared with, you know just making the actual change, reviewing it, and checking it in.
Once you reach this stage, the only escape is to jump ship. Either mentally or, ideally, truly.
You're in an unwinnable position. Don't take the brunt for management's mistakes. Don't try to fix what you have no agency over.
unfortunately, what you will find is that unless you get lucky, the next ship is more of the same.
The system/management style is ingrained in corporate culture of large-ish companies (i would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large").
It stems from the fact that when an executive is bestowed the responsibility of managing a company from the shareholders, the responsibility is diluted, and the agent-principle problem rears their ugly head. When several more layers of this starts growing in a large company, the divergence and the path of least resistance is to have zero trust in the "subordinates", lest they make a choice that is contrary to what their managers want.
The only way to make good software is to have a small, nimble organization, where the craftsman (doing the work) makes the call, gets the rewards, and suffers the consequences (if any). That aligns the agent-principle together.
Hierachy is the enemy of succeding projects and information flow. The more important and complex hierarchy in a culture the less likely it is to have a working software industry. Germanys and japanese endless :"old vs young, seniority vs new, internal vs external, company wide management vs project local management come to mind. Its guerilla vs army, startup vs company allover..
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs
The exact same approach is recommended in the book "Working effectively with legacy code" by Michael Feathers, with several techniques on how to do it. He describes legacy code as 'code with no tests'.
> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.
And that, my friends, is why you want a memory safe language with as many static guarantees as possible checked automatically by the compiler.
Hence the rewrite-it-in-Rust initiative, presumably. Management were aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or severe shortages of systems programming talent.
I was waiting for that comment :) Remember that everybody, eventually, calls into code written in C.
Depends on which OS we are talking about.
I know a few where that doesn't hold, including some still being paid for in 2026.
If 90% of the code I run is in safe rust (including the part that's new and written by me, therefore most likely to introduce bugs) and 10% is in C or unsafe rust, are you saying that has no value?
Il meglio è l'inimico del bene. Le mieux est l'ennemi du bien. Perfect is the enemy of good.
That is an unexpected interpretation. Use the best tool for the job, also factoring what you (and your org) are comfortable with.
If you're sufficiently stubborn, it's certainly possible to call directly into code written in Verilog, held together with inscrutable Perl incantations.
High-level languages like C certainly have their place, but the space seems competitive these days. Who knows where the future will lead.
If you want something extra spicy, there are devices out there that implement CORBA in silicon (or at least FPGA), exposing a remote object accessible using CORBA
You didn’t miss the smiley, did you? :)
Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.
writing tests and then meticulously fixing bugs does not increase shareholders' value.
Or to simplify the product and rebuild.
Exactly. But he’s right about management, first the problem must be acknowledged and that may make some people look bad.
once you reach the stage, the only escape is to give up on it. and move on.
somethings are beyond your control and capabilities
if the service is so shitty, why are people paying so much fucking money for it?
is microsoft committing an accounting fraud?
I worked at a startup that was using Azure. The reason was simple enough - it had been founded by finance people who were used to Excel, so Windows+Office was the non-negotiable first bit of IT they purchased. That created a sales channel Microsoft used to offer generous startup credits. The free money created a structural lack of discipline around spending. Once the startup credits ran out, the company became faced with a huge bill and difficulty motivating people to conserve funds.
At the start I didn't have any strong opinion on what cloud provider to use. I did want to do IT the "old fashioned way" - rent a big ass bare metal or cloud VM, issue UNIX user accounts on it and let people do dev/test/ad hoc servers on that. Very easy to control spending that way, very easy to quickly see what's using the resources and impose limits, link programs to people, etc. I was overruled as obviously old fashioned and not getting with the cloud programme. They ended up bleeding a million dollars a month and the company wasn't even running a SaaS!
I ended up having a very low opinion of Azure. Basic things like TCP connections between VMs would mysteriously hang. We got MS to investigate, they made a token effort and basically just admitted defeat. I raged that this was absurd as working TCP is table stakes for literally any datacenter since the 1980s, but - sad to say - at this time Azure's bad behavior was enabled by a widespread culture of CV farming in which "enterprise" devs were all obsessed with getting cloud tech onto their LinkedIn. Any time we hit bugs or stupidities in the way Azure worked I was told the problem was clearly with the software I'd written, which couldn't be "cloud native", as if it was it'd obviously work fine in Azure!
With attitudes like that completely endemic outside of the tech sector, of course Microsoft learned not to prioritize quality.
The US government’s experts called Azure “a pile of shit”; they got overruled.
https://www.propublica.org/article/microsoft-cloud-fedramp-c...
Because Azure customers are companies that still, in 2026 only use Windows. Anyone else uses something else. Turns out, companies like that don't tend to have the best engineering teams. So moving an entire cloud infrastructure from Azure to say AWS, probably is either really expensive, really risky or too disruptive to do for the type of engineering team that Azure customers have. I would expect MS to bleed from this slowly for a long time until they actually fix it. I seriously doubt they ever will but stranger things have happened.
Turns out outside companies shipping software products aspiring to be the next Google or Apple, most companies that work outside software industry also need software to run their business and they couldn't care less about HN technology cool factor.
They use whatever they can to ship their products into trucks, outsourcing their IT and development costs , and that is about it.
CFOs love it because Microsoft does bundle pricing with office. Plus they love to give large credits to bootstrap lock-in.
most the upper management of companies who use them have dont have the technical competence to see it. (eg: banks, supermarket chains, manufacturing companies)
once they are in, no one likes to admit they made a mistake.
It’s more of a hostage situation.
Because the alternatives are also in similar state.
AWS or GCP are all pretty crap. You use any of them, any you'll hit just enough rough edges. The whole industry is just grinding out slop, quality is not important anywhere.
I work with AWS on a daily basis, and I'm not really impressed. (Also nor did GCP impress me on the short encounter I had with it)
This reads like it was written by the cleverest person in the room. I have to use Azure Devops at work, and some of the critique of Azure rings true, but the author-centric presentation was quite off-putting.
I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely being held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services it's nearly impossible to figure out what service you actually want/need without consultants, and when you finally get the services up who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
I’ve created a bunch of fresh Azure accounts over the past few years and each time I’ve found myself sitting there dumbfounded anew at how garbage the experience is.
There has been weird broken jank at just about every step of the process at one point or another. Like, I’m a serious person trying to set something up for a production workload, and multiple times along the way to just having a working account that I can log into with billing configured, I’ll get baffling error messages like [ServiceKeyDepartureException: Insufficient validation expectancy. Sfhtjitgfxswinbvgtt-33-664322888], and the whole thing will simply not work until several hours later. Who knows why!?
I evaluated some Azure + Copilot Studio functionality for a project recently, which required more engagement with their whole 365 ecosystem than I’d had in a long time and it had many of the same problems but worse. Just unbelievably low quality software for the price and how popular it is. Every step of the way I hit some stupid issue. The people using this stuff are clearly not the people buying it.
I've joked that on some services, when you're clicking buttons, you're actually opening tickets that a human needs to action.
That scenario is an example. You complete an action on a web page and nothing works. You make no further changes and hours later it works perfectly. Your human wasn't fast enough that day.
That's the "digital escort" process mentioned in the very long OP. Understandably, the US government got mad when they found out that cheap Chinese tech support staff were being used for direct intervention on "secure" VMs.
That's not what the "problem" was. It's that cheap American support people were "escorting" foreign Microsoft SWEs, so they could manage and fix services they wrote and were the subject matter experts for in the sovereign cloud instances which they otherwise would have no access to.
And this was NOT for the government clouds we have that hold classified data. Those are air-gapped clouds that physically cannot be accessed by anyone who doesnt have a TS clearance and physically go into a SCIF.
source: I work in a team very closely related the team who designed digital escort.
I would definitely fight against calling anything I work on „digital escort”.
Yes but this misses the underlying point: this is the same software. It suffers from the same defects. If your management stack keeps crashing and leaking VMs you are seeing a reduction in the operational capacity of the fleet. If you are still there just tour Azure Watson and tell me if you’d want the military to rely on that system in wartime? Don’t forget things like IVAS and God knows what else that are used during operations while Azure node agents happily crash and restart on the hosts. The system should be no-touch and run like an appliance, which is predicated on zero crashes or 100% crash resiliency. In Windows Core we pursued a single Watson bucket with a single hit until it was fixed. Different standards.
I'm only commenting on parent comment's understanding of what digital escort process is specifically. Escort is used by all kinds of teams that are just doing day-to-day crap for various resource providers across azure. I've never worked anywhere close to Azure Core so I don't know about these more low-level concerns. Overall I agree and sympathize with your assessment of the engineering culture.
You also make it sound like getting a JIT approved is getting keys to the kingdom. It's not -- every team has it's own JIT policies for their resources. Should there be far less manual touches? Ideally. But JIT is better than persistent access at least, and JIT policies should be scoped according to principle of least privilege. If that is not happening, it's a failure at the level of that specific org.
Policies vary. The node folks get access to the nodes and the fabric controller by necessity.
I guess we agree on the point where it should not be necessary, which echoes Cutler’s original intent of “no operational intervention.”
This is not an impossible task, after all it’s just user-mode code calling into platform APIs.
200 requests a day, lol
> I've joked that on some services, when you're clicking buttons, you're actually opening tickets that a human needs to action.
I just experienced one startup where the buttons just happen to only work during business hours on the US west coast.
> when you're clicking buttons, you're actually opening tickets that a human needs to action
I had one public cloud vendor sales literally admit this was the case with their platform. But they were now selling "the new one" which is supposed to be better.
It was, a lot. But only compared to the old one.
I remember being impressed with the Azure docs... until I spend a week implementing something, only to have it completely fail when deployed to the test environment because GraphAPI did not work as documented. The beautiful docs were a complete lie.
These days I don't even bother looking at the docs when doing stuff with Azure.
I can’t count the number of times the docs have been totally wrong.
And they were actually like that pre-LLM, in 2019, when I was implementing stuff for a car company on azure. They spent _hundreds of thousands_ on cosmosDB, for less performance than a raspberry pi running Postgres.
Pretty surprised to hear this. I would think (assuming they are LLM written as parent suggests), that MS could throw a large context "pro" LLM at the code base and you should get perfect docs, updated every release?
More perfect than a person where I might mistakenly copy/paste or write "Returns 404" but the LLM can probably see actually return a 401.
I'm not a stranger to LLMs hallucinating things in responses but I'd always assumed that disappeared when you actually pointed it at the source vs some nebulous collection of "knowledge" in a general LLM.
Is it your first time using an LLM? No, they generate plausible-sounding bullshit no matter the input. Sometimes that bullshit is useful. Other times it isn't.
I’ve worked with their consultants and they were lovely. They hate Azure too.
I imagine that no one likes Azure.
Azure container apps are a great (idea) and work mostly fine as long as you don’t need to touch them. But they’re just like GCR or what fargate should be - container + resources and off you go.
We ran many internal workloads on ACA, but we had _so may issues_ with everything else around ACA…
Only C level likes Azure
We use Azure for desktops and we pay $600/month for 4 cores, getting performance comparable to a $60 Intel N100 chip.
The only good thing Microsoft azure ever did for me was provide a very easy way to exploit their free trial program in the early 2010s to crypto mine for free. It couldn’t do much, but it was straight up free real estate for CPU mining. $200 or 2 weeks per credit/debit card.
Ah, I did the same, but wasn't the experience/UI back then pretty nice too?
I haven't used azure since then, but I remember the web interface was way more polished than aws and things worked ok (spinning up a VM was fast etc).
So I'm confused by how everyone seems to hate it now.
I used it for MMO goldfarming - circa 2012/2013
Damn that’s impressive. Wasn’t it all command line at the time?
Yeah no shade on the consultants. I’ve worked with some good ones too.
The part about prioritizing "aggressive feature velocity" over "core fundamentals" is true.
The push is as insane as push to AI.
At the same time fundamental improvements like migrating to .net core, or reducing logs is actively deprioritised. If it were not for compliance, we would not have any core engineering improvement at all
Honestly, I was not even aware of rust push, probably cause no one in my org could do rust. I am glad we did not move to AKS though
Question: how would you compare it to AWS or GCP?
We migrated some services to AKS because the upper management thought it was a good deal to get so many credits, and now pods are randomly crashing and database nodes have random spikes in disk latency. What ran reliably on GCP became quite unpredictable.
Exact same story at my place. Upper management decided it's a good idea to build on Azure because Microsoft promised some benefits. Things that ran reliable on GCP now need active firefighting on Azure
Interesting! We're using AKS with huge success so far, but lately our Pods are unresponsive and we get 503 Gateway Timeouts that we really can't trace down. And don't get me started on Azure Blob Tables...
In our case this was only a month ago, and now we're stuck because management thought it was a good idea to sign a hefty spend commitment.
In our case, we spent to much time of engineer time just to put up with Azure but there’s no good ROI. It took sometime for the upper management to realize Azure is shit and cut the cost
Don't they have an SLA? You can break that open if they don't perform.
To what end? I've never seen an SLA which is clear cut enough to be worth pursuing if you want more than a free t-shirt.
> I've never seen an SLA which is clear cut enough to be worth pursuing if you want more than a free t-shirt.
I have, regularly. I am not sure what kind of business you are running but parties that rely on service providers for critical (primary business process driving) components routinely agree to SLAs with large penalties and the ability to open up an existing contract in case of non-performance. Obviously you would have to be willing to pay for such a service in the first place otherwise there is no point in setting up an SLA, this won't be cheap. But we're definitely not talking about 'free t-shirts' here, more about direct liability, per hour penalties and so on.
I'm thinking ISPs, colo, cloud.
By the time SLA thresholds are being breached you've been through months (or years) of pain. They're not strong enough or specific enough to save you from a bad provider. ymmv
Colo and cloud providers that provide real SLAs exist. But they're pricey because they tend to insure against breach of that that SLA and they pass on the cost of that insurance. If you're a run-of-the-mill e-commerce company then it probably doesn't make much sense. But if you yourself are providing critical services to others and they have you by the short hairs in case you don't perform you better make sure that you're not going to end up holding the bag.
One simple example: energy market services, 15 minute ahead and day ahead markets require participants to have the ability to perform or they will be penalized severely, to the point where they can lose that access, the damage of which could easily be in the 10's of millions to 100's of millions depending on their size. Asset owners and utilities both would be able to hit them hard if they do not perform, the asset owners for lost income and the utilities for both government penalties and possibly for outages and all associated costs. These are not the kind of contracts you enter into lightly.
Exactly what I was thinking. But then again, from what I've seen, the persons responsible for monitoring uptimes are often much further removed from the C suite in these "committed-spend" companies.
Gcp is hard to beat on k8s stuff. Performance and stability is crazy good.
But it's not aws are famous and costs money. Hence moving away seems like a good idea :)
A business man at a prior employer sympathetic with my younger, naive "Microsoft sucks" attitude told me something I remember to this day:
Microsoft is not a software company, they have never been experts at software. They are experts at contracts. They lead because their business machine exceeds at understanding how to tick the boxes necessary to win contract bids. The people who make purchasing decisions at companies aren't technical and possibly don't even know a world outside Microsoft, Office, and Windows, after all.
This is how the sausage is made in the business world, and it changed how I perceived the tech industry. Good software (sadly) doesn't matter. Sales does.
This is why most of Norway currently runs on Azure, even though it is garbage, and even though every engineer I know who uses it says it is garbage. Because the people in the know don't get to make the decision.
My lesson was when European companies followed US tech into offshoring, and how quality doesn't play any role as long as the software delivers, from business point of view.
Especially relevant when shipping software isn't the product the company sells.
But that also means that if you as a user/customer can make choices based on technical merits, you'll have a significant advantage.
An advantage how? Maybe you'll have one or two more 9s of uptime than your competitors; does that actually move the needle on your business?
Why wouldn't it move the needle? Less time spent, less frustration, more performance, more resources focused on the business?
The biggest expense in software is maintenance. Better software means cheaper maintenance. If you actually want to have a significant cost advantage, software is the way to go. Sadly most business is about sales and marketing and has little to do with the cost or quality of items being sold.
It will depend on each case and what makes the marketed solution inferior. If it's overly complex and you will save development time. If it's unstable you'll save debugging time. If it's bloated you will save on hardware costs. Etc...
matters less than we would like it to
after all startups/scaleups/bigtech companies that make a lot of money can run on Python for ages, or make infinite money with Perl scripts (coughaws)
and it matters even less in non-tech companies, because their competition is also 3 incompetent idiots on top of each other in a business suite!
sure, if you are starting a new project fight for good technical fundamentals
Most customers don't really have the knowledge needed to make choices based on technical merits, and that's why the market works as it does. I'm willing to say 95% of people on HN have this knowledge and are therefore biased to assume others are the same way. It's classic XKCD 2501.
What are we reading here? These are extraordinary statements. Also with apparent credibility. They sound reasonable. Is this a whistleblower or an ex employee with a grudge? The appearance is the first. Is it? They’ve put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of some concern in the past, I previously had little experience with US corporate norms. What is normal here for such a level of concern?
More, why is this public not a court case for wrongful termination?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
>Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
IME, yes.
I'm currently working as an SRE supporting a large environment across AWS, Azure, and GCP. In terms of issues or incidents we deal with that are directly caused by cloud provider problems, I'd estimate that 80-90% come from Azure. And we're _really_ not doing anything that complicated in terms of cloud infrastructure; just VMs, load balancers, some blob storage, some k8s clusters.
Stuff on Azure just breaks constantly, and when it does break it's very obvious that Azure:
1. Does not know when they're having problems (it can take weeks/months for Azure to admit they had an outage that impacted us)
2. Does not know why they had problems (RCAs we're given are basically just "something broke")
3. Does not care that they had problems
Everyone I work with who interacts with Azure at all absolutely loathes it.
As a former MSFTy it does sound weird to me too. I didn’t see what Axels level was but a lot of people work for Microsoft and not many of them can expect to email the CEO and get a response. It seems a bit like a crash out, not the first I’ve seen levied at Azure, won’t be the last. They probably think it’s a mental health episode, if you’re an important CEO crazy people will email you all the time and the staff probably filter them out before they see it. Also this is a lot of internal gossip, I would be worried that airing this publicly would impinge on future career opportunities, even healthy orgs would appreciate some discretion.
I’m sure everything he said is completely true, Azure is one of the few tech stacks I refuse to work with and the predominant reason I left.
If you’ve joined an org and nothing works the reason is usually that the org is dysfunctional and there is often very little you can do about it, and you’re probably not the first person who’s tried and failed at it.
Never worked at a FAANG, but from what I read from their cultures I don't think a letter to the CEO from a senior engineer would go entirely unnoticed there. CEO's might receive crazy letters, but hopefully not regularly from their senior engineering staff..
In my experience Azure is full of consistency issues and race conditions. It's enough of an issue that I was talking about new OpenAI models becoming available via Bedrock on AWS and how convenient that was since I wouldn't have to deal with Azure and my colleague in enterprise architecture went on an unprompted rant about these exact issues. It's not the first time something like this has happened and I've experienced these issues first hand, so yes. I'd say reliability is a critical issue for Azure and it hasn't gotten better each time I've gone back to check.
I recall seeing some pretty damning reports from a security pentester that was able to escape from a container on Azure and found the management controller for the service was years old with known critical unpatched vulnerabilities. Always been a bit sceptical of them since then
A decent portion of Azure Web Apps internals hasn't moved past .net core 2.3
Large orgs make decisions that prioritize short-term metrics over long-term quality all the time and nobody tracks whether those tradeoffs actually paid off. The decision to ship fast and fix later sounds reasonable in a meeting setting until articles like this surface and the reality comes through clearly.
> sounds reasonable in a meeting setting until articles like this surface
No. It sounds reasonable past that. Because shipping features will make shareholders happy while an article like this will change nothing.
What I meant is that it’s customary to write to the Board through the Secretary as opposed to write directly or through some other channel.
I am sort of confused how NDA and such agreements employees sign would allow for an employee to post such an article without being sued by Microsoft?
Wild guess, touching this with a 10-foot pole risks validating his claims. If they sue for breach of NDA, it means his claims are factually correct, and if they sue for libel and it goes to court, they may be forced to submit documents they don't want to.
Most likely, the author was let go in mass layoff, and they forgot about NDA.
NDAs are usually signed when you join the company, not leave it.
Signing a non-disparagement agreement is often a condition for receiving severance, although I'm not sure what MSFT's policy on this is.
If they can swing it as legit whistleblowing somehow, they might be ok.
Interesting point. Time will tell.
> What are we reading here? These are extraordinary statements. Also with apparent credibility.
I left Microsoft in 2014. Already back then I could see this sort of stuff starting to happen.
The Office Org was mostly immune from it because they had a lot of lifers, people who had been working on the same code for decades and who thought through changes slowly.
But even by 2014 there were problems hiring developers who knew C++, or who wanted to learn it. COM? No way. One one team we literally had to draw straws once to determine who was going to learn how to write native code for Windows.
It wasn't even a talent thing, Windows development skills are a career dead end outside of Microsoft. They used to be a hot commodity, and Microsoft was able to hire the best of the best from industry. Now they have to train people up, and Microsoft doesn't offer any of the employment perks that they used to use to attract top talent (Seattle used to be a low CoL area, everyone had private offices, job stability).
When I started at Microsoft in 2007, the interview bar included deep knowledge of how computers worked. It wasn't unusual to have meetings drop down to talking about assembly code. Your first day after orientation was a bunch of computer parts and you were told to "figure out how to setup your box".
Antivirus wasn't mandatory. The logic was if you got a virus, they made a mistake hiring you and you deserved to be fired.
When your average developer can go that deep on any topic, you can generally leave engineers well enough alone and get good software.
Antivirus wasn’t mandatory in 2007 after the 2003 Blaster Worm, that required no user action to compromise the PC? Wild
On the other hand there was e.g. CVE-2021-1647 where Microsoft's antivirus would compromise the PC with no user action.
(At least I think that's the one I'm thinking of. It's marked as a high-severity RCE with no user interaction but they don't give any details. There was definitely at least one CVE where Windows Defender compromised the system by unsafely scanning files with excessive privileges.)
Maybe they fired everyone who was working there in 2003. Would explain some things.
Yeah I thought that was extreme. An engineer going to the board of any corporation let alone Microsoft is not normal or customary IME. That could explain why they got no response.
Not on day one. Imagine it took two years to get there.
Yes it is that unreliable. Even when given free credits, I would rather pay for the offerings from Amazon/Google.
The CEO is accountable to the board. If they are derelict in their obligations to the company, that's where you need to raise a stink so they can fix it.
Well, yeah, that’s what a board does, but I think the issue is whether it is customary to go to the board directly in this situation. The answer is a resounding NO. Very odd, but cool idea and approach.
Maybe naive, but why not? If it's a serious enough issue, and you're not getting anywhere through your management chain all the way up to the CEO, why is it novel to contact the people the CEO reports to? They're not royalty, they're other human beings who also eat, piss and fart like everyone else.
Before 6 years of Google I’d co-sign what you said, but it never ever plays out that way.
The law of the jungle is an iron law, make people around you feel bad, be a tattletale, and you’re choosing to be ostracized.
That said yr interlocutor disturbs me a bit because yes, they certainly will make it out to be a mental health episode. But the implicit deal there is “STFU. You can even take paid health leave.” It’s not healthy either. BigCo is insane I’ll never work for one again without outrageous comp.
You’d be stunned by even the simplest story. Ex. a year in some crazy shit was going down and my manager asked for my thoughts on a topic, I was honest and basically said “I don’t think it’s a good idea, but in my experience, raising issues involving people only raises more issues.” He swore up and down it wouldn’t be a problem, eventually made a deal I could email it to him privately. Next 1:1 with my area lead was horrible, them seeing red, hearing a mistranslated version of what I said, and I had 0 warning.
I guess you're in the US?
In Europe I speak up all the time, even to people who are not in Europe.
(Usual disclaimed that this is my opinion.)
He is, I think, Swiss, perhaps a cultural difference?
We like things well done, but also integrity and accountability.
Azure is when you have a different version of the same product/api in each region.
The post is so dramatized and clearly written by someone with a grudge such that it really detracts from any point that is trying to be made, if there is any.
From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it rather than trying to lay out all the dirty laundry to an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really? Azure is still chugging along apparently despite everything being mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
>risks to national security and sending letters to the board, really?
Yes, really, and guess what the DoD did on Aug 29, 2025, exactly 234 days after I warned the CEO of potential risks?
https://www.propublica.org/article/microsoft-china-defense-d...
It wasn’t specifically about the escort sessions from any particular country, though, but about the list of underlying reasons why direct node access was necessary.
> Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really?
That struck me too. Maybe i've never worked high enough in an org (im unclear how highly ranked the author of the piece is) but i've never been in an org where going over your boss's boss's boss's boss's head and writing a letter to the board was likely to go well.
That said, i could easily believe that both Azure is an absolute mess and that the author of the piece was fired because of how he went about things.
I didn’t say it went well. Actually I said it didn’t go well :(
I work in azure. It’s a mess, but what large system isn’t. Now extrapolate the to one of the biggest systems in existence.
The only reason a low level employee like OP is emailing satya is because they have a personality disorder or are having a psychotic break, which is pretty clear from OP’s manifesto
Lol, no.
It is true that writing to the board will get you noticed, and that you might not like the consequences. If you value having the job then don’t write to the board. Even if you are right, being noticed like that isn’t going to endear you to your boss.
But if you care more about doing the right thing then writing to the board is the right thing to do. And after a few years of working at Microsoft you might not value your job very much either and you too might decide to go out in style.
Go watch the last episode of Chernobyl again.
Windows is ~500 times bigger than Azure, give or take, by machine count, and still many times larger by loc, modules, users, whatever else you want to measure. The heavy lifting (VM/containers, I/O, the things that cannot not be done just like that) is handled by the Windows folks anyway. The only hard part is the VM placement, everything else is mostly regular software engineering, some of medium-hard complexity but nothing that can excuse the need for constant human intervention.
Thanks for the free psychology assessment, I appreciate it, but I believe I’m fine. The series omits lots of details.
Hi, I hope you are doing good. From my personal experience, complaining about your manager to skip level manager is called Career Suicide.
There is nothing good that can come out of it,, except getting fired.
It is, but “Microsoft runs on trust” they say. They also say the CEO’s inbox is always open, actually the CEO himself says it in the yearly mandatory training video on business conduct. So it should be safe, in theory, to openly speak out in the best interest of the customers, no? Rhetorical question :)
Don't believe everything people say. Watch what they do.
By the way, are you not worried about NDAs and such?
from a philosophy grad. both these responses are logical fallacies.
1: it's bad, but so is everything else (ad populum, everyone does it so it's ok).
2: it can only be because the author has a personality disorder or psychotic break (ad hominem)
It reminded me of this one:
https://wtfmitchel.medium.com/how-to-get-fired-from-microsof...
A lot of similarities, except the medium author was not part of PG but support. He also had recently suffered a brain injury.
Before or after publishing his article?
like 5 minutes after.
Redacted to avoid getting doxxed (my original reply showed disdain for the parent comment and agreed with Axel's writing).
Former 1010 Overlake RnD here too :)
AWS and Google Cloud are both huge and are significantly better in UX/DX. My only experience with Azure was that it barely worked, provided very little in the way of information about why it didn't. I only have negative impressions of Azure whereas at least GC and AWS I can say my experiences are mixed.
> From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra.
Before the days of title inflation across the industry, a a Principal at Microsoft was a rare thing. When I was there, the ratio was maybe 1 principal for every 30 developers. Principals were looked up to, had decades of experience, and knew their shit really well. They were the big guns you called in to fix things when the shit really hit the fan, or when no one else could figure out what was going on.
I believe the author was referring to this https://www.propublica.org/article/microsoft-digital-escorts....
Microsoft hired Chinese engineers to manage US Department of Defense Azure VMs.
Thanks. That reference is correct. The point is why those sessions were necessary because there is no reason, a-priori, to do manual touches on production systems, DoD or not.
Yes it's easy to critique any large system or organisation, to then go over everyone's head and cry to the CEO and Board is snake like behaviour especially offering you self as the answer to fix it. OP will be marked as a troublemaker and bad team member.
Maybe. That would be a dent in the shiny culture of trust Microsoft is proud to run on, though.
> risks to national security
Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
This I'm more sympathetic to. I really don't think his approach of "here's what a rewrite would look like" was ever going to work and it makes me think that there's another side to this story. Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
At no point during the reading I got sense that he's suggesting something radical. Where specifically is he pointing out rewrite?
"The practical strategy I suggested was incremental improvement... This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale." [1]
[1] https://isolveproblems.substack.com/p/how-microsoft-vaporize...
> The current plans are likely to fail — history has proven that hunch correct — so I began creating new ones to rebuild the Azure node stack from first principles.
> A simple cross-platform component model to create portable modules that could be built for both Windows and Linux, and a new message bus communication system spanning the entire node, where agents could freely communicate across guest, host, and SoC boundaries, were the foundational elements of a new node platform
Yes, I read that part as well and found it a bit confusing to reconcile with this one.
The vibe from my quotes is very much "I had a simple from-scratch solution". They mention then slowly adopting it, but it's very hard to really assess this based on just the perspective of the author.
He also was making suggestions about significantly slowing down development and not pursuing major deals, which I think again is not necessarily wrong but was likely to fall on deaf ears.
Interesting point. The two stances are not contradictory. The end result is a new stack, so you are right saying that was the intent. However how you get there on a running system is through stepwise improvements based on componentization and gradual replacement until everything is new. Each new component clears more ground. I never imagined an A/B switch to a brand new system rewritten from scratch.
> Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
I've been involved with FEDRAMP initiatives in the past. That doesn't mean as much as you'd think. Some really atrocious systems have been FEDRAMP certified. Maybe when you go all the way to FEDRAMP High there could be some better guardrails; I doubt it.
Microsoft has just been entrenched in the government, that's all. They have the necessary contacts and consultants to make it happen.
> Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
The author does mention rewriting subsystem by subsystem while keeping the functionality intact, adding a proper messaging layer, until the remaining systems are just a shell of what they once were. That sounds reasonable.
> I've been involved with FEDRAMP initiatives in the past. That doesn't mean as much as you'd think. Some really atrocious systems have been FEDRAMP certified. Maybe when you go all the way to FEDRAMP High there could be some better guardrails; I doubt it.
I never said otherwise. I said that Microsoft services are the defacto tools for FEDRAMP. I never implied that those environments are some super high standard of safety.
> Microsoft has just been entrenched in the government, that's all.
Yes, this is what I was saying.
> The author does mention rewriting subsystem by subsystem while keeping the functionality intact, adding a proper messaging layer, until the remaining systems are just a shell of what they once were. That sounds reasonable.
It sounds reasonable, it's just hard to say without more insight. We're getting one side of things.
Thanks. That was exactly the plan. Full rewrites are extremely risky (see the 2nd System syndrome) as people wrongly assume they will redo everything and also add everything everyone always wanted, and fix all dept, and do it in a fraction of the time, which is delusional and almost always fail. Stepwise modernization is a proven technique.
As someone who had worked adjacent to the functionally-same components (and much more) at your biggest competitor, you have my sympathy.
Running 167 agents in the accelerator? My gawd that would never fly at my previous company. I'd get dragged out in front of a bunch of senior principals/distinguished and drawn and quartered.
And 300k manual interventions per year? If that happened on the monitoring side , many people (including me) would have gotten fired. Our deployment process might be hack-ish, but none of it involved a dedicated 'digital escort' team.
I too have gotten laid off recently from said company after similar situation. Just take a breath, relax, and realize that there's life outside. Go learn some new LLM/AI stuff. The stuff from the last few months are incredible.
We are all going to lose our jobs to LLM soon anyway.
FedRAMP means nothing. It’s a checkbox. National security stuff has a different standard.
It "means nothing" that the way that government systems get set up for government data is all using Microsoft tooling?
I think he did kind of point at the lack of seniority in the org, so I'm not sure he was trying to exaggerate with the titles.
I'm really struck that they have such Jr people in charge of key systems like that.
I've worked at both Microsoft and Google in the past 6 years and the notion that msft "Principal" is equivalent to goog L5 is crazy.
Meaning Msft Principal is below L5? I got the same feedback from one of my friends who works at Google. She said quality of former MSFT engineers now working at Google was noticeably lower.
I mean if you go by pay in the UK a Microsoft principle is equivalent to an L4 at Google if levels.fyi is too be believed....
> risks to national security …really?
Really. Apparently the Secretary of War agrees with him.
In fairness the SECWAR is hardly a computing expert.
But in this case the SECWAR has been properly advised. If anything it's astonishing that a program whereby China-based Microsoft engineers telling U.S.-based Microsoft engineers specific commands to type in ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
Some ideas "sell themselves", ideas like these do the opposite.
> If anything it's astonishing that a program whereby China-based Microsoft engineers telling U.S.-based Microsoft engineers specific commands to type in ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
> It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
That is beyond bad. Proof of this?
https://www.propublica.org/article/microsoft-digital-escorts...
Holy fuck. Ok, this will change things considerably for some companies I'm working with that had moved their stuff to Azure. Thanks. More than I can express on here.
Being compliant with the letter of the requirements at 1/3 of the cost is absolutely an idea that sells itself.
I'd like to suggest calling him SECDEF, not SECWAR.
IMHO the country should not capitulate to Trump's power grabs, even if Congress refuses to perform their oversight duties.
I'm sympathetic to the viewpoint but I'm not in the habit of policing the names people use for themselves.
I've certainly done more than my fair share of jobs in the Navy where the office I was formally billeted to had long since ceased to actually exist as described due to office renamings. Often things as simple as a department section being elevated into a department branch and people using the new name even while they wait 1-2 years for the manpower records to be fixed and the POM process to cycle through for program resourcing. But still, seems hard to treat it as a crime at one level when no one blinked an eye at the lower level.
Maybe Congress will eventually step in, but in the meantime the American voters made their choice about who they want to run these agencies, so...
The main title of the office is still “secretary of defense”, the executive order added a secondary title of the department and the office, it didn't replace the primary titles.
We could call him by what he does: SECMASSMURDERER
The United States does not have a Secretary of War, and has not since 1947.
To be fair, it's not like Hegseth is a super high-signal source. Hegseth says lots of stuff, some of which are even true!
This was such a genuinely weird moment for me when reading the article.
"yadda yadda and then also the secretary of defence agreed it was bad"
I'm just reading along and going, "yeah that sounds really bad if a secretary level position is being cited... wait a second, isn't that actually the guy who is literally famous for being stupid??"
I never expected to be living through a real life version of "the emperor's new clothes", like, how is anyone quoting this guy about anything?
For reference, author was a Senior Software Engineer, ie. mid-level engineer.
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
Or… you’ve just normalised the deviation.
One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
After about three or four weeks everyone adapts, learns what they can and can’t criticise without fallout, and settles into the mud to wallow with everyone else that has become accustomed to the filth.
As an Azure user I can tell you that it’s blindingly obvious even from the outside that the engineering quality is rock bottom. Throwing features over the fence as fast as possible to catch up to AWS was clearly the only priority for over a decade and has resulted in a giant ball of mud that now they can’t change because published APIs and offered products must continue to have support for years. Those rushed decisions have painted Azure into a corner.
You may puff your chest out, and even take legitimate pride in building the second largest public cloud in the world, but please don’t fool yourself that the quality of this edifice is anything other than rickety and falling apart at the seams.
Remind me: can I use IPv6 safely yet? Does it still break Postgres in other networks? Can azcopy actually move files yet, like every other bulk copy tool ever made by man? Can I upgrade a VM in-place to a new SKU without deleting and recreating it to work around your internal Hyper-V cluster API limitations? Premium SSDv2 disks for boot disks… when? Etc…
You may list excuses for these quality gaps, but these kinds of things just weren’t an issue anywhere else I’ve worked as far back as twenty years ago! Heck, I built a natively “all IPv6” VMware ESXi cluster over a decade ago!
> One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
Eh, I don't think this is exactly as reliable as you'd expect.
My previous job had a fairly straight forward code base but had fairly poor reliability for the few customers we had, and the WTF portions usually weren't the ones that caused downtime.
On the other hand, I'm currently working on a legacy system with daily WTFs from pretty much everyone, with a greater degree of complexity in a number of places, and yet we get fewer bug reports and at least an order of magnitude if not two more daily users.
With all of that said... I don't think I've used any of Microsoft's new software in years and thought to myself "this feels like it was well made."
Chugging along? Very clear you're not a customer using Azure.
He might sound like he has a grudge but you sound like you’re personally invested. Shill?
I've seen Azure OpenAI leak other customer's prompt responses to us under heavy load.
https://x.com/DaveManouchehri/status/2037001748489949388
Nobody seems to care.
This is insane, when you say azure OpenAI, do you mean like github copilot, microsoft copilot, hitting openai’s api, or some openai llm hosted on azure offering that you hit through azure? This is some real wild west crap!
The latter, their arrangement with OpenAI enabled this.
I have noticied a similar bug on Copilot. I noticed a chat session with questions that I had no recollection of asking. I wonder if it's related. I brushed it off as the question was generic.
Should be a high severity incident if data isoation has failed anywhere. And that is for SaaS let alone cloud provider.
Did you anomomize those? Did Azure dox them or send the templated version?
Hope that person with the chest pain went to the doctor
That is absolutely insane.
Yeah, I saw over 100 leaked messages.
Fun ones include people trying to get GPT to write malware.
It's a nice read. Thank you for sharing this.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025 —most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
I honestly thought this was one of the weaker points of the article.
The OpenAI deal almost certainly related purely to GPU capacity, which had little to do with the article. The layoffs would have happened regardless.
IMO - churn, and generalization is the root cause. Engineers are thrown on projects for a year with little prior experience, leave others to pickup the pieces, etc. There's no longer a sense of ownership, and I'm sure the recent wave of layoffs isn't helping with this.
GPUs is something that can be fixed simply by throwing money at it.
"For fiscal 2025, Microsoft CEO Satya Nadella earned total pay of $96.5 million, up 22% from a year earlier." -CNBC.com
and
"I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut
https://bsky.app/profile/did:plc:jzhiqz7fb5dj6h7cydluryvn/po...
for anyone else who hasn't seen it
> 2 instances of Outlook
That's 2 too many.
They should have used the third outlook they didn't know about... Outlook, Outlook (new), and the well-hidden Outlook (classic) that actually works.
Well "Outlook (new)" finally stopped OOM-ing on my very normal-sized inbox, so I went back to using it over Outlook Classic... Can't say I notice a difference much these days.
(Not a residential inbox, the "I work in IT" sized inbox with all the email alerts about jobs failing...)
That outlook was part of the ablative outlook armor thats suppose to burn off on reentry
Do you have a source for that? I don’t see what impact consumer email software would have with the composition of the heat shield.
I believe this one would fall under "incongruity theory"
https://en.wikipedia.org/wiki/Theories_of_humor#Incongruity_...
He's saying it's bulky junk that's best torched.
its a joke, no sources required
Some previous colleague of mine has to work with Azure on their day to day, and everything explained in this article makes a lot of sense when I get to hear about their massive rantings of the platform.
12 years ago I had to choose whether to specialize myself in AWS, GCP or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow and click-ops methodology. This article confirms my suspicions at that time, and my colleague experience.
What makes anyone start a new project and think “I know, I’ll use Azure!”? I really don’t get it. Do they have a great sales org? Is it because a phb thinks “well they made Office so it must be good”?
I interviewed with a Dutch energy company migrating infra from AWS -to- Azure and I have no idea what would make them do that (aside from inertia, but then why use Azure in the first place?)
And for some reason Azure usage is rampant in Europe.
In some places the purchasing decisions are not made by technical people. The infrastructure team gets azure budget and that's what they have to work with.
At my work the sales people regularly come to us with some azure discount they got offered on linkedin or some event. Luckily I have the power to tell them to fuck off.
A lot of enterprise orgs are completely helpless without Microsofts' identity solutions. That's what makes it easy to just adopt more and more Microsoft products.
Companies coming from Active Directory and Office.
At one startup I was in, Azure sales proactively reached out to the CEO on LinkedIn and then we were urged to swap off to it.
Where I live (New Zealand) Microsoft is a much larger percentage of IT infrastructure than say Bay Areas startups.
Companies are already used to working with Microsoft. Building on Microsoft's cloud feels natural.
They give free credit to startups if you fill in a few forms.
so does AWS and GCP... but pretty bad if that's the deciding criteria.
At the startup I worked at in 2023, Azure was considered the only “safe” way to use OpenAI APIs in prod (eg agreements that the data couldn’t be used for training).
Working with Azure was one of the worst parts of that job.
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
That is quite scary
It is kind of a fundamental risk of IMDS, the guest vms often need some metadata about themselves, the host has it. A hardened, network gapped service running host side is acceptable, possibly the best solution. I think the issue is if your IMDS is fat and vulnerable, which this article kind of alludes to.
There’s also the fact that azure’s implementation doesn’t require auth so it’s very vulnerable to SSRF
You could imagine hosting the metadata service somewhere else. After all there is nothing a node knows about a VM that the fabric doesn’t. And things like certificates comes from somewhere anyway, they are not on the node so that service is just cache.
Hosting IMDS on the host side is pretty much the only reasonable way to provide stability guarantees. It should still work even if the network is having issues.
That being said, IMDS on AWS is a dead simple key-value storage. A competent developer should be able to write it in a memory-safe language in a way that can't be easily exploited.
Ah yes great point, awesome article by the way —- thought provoking, shocking, really crazy stuff. Hopefully some good comes of it, godspeed.
This is well documented: https://learn.microsoft.com/en-us/azure/virtual-machines/ins...
Why would an Azure customer need to query this service at all? I was not aware this service even exists- because I never needed anything like it. AFAI can tell, this service tells services running on the VM what SKU the VM is. But how is this useful to the service? Any Azure users could tell how they use IMDS? Thanks!
> Why would an Azure customer need to query this service at all? I was not aware this service even exists- because I never needed anything like it.
The "metadata service" is hardly unique to Azure (both GCP & AWS have an equivalent), and it is what you would query to get API credentials to Azure (/GCP/AWS) service APIs. You can assign a service account² to the VM¹, and the code running there can just auto-obtain short-lived credentials, without you ever having to manage any sort of key material (i.e., there is no bearer token / secret access key / RSA key / etc. that you manage).
I.e., easy, automatic access to whatever other Azure services the workload running on that VM requires.
¹and in the case of GCP, even to a Pod in GKE, and the metadata service is aware of that; for all I know AKS/EKS support this too
²I am using this term generically; each cloud provider calls service accounts something different.
I use GCP, but it also has the idea of a metadata server. When you use a Google Cloud library in your server code like PubSub or Firestore or GCS or BigQuery, it is automatically authenticated as the service account you assigned to that VM (or K8S deployment).
This is because the metadata server provides an access token for the service account you assigned. Internally, those client libraries automatically retrieve the access token and therefore auth to those services.
We run a significant amount of stuff on spot-instances (AKS nodes) and use the service detect, monitor and gracefully handle the imminent shutdown on the Kubernetes side.
https://learn.microsoft.com/en-us/azure/virtual-machines/lin...
There is a bunch of things a VM needs when first starting from a standard image. Think certificates and a few other things.
Mainly for getting managed-identity access tokens for Azure APIs. In AWS you can call it to get temporary credentials for the EC2’s attached IAM role. In both cases - you use IMDS to get tokens/creds for identity/access management.
Client libraries usually abstract away the need to call IMDS directly by calling it for you.
Thank you, and everyone else who responded. So then this type of service seems to be used by other cloud providers (AWS). What makes this Azure service so much more insecure than its AWS equivalent?
Thanks again!
[edited phrasing]
Having it running on host (!), and the metadata for all guest VMs stored and managed by the same memory/service (!!), with no clear security boundary (!!!).
It's like storing all your nuke launch codes in the same vault, right in the middle of Washington DC national mall. Things are okay, until they are not okay.
What happens when someone asks an AI model to fuzz test that...
Managed identity is enabled via that endpoint, for example.
To have a new vm configure itself at boot
Scary is the understatement of the day. I can't imagine the environment where someone think that architecture is a good idea.
Instead of zero trust, it is 110% trust.
Like, what did the OP expect?
This reads pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he described. It took constant force of will, fighting, escalation, etc to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
The personal account makes a lot of sense, although I could easily see why the OP was not successful. Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
The second thing is that this series of blog posts (whether true or not, but still believable) provides a good introduction to vibe coders. These are people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" due to LLMs.
Writing the actual code itself (fully or partially) maybe yes. But understanding the complexity of the system and working with organisational structures that support it is a completely different ball game.
I disagree.
I've worked on honing my communication skills for 20 years in this industry. Every time I have failed to get the desired result, I have gone back to the drawing board to understand how I can change how I'm communicating to better convey meaning, urgency, and all that.
After all that I've finally had an epiphany. They simply don't care. They don't care about quality, about efficiency, about security. They don't care about their users, their employees, they don't care about the long term health of the company. None of it. Engineers who do care will burn out trying to "do their job" in the face of management that doesn't care.
It's getting worse in the tech industry. We've reached the stage where leaders are in it only for themselves. The company is just the vehicle. Calls for quality fall on deaf ears these days.
This will explain it too you:
https://www.youtube.com/watch?v=rStL7niR7gs&list=PLInW-j_Odl...
> Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
I was thinking like this for a while but, now, I think this expectation is majorly false for a senior individual contributor. Especially when someone who can push out a detailed series of blogposts and has tried step-wise escalation.
Communication is a two-way street. Unlike the individual contributors, the management is responsible for listening and responding to risk assesments by the senior members and also ensuring that the technical competence and experienced people are retained in a tech company. If a leader doesn't want to keep an open ear, they do not belong there. If there is a huge attrition of highly senior people from non-finalized projects, you do not belong leadership either. Both cases are mentioned in the article.
Unfortunately our socioeconomic and political culture in the West has increasingly removed responsibilities and liabilities from the leadership of the companies. This causes people with lackluster technical, communication and risk assesment mentality being promoted into leadership positions.
So outside of a couple completely privately owned companies or exceptionally well organized NGOs, it will be increasingly difficult to find good leaders.
OP was not successful because they didn't want to fix the problems he discussed. I have been in the same exact situation, and no level of communication skills would have been successful in changing their minds.
Even before vide coding this problem existed.
The truth is, only small companies build good stuff. Once a company becomes big enough, the main product that it originally started on is the only good thing that is worth buying from them - all new ventures are bound to be shit, because you are never going to convince people to break out of status quo work patterns that work for the rest of the company.
The only exception to this has been Google, which seems to isolate the individual sectors a lot more and let them have more autonomy, with less focus on revenue.
Absolutely textbook "Brilliant Jerk". Dude just whines and whines and whines. If you're so good, why can't you get anybody to work with you?
I did not get that impression at all. He mentioned quite a few conversations with partner level employees, technical fellow, principal managers.
The impression I got is he tried to fix things, but the mess is so widespread and decision makers are so comfortable in this mess that nobody wants to stick their necks out and fix things. I got strong NASA Challenger vibes when reading this story…
My read is he was not Sr enough in the org to drive any effort to improve things, and could not get someone who was to do it either.
Axel's engagement with the issue and refusal to give up is admirable. It also demonstrates that code and architecture remain important even in an era when managers believe these subjects can now be handled by LLMs. Imagine if LLMs were mandated for use in such an environment, further distancing SWEs from the code and overarching architectural choices. I am not saying that it can't work. But friction and maturity through experience really matters.
Also explains perfectly why I never met an engineer who was eager to run workloads on Azure. In orgs I worked, either the use of Azure was mandated by management (probably good $$ incentives) or through Microsoft leaning into the "Multi-Cloud for resilience" selling point, to get Orgs shift workloads from competitors.
Its also huge case for open (cloud) stack(s).
On a leadership level it seems problematic that they ghosted the feedback. Direcly this leads to people like Axel who feel ownership of the problem to break NDAs and create company harming posts. In my experience they at least respond with corp speak platitudes meaning that they got the feedback and don't understand it or ignore it, but have been taught to always ask for feedback and answer it (but incentives are to ask for feedback, then ignore it).
To be honest, I don’t think this is “company harming”—what would be harming is Azure being pwned if they didn’t know and did nothing, or failing SLA at the wrong time. Now they know.
I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality, slowness. I was in GitHub forums, helping other customers suffering from lack of basic functionality, incredible prices with abysmal performance. This article explains a lot honestly.
Google’s Cloud feels like the best engineered one, though lack of proper human support is worrying there compared to AWS.
Unless you work in Alphabet's marketing department, then no GCP isn't the best one. The most reliable cloud has always been AWS by a wide margin. The exec in charge of GCP has had to apologize in public on multiple occasions for GCP's reliability problems. Sounds like they have fixed them by now (years later) but that doesn't make up the disaster that was BigQuery.
Also, GCP is more focused on smaller customers so perhaps that's the part that works for you. AWS can be a bit daunting. But AWS actually versions their APIs and publishes roadmaps and timelines for when APIs get added and retired and what you should use instead. GCP will just cancel things on short notice with no replacement.
I thought that about GCP until I used it more seriously and kept running into issues where they didn’t have some feature AWS had had for ages, and our Google engineers kept saying the answer was to run your own service in Kubernetes rather than use a platform service which did not give me confidence that they understood what the business proposition was.
GCP's support is abysmal. Our assigned customer support agent has changed 3 times in as many months. it's really a dice roll if our quota increase requests are even acknowledged or we can get clarification on undocumented system limitations.
For some reason, MS is still doing well. I’m not sure what conclusions I should draw from that, other than big businesses are hard to kill?
I was a career Microsoft stack developer until Azure. Comparing it to AWS immediately forced me to make a decision to move away from their stack and towards AWS.
Just the networking and security infrastructure was complete trash compared to how those things worked in AWS.
Not one regret in my decision.
The only time I used Azure was for setting up Microsoft as a provider for authentication. Put me through a never-ending loop of asking for a Government of India issued document that was already submitted. Human support was non-existent. Decided never to use Azure in any product after that horrible experience.
If you cannot even get auth right I shudder to think what the rest of the product will be like to deal with should issues arise.
> Cutler’s intent was to produce a system with the same level of quality, unshakable reliability, and attention to detail he was famous for in his work on VMS and NT.
I'm not sure whether this is serious or irony.
> was maintaining in-memory caches containing unencrypted tenant data, all mixed in the same memory areas, in violation of all hostile multi-tenancy security guidelines
Splitting caches to different isolated memory areas will not make shareholders happy, will not lead to promotion and will not even move the project forward.
Simply put, designing secure software is detrimental in that environment.
I tried to use Azure once (more than 5 years ago), and the signing up kept crashing on me for hours. Never used it again since then. Some things are obvious.
We run 1000s of machines in Azure. It's garbage. Very few features work. Nodes are always having strange issues, especially on the networking side. And the worst part is that Azure support has 0 interest in actually debugging things. We just got out of an outage today caused by the insanely slow SSDs that they attach to their postgres dbs by default.
from part 2:
> Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
Rust really going for the node ecosystem's crown in package number bloat
Rust is nowhere close to Node in terms of package number bloat. Most Rust libraries are actually useful and nontrivial and the supply chain risk is not necessarily as high for the simple reason that many crates are split up into sub-crates.
For example, instead of having one library like "hashlib" that handles all different kinds of hashing algorithms, the most "official" Rust libraries are broken up into one for sha1, one for sha2, one for sha3, one for md5, one for the generic interfaces shared by all of them, etc... but all maintained by the same organization: https://github.com/rustcrypto/
Most crypto libraries do the same. Ripgrep split off aho-corastick and memchr, the regex crate has a separate pcre library, etc.
Maybe that bumps the numbers up if you need more than one algorithm, but predominantly it is still anti-bloat and has a purpose...
I am sensing a "is-odd" and "is-even" vibes from that approach.
While i agree the exact line “rust libraries are useful and non-trivial” i have heard from all over the place as if the value of a library is how complex it is. The rust community has an elitist bent to it or a minority is very vocal.
Supply chain attacks are real for all package registries. The js ones had more todo with registry accounts getting hacked than the compromised libraries being bad or useless.
It really is about time that somebody do something about it.
Start with tokio. Please vend one dependency battery included, and vendor in/internalize everything, thanks.
There is a difference between individual packages coming out of a single project (or even a single Cargo workspace) vs them coming out of completely different people.
The former isn't a problem, it is actually desirable to have good granularity for projects. The latter is a huge liability and the actual supply chain risk.
For example, Tokio project maintains another popular library called Prost for Protobufs. I don't think having those as two separate libraries with their own set of dependencies is a problem. As long as Tokio developers' expertise and testing culture go into Prost, it is not a big deal to have multiple packages. Similarly different components of the Tokio itself can be different crates, as long as they are built and tested together, them being separate dependencies is GOOD.
Now to use Prost with a gRPC server, I need a different project: tonic which comes from a different vendor: Hyperium. This is an increased supply chain risk that we need to vet. They use Prost. They also use the "h2" crate. Now, I need to vet the code quality and the testing cultute of multiple different organizations.
I have a firm belief that the actual People >>> code, tooling, companies and even licensing. If a project doesn't have (or retain) visionary and experienced developers who can instill good culture, it will ship shit code. So vetting organizations >> vetting indiviual libraries.
Power Platform is of the same quality, I’d avoid it if possible.
I was a principal engineer in the Power Platform org and it always felt like a disorganized mess. Multiple reorganizations per year, changing priorities and service ownership.
These days, at work, I need to support applications build on Azure and Power Platform. Both are a hot mess. We get notifications that our APIM is down for at least 15min every weeks at random times. Power Platform is just a "preview" mess, things break and are not functional.
I complained about it and basically was told to shut up, the industry is using them, so they must be right.
No one is testing anything anymore.
What were the issues behind "APIM down"?
It's a bit astounding to realize Ballmer was good.
I was always very curious why people are using azure. Clunky difficult to setup and crazy prices. I know a person being very happy with them because of the credits they gave it to him. I felt I probably don't have a model that explains what is going on there and that would be cool to know why people pay them vs the competion
In my experience Azure endpoint versus openAI endpoint was way faster and significantly cheaper.
Well, part 3 at least explains something I've observed; the platform is incredibly unstable. The same calls, with the same parameters, will often randomly fail with HTTP 400 errors, only to succeed later(hopefully without involving support). That made provisioning with terraform a nightmare.
I won't even dive too much into all the braindead decisions. Mixing SKUs often isn't allowed if some components are 'premium' and others are not, and not everything is compatible with all instances. In AWS, if I have any EBS volume I can attach it to any instance, even if it is not optimal. There's no faffing about "premium SKUs". You won't lose internet connectivity because you attached a private load balancer to an instance. Etc...
At my company, I've told folks that are trying to estimate projects on Azure to take whatever time they spent on AWS or GCP and multiply by 5, and that's the Azure estimate. A POC may take a similar amount of time as any other cloud, but not all of the Azure footguns will show themselves until you scale up.
I’ve been working with Azure and Azure Germanyfor the past years and have a strong history with AWS.
I cannot count how many times disks were not attaching during AKS rescheduling. We build polling where we polled Entra Id for minutes until it became “eventually” consistent - not trusting a service principal until it was fetched at least one minute consistently. The slowness of Azure Functions was unbearable. On Azure germany IoT Hubs had to be “rebooted” by support constantly - which was a shocking statement in itself. The docs always lying or leaving out critical parts. The whole Premium vs Standard stuff is like selling windows licenses. The role model and UI is absolutely inconsistent.
The stability, consistency of IAM, and speed of AWS in comparison makes me truly wonder how anyone stays with Azure. One reason might be that the Windows instances are significantly cheaper though..
Thanks for that, now I have a rock-solid argument when people say "oh we're already Microsoft customers, we'll just use Azure, it's easier, and they have Active Directory!!"
So this is why GitHub is having so many problem…
My most memorable anecdote from working in Azure is that they had two products named Purview and the internal MS people I talked to never figured out which one I was trying to use.
Astronauts have the same problem now.
I have been in a Microsoft adjacent company (meaning lots of people bounced to and from Microsoft to it) and all this makes a lot of sense. The almost ideological “everything in house” and politically oriented philosophy they had fits like a glove. Some of the ex Microsoft people hated it, some of them missed it. But the picture they made was pretty bleak.
Given how windows is going what’s described in the article doesn’t seem so shocking either. Even though they need not be correlated products, I can’t help but seeing a similar shortsightedness in the playbooks they are adopting.
Title: How Microsoft Vaporized a Trillion Dollars
As an investor, this is exact how I feel. Everything was skyrocketing until OpenAI “diversified” mid-2025. The company’s market value has dropped more than 1 trillion since late October 1025, so the title is factual. You can rightfully argue and be skeptical about the link I make, but not about the numbers :)
OK, *2025
This read was a blast from the past. I'm not going to comment on much from OP and instead give a little of my experience there.
Straight out of college in 2017 I joined the Compute Fabric Controller (FC) org as a SWE on an absolutely wonderful team that dealt with mostly container management, VM and Host fault handling & repair policies, and Fabric to Host communication with most of our code in the FC. I drove our team's efforts on the never ending "Unhealthy" node workstream, the final catch-all bucket in the Host fault handler mentioned in OP. I also did heavy work in optimizing repair policies, reactive diagnostics for improved repairs and offline analysis, OS and HW telemetry ingestion from the Host like SEL events into the repair manager in real time, wrote the core repair manager state machine in the new AZ level service that we decoupled from the Fabric, drove Kernel Soft Reboot (KSR)/Tardigrade as a repair action for minimal VM impact for some host repairs, and helped stand up into eventually owning a new warm path RCA attribution service to help drive the root underlying causes of reliability issues and feed some offline analysis back into the live repair manager.
The work was difficult but also really really interesting. For example, Balancing repair policies around reliability is tricky. There's a constant fight in repair policies in grey situations between minimizing total VM downtime vs any VM interruptions/reboots/heals at all, because the repair controller doesn't have perfect information. If telemetry is pointing to VMs being degraded or down on the host, yet in reality they're not, we are the ones inducing the VM downtime by performing an impactful repair. If we wait a little while before taking an impactful repair action, it may be a transient issue that will resolve itself in the moment, at which point we can do much less impactful repairs after like Live Migration if the host is healthy enough. On the flipside, if some telemetry is saying the VMs are up yet they're down in reality and we just don't know it yet, taking time to collect diagnostics and then take a repair action(s) leads to only more overall total downtime.
When I joined in 2017 our team was 7 or 8 people including myself, yet had enough work for at least double that amount of people. On-call was a nightmare the first 2 years. Building Azure back then was like trying to build a car while already sitting behind the steering wheel of that car as it was already barreling down the highway. Everyone on my immediate team the first couple years were a joy to work with, highly competent, hard working, and all of us working absurd hours. For me 60hrs/wk was avg, with many weeks ~80 and a few weeks ~100. Other than the hours though, it was a splendid team environment and I'd like to think we had good engineering culture within our team, though maybe I'm biased. Engineering culture and quality did however vary substantially between orgs and teams. We were heavily under resourced and always needed more headcount, as did nearly every other team in Azure Compute. That never changed during my tenure even though my team's size ballooned to ~20 by 2020, and eventually big enough to where we had to split the team. There was high turnover from the lack of headcount and overwork which was somewhat alleviated by lowering the hiring bar... which obviously opened up another can of worms. This resourcing issue might explain, in part, why Azure is the way it is. We were always playing catchup as a result of the woes of chronic understaffing for years. I eventually burnt out which turned into spiraling mental health, physical health issues, constant panic attacks, and then a full blown mental health crisis after which I took LOA and eventually left the company. I came back briefly for a bit during LOA, and learned that the RCA service I'd built with the help of a coworker (who also left Azure) and was only a small part of our overall workload, had turned into a full fledged team of 9 people dedicated to working on that service in my absence. I know that stating some of this might affect my employment in the future but I don't really care. I know I'm not alone in experiencing burnout working in Azure. It wasn't my manager's fault either, he was amazing. He'd often ask and I would incorrectly yet confidently reassure him that I wasn't burning out but I simply didn't notice the signs. Things are better now though and I'm just happy to be here.
Kudos to the many brilliant people I worked alongside there, I hope you're all doing great.
The first and most important lesson, that I try to each every young developer starting in the industry: Go home after clocking in your hours negotiated in your contract. Drop your pen. Go home. Sleep well.
And I hope, that every sensible senior developer in here does the same. Lead by example. Maybe it would prevent a few burnouts in this industry.
And if you are a manager, then send your people home after they have clocked in their negotiated hours. For their own well-being. It’s your responsibility. And if it’s not working, then force them to go home.
I hope you are better by now and got through the tough time. All the best for you!
2 years of 60+ hours weeks is not good engineering culture, or any kind of culture.
What a fascinating view into how the sausage is made
We signed up to go all-in on Azure because our CEO got an xbox to take home to his kids.
Substack is having its moment. First, deepdelver, now this.
"isolveproblems", really?
I just do not understand how Azure has the scale it does. You only need to login and click around for a bit to see this is not a coherent system designed by competent people. Let alone try and actually build something on it.
Who are the customers? Who is buying this shit?
From my old experience in IT - people just default to Microsoft for everything. They don't want to hassle with learning anything else and assume better the devil you know. Glad I'm out of that world but its wild what people will put up with.
Microsoft shops. Lots of C# devs gravitate to it naturally. I’m glad I abandoned the MS stack over a decade ago.
.NET Core runs just as well on ECS though. And C# tooling is rock solid in VS Code on Mac. No need to touch Azure or Windows.
People and organizations that built things on top of Microsoft tech. Especially with a long history going back to NT times.
HN, YC, startup environment or academia is a Unix bubble. They all feed into each other. Especially because Linux is gratis which helped all of those to deploy projects/products/papers cheaply. Unix systems traditionally lack much of the upper layers, so it is the responsibility of the company, persons, developers to deal with the OS minutea. You need sysadmins, devops, SREs. Those are common roles again in this Unix bubble. The dependency chains here are usually flatter since it keeps mid-term costs lower.
Other organizations like governments and bigger orgs like banks prioritize having somebody else liable (i.e. they can blame) and they prefer to not hire technical competence in their orgs but rely on other companies. This is where Microsoft gets a lot of clients. You buy a bunch of server licenses. Your Microsoft support person installs them and installs IIS via GUI. And then you just upload your code every now and then. The OS updates, IIS server etc are all the responsibility of Microsoft and the middlemen companies. Minimal competence from the orginal org is required. There are multiple middlemen businesses who all give zero fucks about anything but whatever the immediate downstream from them. This is more usual in already publicly traded huge businesses. Moreover the investors actually mandate certain things that only this kind of layers of irresponsibility can deliver :) So you see this kind of switch happening towards IPOs.
Azure is the cloud labeling and forcing the first paradigm over the second paradigm for Microsoft products. It got lots of support because shareholders liked it. I don't think the original NT design and Microsoft's business model was bad, it actually worked very well. However, shareholders gonna shareholder. So they pushed hard for Microsoft and their clients to move to the "cloud". Microsoft executives saw the huge profit and share value potential of pushing Azure the brand too. It was the AI of 2010s afterall.
The VPs who think that they got a good deal by combining with o365
Because for some it works. At least I haven't heard the stories I see here yet at my workplace. Also I use some Azure, but apart from some weird UI bugs never had real big issues.
No idea but I think it's in half or more of the job ads I see in the Netherlands. I don't get it.
This makes it extra silly to trust that Github won't train on your private repos, if they haven't already - just by accident
When things must be shipped quickly, shit breaks and corners are cut; large orgs are full of disfunction. Not sure if such insight was worth of setting your own career on fire.
> That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
> My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
> I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node
This is most corporates. I'm sure this was celebrated as as a successful project and congratulations to everyone, along with big bonuses, RSU, raises, and promotions, mostly to other orgs to bring this kind of 'success' to other projects (or other companies). These people mostly are gone in less than 2 years. They continue to take 'wins'.
The VPs are dumb as shit, but they need 'successful' projects that have fancy names that they can present to their exec team.
The 173 agents are to give wins to a large number of people and teams, all these people contributed to this successful project.
If it continues, there will be a lessons learned powerpoint, followed by 10x growth in headcount, promotions to everyone and double down. 270 people can deliver a baby in 1 day and all that.
In part 2
> This group was now tasked with moving their inherited stack to the new Azure Boost accelerator environment, an effort Microsoft had publicly implied was well underway at Ignite conferences since 2023.
The goal is to attach your projects to something announced by the CEO and ride the career rocketship!
> Few engineers could reliably build the software locally; debugger usage was rare (I ended up writing the team's first how-to guide in 2024); and automated test coverage sat below 40%.
A key clue and explains why so much of what Microsoft puts out is garbage. Wow.
> Few engineers could reliably build the software locally
I've just listened to Longhorn story on Monday and have heard the same thing.
Could you link the story by any chance? I've been using Longhorn for a while and on one particular system, it has an odd tendency to corrupt XFS.
This reads like Google culture too...
Microsoft Azure has always been a clown show. I've found so many obvious bugs. The quality is not there and never will be. No serious companies rely on it. Use virtually any other vendor or host it yourself.
This is an insanely blunt look into some serious issues with microsoft.
i run fastapi APIs on linode with cloudflare in front and honestly the simplicity is underrated. predictable billing, docs that match reality, no surprise platform regressions. for a straightforward API workload the hyperscaler tax doesn't make sense unless you genuinely need their scale
At this point, it’s very clear that people nowadays choose Rust mostly to be part of the cult rather than clearly understanding its shortcomings and advantages over languages such as C, C++. It has gotten to the point that some devs after watching a YouTube video criticizing C++ for two hours, announce C++ the worst programming language. Unfortunately, such people become decision makers at giant tech companies too.
"The company formalized the idea that defects could be fixed through human intervention on live production systems" (From Part 5).
Uh...yeah. I think we all realized that years ago.
Great but then you tie your growth to the support people headcount. Normally you would see enormous costs upfront for R&D and bringing the thing up, then marginal costs when adding capacity (the hardware, mostly)—if capacity is proportional to the number of humans looking after the system, you will soon hit a limit, and the cost won’t look good either.
I've said it before and I'll say it again. I'm glad rust has good package management I really am. However given that aspect, it ends up forming a dependency heavy culture. In situations like this it's hard to use dependencies due to the amount of transitive dependencies some items pull in. I really which this would change. Of course this is a social problem so I don't expect a good answer to come of this....
Environment is part of the package management. As it stand, it's better than npm only because it is in rust.
That bar is screwed to the floor.
"The company formalized the idea that defects could be fixed through human intervention on live production systems"
Uh...yeah. I think we all realized that years ago.
Any complex system - and these cloud systems must be immensely complex - accumulate cruft and bloat and bugs until the entire thing starts to look like an old hotel that hasn’t been renovated in 30 years.
It’s not inevitable. Absolutely this is true without significant effort, but if you’ve been around the traps for long enough (in enough organisations), you get to see that the level of quality can vary widely. Avoiding the mud-pit does require a whole org commitment, starting from senior leadership.
This story is more interesting, in my opinion, in how quickly things devolved and also how unwilling the more senior layers of the org were to address it. At a whole company level, the rot really sets in when you start to lose the key people that built and know the system. That seems to be what’s happening here, and it does not bode well for MS in the medium term.
til: there’s individuals/people that "trusted" azure at all
I only used that shit platform because some Microsoft consultant convinced idiotic C-suite that Azure was the future.
A former Azure Core engineer’s 6-part account of the technical and leadership decisions that eroded trust in Azure.
Why do you speak about yourself in the third person?
Also, after this:
https://news.ycombinator.com/item?id=20341022
You continued to work at Microsoft and now there is this takedown?
I'm no friend of MS (to put it very mildly) but it seems to me your story is a bit inconsistent as well as the 7 year break between postings.
The comment comes from the input field on the post form. Not clear it would show up as a comment. The old thread you refer to had little to do with Microsoft per se. Let me known if I can help with the inconsistencies you mention?
> Why do you speak about yourself in the third person?
When you submit a link to HN, there is an entry field for text in addition to the url.
It does not really describe what the text is used for. For links, the content of that field is simply added as the first comment.
Someone who is unfamiliar with the submission process may assume this field should describe what they are submitting, and not format it like a comment.
Then that text gets posted as the first comment and tons of people downvote it, jumping to the conclusion that the weird summary comment is from an AI, and not the submitter describing their own submission.
(I also assumed these comments were AI until someone else pointed this out)
Could not have said it better myself. Thanks.
AH! Thanks, that's useful context!
I downvoted this comment for sounding like a summarizing LLM, not adding anything substantial beyond the title of the post, before realizing you were the poster and author.
I didn’t know that “subtitle” would appear as first comment.
huh, i didn't realise that's what that does either
What's your assessment of AWS and GCP? Do you think it's likely they suffer from some of the same issues (eg the manual access of what should be highly secure, private systems, the instability, the lack of security)?
As a former GCP engineer, no, the systems are not generally unstable or insecure.
There is definitely manual access of data - it requires what was termed “break glass” similar to the JIT mechanism described by the author. However, it wasn’t quite so loose; there were eventually a lot of restrictions on who could approve what, what access you got after approval, and how that was audited.
It was difficult to get into the highest sensitivity data; humans reviewed your request and would reject it without a clear reason. And you could be 100% sure humans would review your session afterwards to look for bad behavior.
I once had to compile a large list of IP addresses that accessed a particular piece of data to fulfill a court order. It took me days of effort to get and maintain the elevated access necessary to do this.
I have a lot of respect for GCP as an engineering artifact, but a significantly less rosy opinion of GCP as an organization and bureaucratic entity. The amount of wasted effort expended on engaging with and navigating the bureaucracy is truly mind-boggling, and is the reason why a tiny feature that took a day to code could take months to release.
TLDR: It turns out that Nadella despite being an engineer is actually quite bad at managing engineering. Who would have thought?
I thought he was a PM.
What an epic takedown.
Microsoft should have promoted this guy instead of laying him off.
Did Microsoft really lose OpenAI as a customer?
The answer to your question is in the public releases. MS went from primary partner (under ROFR) to one of the options. They retain IP rights and API hosting, although in recent weeks we learned that OpenAI was planning a workaround with AWS and Microsoft said they might sue them for that. So the happy marriage is over, it’s more like a custody battle now: https://www.reuters.com/technology/microsoft-weighs-legal-ac...
New trollaxor dropped.
The first couple of paragraphs felt like a parody of a guy who goes to a diner and gets upset the waitress doesn’t address him as Dr.
It didn’t get any better.
His writing style is fairly over the top (he is Swiss, and I have seen this before, but not most of the time), but most of the technical content seems true to me.
In all fairness, you are right :)
Do not forget all that femboy stuff.
pardon?
I've worked in Windows for many many years, no idea who this guy is. He is randomly name dropping. He wants attention.
and what's with the parents? for running containers?