Nobody uses Amex for payments, so the system isn't ever under high load.
Just kidding!
I find the idea quite good, and have to assume that the number of payment failures they experience due to partitions/outages isn't very high and that the post-payment reconciliation and reclamation process gives them the liberty to rank availability a bit higher than correctness.
One thing that looked a bit shaky was the interplay between the global transaction router's state of knowing which cells can handle a particular payment and the asynchronous distribution of the "failover data", which I presume it needs to know to route correctly. To me that seems to create a window where it might route to the wrong cell due to an outdated routing state.
It also doesn't go into the HA setup of the global transaction router itself.
But still, I kind of like the design.
>To me that seems to create a window where it might route to the wrong cell due to an outdated routing state.
But if the router sends to the wrong cell, the cell will either send it back to be rerouted, or it will fail and the router will try again (or report back the failure so upstream can try again, I assume).
That would be the good case.
But what if the cell doesn't know that, and it's holding, for example, a stale account number?
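The "good case" from a couple of comments up (a cell rejects a request it doesn't own and the router tries again) can be sketched roughly like this. To be clear, this is not how Amex does it; the article doesn't say. `Cell`, `Router`, `WrongCell` and the `moved` hint table are all names I made up for illustration.

```python
from dataclasses import dataclass, field


class WrongCell(Exception):
    """Raised by a cell that does not own the account it was asked to charge."""

    def __init__(self, owner_hint):
        super().__init__(f"not the owner; try {owner_hint}")
        self.owner_hint = owner_hint


@dataclass
class Cell:
    name: str
    accounts: set = field(default_factory=set)  # accounts this cell currently owns
    moved: dict = field(default_factory=dict)   # account -> cell it was moved to

    def charge(self, account, amount):
        if account not in self.accounts:
            # Reject instead of guessing; hand back a hint if we have one.
            raise WrongCell(self.moved.get(account))
        return f"{self.name} charged {account} {amount}"


class Router:
    def __init__(self, cells, routes):
        self.cells = cells    # cell name -> Cell
        self.routes = routes  # account -> cell name (possibly stale)

    def charge(self, account, amount, max_hops=3):
        target = self.routes[account]
        for _ in range(max_hops):
            try:
                result = self.cells[target].charge(account, amount)
            except WrongCell as err:
                if err.owner_hint is None:
                    raise  # no hint: report the failure upstream
                target = err.owner_hint
                continue
            self.routes[account] = target  # repair the stale routing entry
            return result
        raise RuntimeError("gave up after too many redirects")


# The router still believes acct-1 lives in cell-a, but it was moved to cell-b.
cell_a = Cell("cell-a", moved={"acct-1": "cell-b"})
cell_b = Cell("cell-b", accounts={"acct-1"})
router = Router({"cell-a": cell_a, "cell-b": cell_b}, {"acct-1": "cell-a"})
print(router.charge("acct-1", 100))  # cell-b charged acct-1 100
```

Note this only covers the good case. It does nothing for a cell that wrongly believes it still owns the account, which is the stale-state problem raised above; that would need something extra, such as an ownership epoch or lease that the cell checks before accepting.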
Generally with a credit card, or many banking systems more generally, because they predate computers, it's possible that a charge might be accepted even if there's no knowledge whether the money is in the account. As long as the person who was supposed to have paid is identifiable, the money is taken from their account anyway in the end, and if they don't have it, they get sued and their wages garnished, and if they also don't have wages, that's a small enough percentage of people that it's part of the cost of doing business.
Amex is gaining popularity for acceptance
Do they still charge ridiculously high fees to merchants?
Some of it sounds like it reinvented Erlang supervision trees https://learnyousomeerlang.com/supervisors. As a joke there, we were calling gen_servers “nanoservices”. Granted, that was mostly when microservices were the hot new thing.
Immediately my first thought as well. I still keep coming back to Erlang/Elixir/OTP as the best possible choice all things considered. I should use it more.
Whole lot of nothing.
This isn't about payment technologies, it's not about isolating transactions, it's about scaling the middle layer. What's worse, it doesn't even explain what the middle layer does.
No info on how routing works, no info on data synchronization.
Folks just learn Kubernetes and write extremely abstract stuff.
I agree, it's confusing, and it seems like they are using the coined word "cell" to mean "container"; they really should say that instead of making up new words.
Making up new words is how you get to take ownership of existing concepts. This gets you promoted.
microservices / clusters / zones - really all of these are other "cell-based" architectures as well. there is absolutely no written rule that a microservice has to be just an API or a singular service; it can basically be an independent instance that is testable/usable/gives value on its own.
that said: still a nice write-up, learning about some of the architectural choices that AMEX makes is definitely insightful (and relevant/useful to what i am working on right now as well!)
All I can see is a giant single point of failure called the Global Transaction Router.
GLBs aren’t SPOFs. They are typically deployed around the world redundantly, often using Anycast IPs or using DNS geographic and failover records, and are stateless. Think AWS Global Accelerator and Route 53 as an example. The architecture diagram is a high level simplification.
I don't think the global transaction router is a GLB. Having dabbled in this for high traffic telemetry gathering infrastructure, I will hazard a guess and say the "router" isn't a GLB.
The router needs to be shard-aware. It needs to know what data is where based on the request coming in so that it can route accurately. A GLB is DNS. It cannot be shard-aware because all it knows is the FQDN being resolved.
It can be a "router" if all the router needs to know is to resolve to the nearest data center or the nearest CDN. But at that point I have to ask the question: why does one need a cell-based architecture, and why can't it just be geo-redundant active-active failover across regions?
In any case, the architecture itself isn't novel or new. It's documented here: https://docs.aws.amazon.com/wellarchitected/latest/reducing-.... It's the go-to model if you're running a cloud.
Active/active without sharding is not a horizontal scaling model, and the blast radius of a fault is wide.
One can have GLBs that do routing. So long as the tenant-to-cell routing tables are consistent, it works fine. And those mappings tend not to change frequently.
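For what a consistent tenant-to-cell mapping could look like, here is a minimal sketch. The cell names, the `PINNED` override table and `cell_for` are all hypothetical, and nothing in the article says Amex routes this way. The only real requirement is that every router instance computes the same answer for the same tenant.

```python
import hashlib

CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]  # hypothetical cell names

# Explicit pins win over the hash, e.g. for a tenant that was migrated by hand.
PINNED = {"tenant-big": "cell-3"}


def cell_for(tenant_id, cells=CELLS, pinned=PINNED):
    """Map a tenant to exactly one cell, deterministically."""
    if tenant_id in pinned:
        return pinned[tenant_id]
    # hashlib gives the same digest on every router instance; Python's
    # built-in hash() on strings is randomized per process and would not.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


print(cell_for("tenant-big"))  # cell-3
print(cell_for("tenant-42") == cell_for("tenant-42"))  # True
```

The catch with plain modulo hashing is that changing the number of cells remaps most tenants, so a real deployment would more likely keep an explicit table or use consistent hashing. That fits the point above that these mappings tend not to change often.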
GLBs absolutely can be a SPOF for certain kinds of administrative mistakes.
If you’re counting administrative mistakes (human error), anything can fail. Let’s not shift the goalposts.
That depends on your change control process
A large portion of DNS is outside of your control. You're relying on at least two third parties you have indirect relationships with in order to work. If you're outside of the standard TLDs you've got additional social factors that can control your resolution.
Granted. It works really well in practice. It should be noted we haven't actually had the world war the Internet was designed to survive. So we're not entirely clear on the semantics of operations in unusual and unexpected configurations. I would expect DNS to be the first shoe to drop there.
Isn't it reassuring the entire CC business hinges on a single proprietary appliance inside a cage at some DC? :)
yo, 2022 called and they want their buzzwords back:
https://news.ycombinator.com/item?id=32023863
https://wso2.com/engineering-platform/developer-platform/doc...
403 Forbidden
Because of the title I was expecting to read about doing payments with a distributed network, like a terrorist cell network, or something like Hawala. Not (as I infer from other comments) Amex using multiple independent systems.
I wonder how they ensure durability. Is it possible that a cell going down would roll back a payment after it has occurred? Or do they depend on a non-cell database?
I would assume nothing related to a given transaction crosses the cell boundary.
We use a cellular architecture to help constrain the blast radius of a modular monolith. Each one of our customers lives in exactly 1 cell. Any kind of cross-customer BI/reporting happens through a data warehouse.
Backing up would be hell
Backups in such a system are quite pointless; if losing 10 seconds of data means you lost 4000 transactions, then periodic backups are invalid, if not instantly then close to instantly.
The system I work on has such a property and the only real infra style approach is sync replication before responding to a caller and a delayed replica for delete/drop protections (say with a 2hr or more window).
You should also defend against this in your code (e.g. be able to replay from your initiating systems).
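A toy sketch of the two mechanisms described above: acknowledge the caller only after a synchronous replica has the write, and keep a second replica deliberately behind by a fixed window so that a bad delete/drop can be caught before it lands there. Everything here (`Primary`, `Replica`, `DelayedReplica`, timestamps passed in as plain seconds) is made up for illustration; real databases do this at the storage layer.

```python
class Replica:
    """Synchronous replica: stores the entry and acknowledges it."""

    def __init__(self):
        self.log = []

    def append(self, entry):
        self.log.append(entry)
        return True  # ack


class DelayedReplica:
    """Applies entries only once they are at least `delay` seconds old."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = []  # (timestamp, entry), in arrival order
        self.log = []

    def receive(self, ts, entry):
        self.pending.append((ts, entry))

    def apply_due(self, now):
        while self.pending and now - self.pending[0][0] >= self.delay:
            self.log.append(self.pending.pop(0)[1])


class Primary:
    def __init__(self, sync_replica, delayed_replica):
        self.log = []
        self.sync = sync_replica
        self.delayed = delayed_replica

    def write(self, entry, now):
        self.log.append(entry)
        if not self.sync.append(entry):
            self.log.pop()  # undo the local write; the caller gets an error
            raise RuntimeError("sync replica did not ack; refusing to confirm")
        self.delayed.receive(now, entry)  # would be asynchronous in practice
        return "ok"  # the caller sees success only after the sync ack


sync, delayed = Replica(), DelayedReplica(delay=7200)  # 2hr window
primary = Primary(sync, delayed)
primary.write("pay 100", now=0)
primary.write("DROP everything", now=3600)
delayed.apply_due(now=7200)
print(sync.log)     # ['pay 100', 'DROP everything']
print(delayed.log)  # ['pay 100']
```

At the two-hour mark the sync replica already has the bad statement, but the delayed one does not, which is the whole point of the window.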
Maybe? If you assume a cell can just disappear at a moment's notice, then I'm guessing you don't even try backing it up. Whatever goes into and out of the cell (request logs and results) gets backed up, and no doubt that's more complicated than a monolithic system, but it may not be so bad assuming the replay systems and global transaction router do their thing?
These things are always a clusterfsck compared to the mainframe deployments.
Ahahha so true man!
Some CICS regions, a DB2 and a couple of VSAMs and that's it.
As Reddit already pointed out, this is nothing novel.
“They reinvented Erlang OTP.” - Reddit
Don't know if Joe Armstrong ever said anything like it, but I would propose naming an Erlang/OTP analogue of Greenspun's tenth rule (the one about C projects containing ad-hoc, buggy implementations of Lisp) for him.
There is Virding's Rule:
Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.
To be fair Elixir shows you can just use the BEAM if you want. If you need these semantics at this level there's very few reasons not to go this route.
Still nice to have learning resources like this pass through HN even if it’s not novel.
This is service-oriented architecture, except more expensive and complicated.
Ah yes, the financial services company that runs a travel agency, allows me to book my hotel and rental car weeks in advance, registers a hold for incidentals for both the hotel and car when I check in, then blocks the card when I try to buy dinner that night in that same hotel due to fraud detection.
Last week it required me to take pictures of my face from multiple angles to regain membership privileges. I suspect this may be part Palantir data collection and part Peter Thiel dating service.
I would have nope'd out so hard if they asked for face pictures.
That's just normal for banks now
American Express tech is some of the worst in the world among big companies. All of the value in the company is just in the branding. They put some work into the mobile app and the website, but other than that, it's a facade.
A few years ago someone kept signing up for loads of bank accounts/credit cards in my name, with my address. I’m not sure what the point of it was. But while everyone else happily sent cards and stacks of welcome paperwork to me, Amex were the only one that contacted me and told me they’d detected something weird in the signup. They gave me some helpful advice to resolve that situation too.
I froze my credit with the 3 big credit agencies in the US years ago when someone attempted to open multiple Dell and other company accounts in my name. Easy enough to unfreeze for a temporary period of time when I need it.
Having worked at Amex and other huge banks, let me assure you that there's much worse than Amex. Amex's Fraud analytics team was good. Risk was good. Ben's team is good.
BoA reauthorized an auto-payment on a card even though the card had expired and the security code had not been updated. I would call that authorized fraud by Bank of America.
This is why I find it best to declare a card stolen right before expiration or after.
What are you basing that statement on? It has not been my personal experience.
Makes me a little nervous that a web page about resilience is failing to connect.
They run their payment systems on PS3??? Somebody bought into the marketing a bit much.
So you’re telling me these cells operate independently like distributed Ethereum nodes and L2s… got it.
Ethereum nodes are not independent, they are as interdependent as it's possible to be.
It's becoming a reliable heuristic that when somebody says "... got it", they probably didn't get it.