AWS’s US-East 1 continues to be the Achilles heel of the Internet.
And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of incidents where US-East-1 problems had broader impacts, which makes things far less redundant and resilient than AWS implies.
In fantasy magic dream land loads are distributed evenly across different cloud providers.
A single point of failure doesn't exist.
It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.
Healthcare in the US is affordable.
All types of magical stuff exist here.
But no. It's another day. AWS US-East 1 can take down most of the internet.
It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.
I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.
It’s also the most complex region for AWS themselves, as it’s the “control pad” for many of their global services.
Anecdotally (well, more second-hand: "I heard that..."), it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So, yeah... kinda hilarious how the multiple region / AZ thing is so plainly a façade, yet we all seem to just collectively believe in it as an article of faith in the Cloud Religion... or whatever.
One of the SRE tricks is to reserve your capacity so when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when the on-demand dries up.
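For the EC2 version of that trick, On-Demand Capacity Reservations are the usual mechanism. A minimal boto3 sketch; the region, AZ, instance type and count are placeholders, not recommendations:

    import boto3

    # Hold capacity in a specific AZ so it's still there when on-demand dries up.
    ec2 = boto3.client("ec2", region_name="us-east-2")

    resp = ec2.create_capacity_reservation(
        InstanceType="m5.2xlarge",
        InstancePlatform="Linux/UNIX",
        AvailabilityZone="us-east-2a",
        InstanceCount=20,
        InstanceMatchCriteria="open",  # matching instances count against it automatically
        EndDateType="unlimited",       # you keep paying until you cancel it
    )
    print(resp["CapacityReservation"]["CapacityReservationId"])

You pay for the reserved capacity whether you use it or not, which is exactly the "it's expensive" part.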
Thankfully their leadership has been leading the way in ethics since inception, so I am confident that no such shenanigans will ever take place. I may even bet on this.
> These things are dangerous. Someone who can take AWS down such as an employee can place a bet.
Imagine if the betting website itself shuts down because AWS is down. (half joking I suppose though)
> These bets aren’t as innocent as they seem because the bettors can often influence or change the outcome.
Overall I agree with your statement: these betting markets can also incentivize a lot of insider trading, and in the negative scenarios they give bettors an incentive to make the bad outcome happen so they can capitalize on it.
I thought cooling was pretty much pre-planned in any data center, and you simply don't install more stuff than you can cool?
So did some cooling equipment fail here or was there an external reason for the overheating? Or does Amazon overbook the cooling in their data centers?
This is almost definitely an issue of equipment failure.
Cooling in datacenters is, like everything else, both over- and under-provisioned.
It's overprovisioned in the sense that the big heat-exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these down for maintenance, and they have a relatively high failure rate compared to traditional DC components, with mechanical repairs that require specialized labor and long lead times. In a bigger facility it's not uncommon for cooling to be N+3 or more as N gets bigger, because you're effectively always servicing something, or have a unit down waiting for a blower assembly that needs to be literally made by a machinist with a lathe because the part doesn't exist anymore - but that's still cheaper than replacing the whole unit.
The systems are also under-provisioned in the sense that if all the compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity; you would also commonly overload things in the electrical and other paths too. Overbooking is just the nature of the industry.
In general, neither of these things poses a real problem: compute loads don't spike to 100% of capacity, when they do spike they don't stay there for terribly long, and nobody builds facilities on a knife-edge of cooling or power capacity.
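To put rough numbers on that over/under-provisioning point, here's a toy calculation; every figure is invented for illustration:

    import math

    avg_it_load_mw = 10.0   # typical aggregate draw of the hall
    max_it_load_mw = 16.0   # if every node pegged at 100%
    unit_mw = 5.0           # capacity of one cooling unit

    n = math.ceil(avg_it_load_mw * 2 / unit_mw)  # size for ~200% of average load
    installed = n + 2                            # N+2 for maintenance and failures

    for down in range(installed + 1):
        cap = (installed - down) * unit_mw
        print(f"{down} units down: {cap:4.1f} MW cooling | "
              f"covers average: {cap >= avg_it_load_mw} | "
              f"covers worst case: {cap >= max_it_load_mw}")

With those made-up numbers the hall survives four of six units failing as long as draw stays near average, but only two failures if everything is pegged at 100%. That gap is what the rest of this story turns on.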
The problem comes when you have the intersection of multiple events.
You designed your cooling system to handle 200% of average load which is great because you have lots of headroom for maintenance/outages.
Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).
The two adjacent cooling units are now working JUST A BIT harder to compensate and one of them also had a motor which was just slightly imbalanced or a fuse which was loose and warming up a bit and now with an increased duty cycle that thing which worked fine for years goes pop.
Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.
That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in a N+2 facility.
Still, not catastrophic because really you designed for 200% of average load.
The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.
Your load starts ramping up.
Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.
What happens next is the confluence of events which puts you in the news.
One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.
They spin up 10000 new VMs.
Normally, this is fine, you have the spare capacity.
But, remember, you planned for 200% of AVERAGE cooling capacity and this is not nodes which are busy but not terribly busy, these are nodes doing intense optimized number crunching work which means they draw max power and thus expel max waste heat.
Not only has your load in terms of aggregate number of machines spiked but their waste heat impact is also greater on average.
Boom, cascading failure, your cooling is now N-4.
Server fans start ramping up faster which consumes more power.
Your cooling is now N-5.
Alarms are blaring all over the place.
Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.
Your cooling is now N-6.
Your cooling is now N-7.
Your cooling is now 0.
Reminds me of when I did noogler training back in the day: one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner, and briefly conducted.
It's cold up here in the winter; sadly, the residual heat from even totally passive components like switchgear is enough to warm things up enough to attract them. 0.001% of 1 MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know they are warm even outdoors in winter.)
And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.
I'd expect someone like AWS to just throttle machines before overloading their cooling, because they probably can do that, while e.g. a data center that just rents the space can't really throttle their customers nicely.
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.
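FWIW, capping power at the host level is at least mechanically possible on Linux via the RAPL powercap interface. A rough sketch, assuming an Intel host that exposes intel-rapl in sysfs (whether Nitro hosts expose anything like this to operators is a different question):

    from pathlib import Path

    # Lower the long-term package power limit of CPU package 0.
    # The 80 W figure and the path are illustrative; needs root to write.
    pkg = Path("/sys/class/powercap/intel-rapl:0")

    max_uw = int((pkg / "constraint_0_max_power_uw").read_text())
    target_uw = min(80_000_000, max_uw)  # 80 W, clamped to what the package allows

    (pkg / "constraint_0_power_limit_uw").write_text(str(target_uw))
    (pkg / "enabled").write_text("1")
    print(f"package 0 long-term limit set to {target_uw / 1e6:.0f} W")

The customer-visible effect is exactly the problem the parent comment describes: the box silently gets slower.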
But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.
Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake no capacity for similar reasons. "No capacity" means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".
This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.
spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
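If that chain is real, the first link is at least observable from outside. A small boto3 sketch for watching spot prices in one AZ; the instance types and region are just examples:

    from datetime import datetime, timedelta, timezone
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Last hour of spot prices for a couple of instance types in one AZ.
    resp = ec2.describe_spot_price_history(
        InstanceTypes=["m5.large", "c5.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        AvailabilityZone="us-east-1a",
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )

    for p in sorted(resp["SpotPriceHistory"], key=lambda x: x["Timestamp"]):
        print(p["Timestamp"], p["InstanceType"], p["SpotPrice"])

A sudden jump across many instance types in one AZ is a decent early warning that capacity there is getting tight.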
It's harder and harder to throttle machines now that hardware segmentation capabilities effectively pass hardware components through "intact".
A decade ago it was trivial to just tell the hypervisor to reduce the cpu fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user visible.
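For the old-style setup the knob still exists. A sketch with the libvirt Python bindings on a plain KVM host (this is the classic oversubscribed case; it does nothing useful once devices and cores are passed through whole):

    import libvirt

    # Halve the CPU share of every running VM on this host via the CFS quota.
    conn = libvirt.open("qemu:///system")

    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        vcpus = dom.maxVcpus()
        period = 100_000                # CFS period in microseconds
        quota = (vcpus * period) // 2   # only half the vCPUs' worth of CPU time
        dom.setSchedulerParameters({"vcpu_period": period, "vcpu_quota": quota})
        print(f"{dom.name()}: capped at {vcpus / 2:.1f} cores worth of CPU")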
The cooling units don't fail just because they get to 100% duty cycle. That's pretty much "normal operation"; you just get... higher efficiency because the cooling side is warmer.
I would have thought that, with all the data centers being built, the parts for cooling systems would be standardized, with replacements available from Grainger immediately.
I once worked at a company that had a wealth of backups. A backup generator, backup batteries as the generator takes a few seconds to start, a contract for emergency fuel deliveries, a complete failover data centre full of hot standby hardware, 24/7 ops presence, UPSes on the ops PCs just in case, weekly checks that the generators start, quarterly checks by turning off the breakers to the data centre, and so on.
It wasn't until a real incident that we learned: (a) the system wasn't resilient to the utility power going on-off-on-off-on-off as each 'off' drained the batteries while the generator started, and each 'on' made the generator shut down again; (b) the ops PCs were on UPSes but their monitors weren't (C13 vs C5 power connector) and (c) the generator couldn't be refuelled while running.
Even if you've got backup systems and you test them - you can never be 100% sure.
The point of being "cloud native" is you build redundancy at higher levels. Instead of having extra pipes and wires, you have extra software that handles physical failures.
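A toy example of what "redundancy in software" tends to look like at the bottom of the stack: a client that walks a list of per-region endpoints until one answers. The endpoints here are hypothetical:

    import urllib.error
    import urllib.request

    # Hypothetical service deployed independently in several regions.
    ENDPOINTS = [
        "https://api.us-east-1.example.com/health",
        "https://api.us-west-2.example.com/health",
        "https://api.eu-west-1.example.com/health",
    ]

    def first_healthy(endpoints, timeout=2.0):
        """Return the first endpoint whose health check answers 200, else None."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, TimeoutError):
                continue  # region unreachable or slow, try the next one
        return None

    if __name__ == "__main__":
        print(first_healthy(ENDPOINTS) or "all regions down")

The hard part isn't this loop, it's making sure the thing behind each endpoint doesn't secretly depend on us-east-1 anyway.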
It's always East 1... Jokes aside I don't understand how often east-1 is taken down compared to other regions. Like it should be pretty similar to other regions architecture wise.
Isn't us-east-1 the "core" region and also the oldest? I'd imagine it has more load than the other regions and also more tech debt and architectural / engineering debt, because they had less experience when they built it.
Also iirc some services rely on east-1 as a single point of failure for configuration (like IAM or some S3 stuff?)
So Ashburn VA is a datacenter hub because the very first non-government Internet Exchange Point (IXP) anywhere in the world was there (https://en.wikipedia.org/wiki/MAE-East). Back in the 1990s something like half of all internet traffic worldwide hit MAE-East. That in turn made AWS put their first region there (us-east-1 preceded eu-west-1 by 2 years and us-west-1 by 3 years). Then, because there were lots of people who knew how to build DCs and lots of vendors who knew how to supply them, the Dulles Corridor became a major hub for lots of companies' datacenters. For AWS, because us-east-1 was the first, it's by far the most gnarly and weird, and a lot of control planes for other AWS services end up relying on it. Which is why it goes down more often than other regions, and when it does go down it makes national news, unlike, say, eu-south-2 in Spain.
But NoVA is basically the same sort of economic cluster that Paul Krugman won his Nobel Prize in Economics for studying, just for datacenters, not factories.
There's a great read about the whole area here: https://www.amazon.com/Internet-Alley-Technology-1945-2005-I...
Well said. I'll also add that with these networks, the sooner you can get traffic off your network the better, so there's a strong incentive to have your datacenter near these peering points. And since MAE-East was the first, it's been the largest: it's been snowballing the longest. AOL's HQ was here, Equinix built their peering point soon after MAE-East, etc.
As for AWS, I often see it repeated that the DCs are the oldest and therefore in disrepair. That's not true; many of the first ones have since been replaced. But there are services that are located here and only here.
But I'll also add, a lot of customers default to using US-East-1 without considering others, and too many deploy in only one AZ. Part of this is AWS's fault as their new services often launch in US-East-1 and West-2 first, so customers go to East-1 to get the new features first.
Speaking as one who was with AWS for 10 years as a TAM and Well-Architected contributor, I saw a lot of customers who didn't design with much resiliency in mind, and so they get adversely affected when east-1 has an issue (either regional or AZ). The other regions have their fair share of issues as well. It's not so much that east-1 necessarily fails more than the others; it's that it has so many AZs and so many workloads that people notice it more.
The underlying reason is more that by being on the US east coast you have about equal latency to customers on the US west coast and in Europe. That's a very large population covered from a single site.
If you're building a single datacenter site this is where you start building first.
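Back-of-envelope numbers for that claim. Great-circle distances are rough and real fiber paths are longer, so treat these as lower bounds:

    # Light in fiber travels at roughly c / 1.47.
    C_KM_PER_MS = 299_792.458 / 1000   # km per millisecond in vacuum
    FIBER_FACTOR = 1.47

    # Very rough great-circle distances from Northern Virginia, in km.
    routes = {
        "NoVA -> San Francisco": 3_900,
        "NoVA -> London": 5_900,
        "NoVA -> Frankfurt": 6_500,
    }

    for name, km in routes.items():
        rtt_ms = 2 * km * FIBER_FACTOR / C_KM_PER_MS
        print(f"{name}: ~{rtt_ms:.0f} ms round trip, best case")

Roughly 40 ms to the west coast and 60-65 ms to western Europe: the same ballpark, which is the point.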
Amusingly I've been part of two critical downtime heating incidents at two different datacenters: one was when Hosting.com's SOMA datacenter got so hot that they were using hoses on the roof to cool it down; and the second one was when Alibaba's Chai Wan datacenter got so hot everything running there went down, including the control plane. So I imagine the proximity to the ocean does not yield any additional advantage in terms of emergency heat sinking. You have x capacity to pump heat out and it doesn't matter if you're next to the sea or in the middle of Nebraska because your entire system needs to be built to be rated for some performance.
I had a class in my master's about data centers (HPC Infrastructures). The professor used some data centers somewhere in the middle of the USA, in an area with hot weather, as an example, and compared that with the ideal scenario (weather, power source, etc.).
One of the slides listed factors that influence the decision of where to build a data center, and several of the items involved finding a place with enough space and enough skilled people to work at the data center. He also commented that sometimes there is politics involved in choosing the place for the next data center.
Oceans have salt, and saltwater is worse for electronics than normal water. You also need sufficient water depth, otherwise it'll warm to surface temperature. It also needs to be price-competitive with traditional evaporative cooling.
Toronto is the textbook example of this working. It's on a freshwater lake that is deep relatively close to the shore, and the downtown has expensive real estate blocking traditional methods.
https://en.wikipedia.org/wiki/Deep_Lake_Water_Cooling_System
In a proper 2-loop cooling system, the primary loop (with direct electronics contact) and secondary loop (with seawater/external cooling source) are hydraulically isolated by a heat exchanger. The salt water or whatever never gets anywhere near the electronics.
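For a sense of scale, the secondary-loop flow is easy to ballpark from Q = m_dot * cp * deltaT. The numbers below are illustrative, not from any real facility:

    # How much seawater has to move to reject 10 MW with a 6 K temperature rise?
    q_watts = 10e6      # heat to reject
    cp = 4_000.0        # J/(kg*K), roughly right for seawater
    delta_t = 6.0       # allowed temperature rise across the heat exchanger, K
    rho = 1_025.0       # kg/m^3, seawater density

    mass_flow = q_watts / (cp * delta_t)   # kg/s
    vol_flow = mass_flow / rho             # m^3/s
    print(f"{mass_flow:.0f} kg/s, about {vol_flow * 1000:.0f} L/s of secondary-loop water")

Call it a bit over 400 litres per second for 10 MW, which is why the intakes (and everything that wants to live in them) are a big deal.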
Saltwater comes in through the air; just being near it corrodes everything. Both stainless steel and bronze are very expensive, and even if things were made of corrosion-proof materials, not everything can be, for strength reasons.
The problem is, it's still in contact with something, even if it's just the secondary loop. Saltwater is not just incredibly aggressive against metal; the major problem with using it for cooling is fouling. Fish, mussels, algae, debris: there are a lot of things that can clog up your entire setup.
Off the top of my head:
Ocean levels of salt in a water system are much more expensive to maintain (even the secondary loop).
Coastal land much more expensive.
If you go to a remote coastal site, you probably won't have as good access to power.
Coastal sites are usually exposed to more severe weather events.
Other fun unpredictable things, e.g. the Diablo Canyon nuclear facility has had issues with debris and jellyfish migrations blocking its saltwater cooling intake: https://www.nbcnews.com/news/world/diablo-canyon-nuclear-pla...
And oysters / mussels / clams / every other creature that starts small and turns calcium into brick finds your cooling system to be a delightful place to raise a family, especially in delicate heat exchangers with small easily blockable passages.
Lots of proposals to build them near Lake Michigan recently, but the residents of Wisconsin only want auto parts stores and paper mills. Data centers have been completely demonized; cities and counties are passing no-data-center laws even though it's the perfect place for them.
Paper mills need a lot of heat energy to run the processes. Data centres produce a lot of heat. Sounds like a good combination?
Cold water -> data centre cooling loop - > warm water -> paper mill with heat pumps to transform low-grade heat into the required temperatures -> profit
They are, sometimes. Google built this one in Finland in 2011 at the site of an old paper mill, which was already set up to draw water from the Baltic Sea (which isn't as salty as the Atlantic is, but still not fresh water): https://datacenters.google/locations/hamina-finland/
> Using a cooling system with seawater from the Bay of Finland and a new offsite heat recovery facility, our Hamina data centre is at the forefront of progressing our sustainability and energy-efficiency efforts.
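Coming back to the heat-pump step in that chain: the catch is the temperature lift. A quick Carnot-limit estimate, with illustrative temperatures (real paper processes often want steam hotter than this):

    # Upgrading ~35 C data-centre return water to ~90 C process heat with a heat pump.
    t_cold_c, t_hot_c = 35.0, 90.0
    t_cold, t_hot = t_cold_c + 273.15, t_hot_c + 273.15

    cop_carnot = t_hot / (t_hot - t_cold)  # theoretical ceiling
    cop_real = 0.5 * cop_carnot            # rule of thumb: real machines reach roughly half of Carnot

    print(f"Carnot COP: {cop_carnot:.1f}, realistic COP: ~{cop_real:.1f}")
    print(f"About {1 / cop_real:.2f} kWh of electricity per kWh of delivered process heat")

So the waste heat is genuinely useful, but it isn't free: somebody still pays for the electricity doing the lift.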
Once known for having super reliable services, this company is now (I've heard) scrambling to rehire some of the engineers it overconfidently "replaced" with AI.
When customers pay for cloud services, they expect them to be maintained by competent engineers.
edit: Not sure why the downvotes. If you fire the engineers that have been keeping your systems running reliably for years, what do you expect to happen?
This is correct... unless there is a specific requirement to be in that location for some kind of IXP or ultra low latency, I can't imagine putting mission-critical things in only that region.
But even then, the load balancer needs to run somewhere, which becomes a new single point of failure.
I'm sure someone smarter than me has figured this out.
The last Azure outage I heard about wasn't even on the HN front page.
> The cooling units don't fail just because they get to 100% duty cycle.
Some fail below 100% too. But this is the physical world; shit happens. The algorithm didn't know that fuse was loose: fine at 50% duty cycle, but high resistance and going to blow at 100%.
> Even if you've got backup systems and you test them - you can never be 100% sure.
Turtles all the way down.
At AWS scale even unlikely hardware events become more common I guess.
> AWS in 2025: The Stuff You Think You Know That’s Now Wrong
> us-east-1 is no longer a merrily burning dumpster fire of sadness and regret.
— https://www.lastweekinaws.com/blog/aws-in-2025-the-stuff-you...
Otherwise a good article!
AWS EC2 outage in use1-az4 (us-east-1)
https://news.ycombinator.com/item?id=48057294
Two-loop cycle with a heat exchanger to get rid of the heat.
> But there are services that are located here and only here.
Why is that? You would think company-ending events like IAM going poof due to it being dependent on us-east-1 would be a top priority to fix?