I think people underestimate how valuable these reports are, so I’m very glad such a detailed investigation was done here. Every major grid operator around the world is going to study this and make improvements to make sure this doesn’t happen on their grid.
In a lot of ways it’s like investigations into airplane crashes.
As someone who lived through the blackout, it was wild. It felt like going back to the pre-internet, pre-smartphone era. It was pretty cool, actually. The rumor mill spread so fast that within hours the official word on the street was that we were getting hacked by a foreign military, and people were joking that we had nothing of interest to conquer xD
I didn’t even know about it until the next day - totally off grid, with Starlink for internet access - and no mobile signal where we live to give it away either.
The fact that there is not a single root cause but several makes me instinctively think this is a good report, because it's not what the "bosses" (and politicians even less) like to hear.
Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.
It's like the Swiss cheese model, where every layer of a system has "holes" or vulnerabilities, and a major incident only occurs when the holes line up through all the layers.
"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."
472 pages. That's going to be a nice bit of reading this weekend. It is very nice to see such a comprehensive report as well as the fact that it was made public immediately.
https://en.wikipedia.org/wiki/Swiss_cheese_model
https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...
When it's everybody's fault it's nobody's fault.