How Complex Systems Fail
In light of the recent IT outage caused by CrowdStrike, and the ironic timing of my reading Cook’s work on human error in complex systems, I have decided to write up my own thoughts and research on this debacle.
2. Complex systems are heavily and successfully defended against failure
"The high consequences of failure lead over time to the construction of multiple layers of defense against failure."
My first question was: how did the build-test pipeline even let this happen? Surely any change would have been built and tested on the different Windows OS versions it supports so that it doesn’t crash. Right? After some digging in forums, it seems that it wasn’t a build error but faulty error handling during deserialization (of course). The current theory is that either CSAgent.sys failed to handle malformed files or they pushed the nulled C…32.sys.
After some more digging, it seems that CrowdStrike themselves pushed a malformed file, and everything blows up when CSAgent.sys reads it. The next questions are: why was the file malformed, and why was the error handling non-existent? For a kernel driver, I would assume there have to be some standards requiring error handling, because without it a bad input just crashes the OS… right? On top of that, was there no process for validating new channel files to ensure their data integrity?
My line of questioning here is meant to highlight point #2. Kernel and other drivers have been in development for decades. I’m sure (or at least hopeful) that people have worked out the best practices and common playbooks for building, testing, and linting them so that they don’t go kaboom and take down the world’s IT infrastructure.
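To make the kind of error handling I have in mind concrete, here is a minimal sketch in Python (not kernel code, and not CrowdStrike's actual file format, which I don't know). The layout, the `CHNL` signature, the version number, and every function name below are hypothetical. The point is only that a parser can check the signature, the declared sizes, and a checksum before trusting any byte, and reject the file in a controlled way instead of crashing.

```python
import struct
import zlib
from typing import List

# Hypothetical layout, invented for illustration only:
# 4-byte signature, 2-byte version, 2-byte entry count, 4-byte CRC32 of the payload.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sHHI")
ENTRY_SIZE = 8  # hypothetical fixed-size entries


class MalformedFile(Exception):
    """Raised when the input cannot be trusted, instead of crashing."""


def parse_channel_file(data: bytes) -> List[bytes]:
    # Never read a header that is not fully there.
    if len(data) < HEADER.size:
        raise MalformedFile("file shorter than header")
    magic, version, count, crc = HEADER.unpack_from(data, 0)
    # An all-zero ("nulled") file is rejected here.
    if magic != MAGIC:
        raise MalformedFile("bad signature")
    if version != 1:
        raise MalformedFile(f"unsupported version {version}")
    payload = data[HEADER.size:]
    # The declared entry count must match the bytes actually present.
    if len(payload) != count * ENTRY_SIZE:
        raise MalformedFile("entry count does not match payload length")
    # Integrity check: catches truncation and corruption in transit.
    if zlib.crc32(payload) != crc:
        raise MalformedFile("checksum mismatch")
    return [payload[i:i + ENTRY_SIZE] for i in range(0, len(payload), ENTRY_SIZE)]


def build_channel_file(entries: List[bytes]) -> bytes:
    # Publisher side: compute the checksum the reader will verify.
    payload = b"".join(entries)
    return HEADER.pack(MAGIC, 1, len(entries), zlib.crc32(payload)) + payload
```

With something like this, a caller can catch `MalformedFile` and keep running on the last known-good file. Whether the real driver could have done the equivalent is exactly what I'm asking.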
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
7. Post-accident attribution to a ‘root cause’ is fundamentally wrong
"Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident."
I think this is the best mindset going forward. Of course, some heads will roll given the nature of the incident. Still, I believe the higher-level processes and systems were more likely at fault. In other words, the software developer inherited the system’s defects rather than injected them. I won’t speculate further because the facts of the matter are not entirely clear yet.
Conclusion
- Write more tests and faithfully validate the states of your software.
- Edge cases are very real, and they often go uncaught.
- I’m interested in seeing how litigation goes for this.
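As a rough illustration of the first two points, here is a tiny mutation-based test in Python. Everything in it is hypothetical (the length-prefixed record format, `parse_record`, and the `fuzz` helper are invented for this sketch). The idea is to take one valid input, corrupt it in a few simple ways (flip a byte, truncate it, zero it out), and check that the parser only ever fails with its own controlled error.

```python
import random
import struct


class MalformedInput(Exception):
    """The only failure mode the parser is allowed to have."""


def parse_record(data: bytes) -> bytes:
    # Hypothetical format: 2-byte little-endian length, then the body.
    if len(data) < 2:
        raise MalformedInput("missing length prefix")
    (length,) = struct.unpack_from("<H", data, 0)
    body = data[2:]
    if len(body) != length:
        raise MalformedInput("declared length does not match body")
    return body


def fuzz(parser, valid_input: bytes, rounds: int = 1000, seed: int = 0) -> int:
    """Feed mutated copies of a valid input to the parser.

    Returns how many inputs were rejected with MalformedInput. Any other
    exception propagates, which is what should fail the test run.
    """
    rng = random.Random(seed)  # seeded so a failure is reproducible
    rejected = 0
    for _ in range(rounds):
        mutated = bytearray(valid_input)
        choice = rng.randrange(3)
        if choice == 0 and mutated:
            # Overwrite one byte with a random value.
            mutated[rng.randrange(len(mutated))] = rng.randrange(256)
        elif choice == 1:
            # Truncate at a random position.
            del mutated[rng.randrange(len(mutated) + 1):]
        else:
            # Replace the whole input with zeros of the same length.
            mutated = bytearray(len(mutated))
        try:
            parser(bytes(mutated))
        except MalformedInput:
            rejected += 1
    return rejected
```

Running `fuzz(parse_record, struct.pack("<H", 5) + b"hello")` exercises the edge cases that a handful of happy-path tests would never touch. A real setup would use a proper fuzzer, but even this much catches a parser that trusts its input.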
Citations
Allspaw, J., & Cook, R.I. (2010). How Complex Systems Fail. Web Operations.
Patriarca, R., Bergström, J., Di Gravio, G., & Costantino, F. (2018). Resilience engineering: Current status of the research and future challenges. Safety Science, 102, 79–100. https://doi.org/10.1016/j.ssci.2017.10.005