I like to read incident reports, accident investigations and (technical) post-mortems. Maybe a bit too much. But I feel like I’m getting a lot out of it, both for my professional work as a software development manager and my volunteer work as a lifeboat helm. Here’s why.
Software development is a pretty young profession compared to many other professions that are equally integral to our everyday life. It can have a big impact, and even though we have (officially?) moved away from the whole “move fast and break things” motto, it still very much breaks things – thankfully, mostly just itself.
Other industries are older and more regulated. That comes with some interesting lessons and side effects, which we can and should learn from.
Written in blood
Very tactile and visual industries to learn from are civil engineering, railroads, and aviation. If things go wrong here, even a layman usually has a pretty good idea of what could happen – be it a collapsing bridge, a derailing train, or a rapidly descending aircraft.
Regulations in these areas are written in blood, and I’m personally not sure I could stomach writing code that controls an airliner. If I had to, it would certainly be at a glacial pace.
Looking at these industries, I see one big commonality in their regulations. Of course, an aircraft hull cannot be built to the same requirements as a single-family house wall. But all of these industries have independent investigation bodies that examine the root causes of catastrophes – and ideally of near misses as well. These investigations result in a public report with concrete change requirements for the organizations involved, and often recommendations for improved regulations.
My favorite “incident-report-based” media
I usually don’t read the original reports myself. They are long, detailed, and technical in areas where I am not an expert. But there are experts out there who create blog posts, podcasts, or videos about the incidents with the help of those reports. They make them accessible to the wider public, including me, and I really appreciate that.
The Causality Podcast by John Chidgey looks at incidents, their root causes, and how they could be prevented, from a wide array of industries. If you only have time for one of my recommendations, this one is probably the way to go.
John is an electrical, instrumentation, and control systems engineer as well as a software developer, and he feels comfortable covering incidents from chemical plants to aircraft, from offshore oil platforms to roller coasters – and that’s just what I can remember off the top of my head. The types of incidents vary widely, but he usually covers both the technical measures that could have prevented the event and, maybe most importantly, the organizational failures that allowed for or even provoked “human errors”. When I was young and studying, I often thought I was being smart and logical by focusing on the technical aspects. After working for some time, though, I understood more and more how much “soft” factors like organizational culture, incentives, and structure play a role in nearly every “technical” decision, at least in the background. This podcast is definitely aware of that.
But despite all the organizational factors that might contribute to a wrong engineering decision or a fatal oversight, in the end it is the engineer or developer who decided to implement and “sign off” on it. And so many episodes end with a plea for engineers to push back on these decisions, even if it might seem bad for business in the short term. This professional self-respect and acknowledgment of your responsibility as an engineer is something that we in software development still need to embrace more.
Causality is the podcast in which I find myself skipping back most often, sometimes repeatedly, because you really need to concentrate to keep up with the technical information presented via audio. An example of this would be the episode on the Hyatt Regency walkway collapse – even though the crucial image is also in the chapter art.
A few episodes have especially stuck in my mind.
Videos: Mentour Pilot
The Mentour Pilot YouTube channel covers aviation accidents and near misses. Presented by commercial airline pilot Petter Hörnfeldt, these videos are based on the often excellent aviation incident investigation reports. While the actual flights are re-enacted as closely as possible in a flight simulator, the root causes behind the incident are explained. As in Causality, the root causes are more often than not a combination of different factors (cf. the “Swiss cheese model”). Technical decisions by the aeronautical engineers, organizational shortcomings, poor training of ground crew or pilots, or operator fatigue – seldom is it just a single cause. The excellent investigation reports reflect this by listing dozens of recommendations for aircraft manufacturers, air traffic controllers, airlines, and regulatory bodies.
Compared to Causality, Mentour Pilot naturally looks at these incidents a bit more from the operator’s perspective.
I often find myself fascinated not by the technical solutions, but by the focus on “Crew Resource Management” (CRM). CRM is a set of techniques and procedures – and, I think, also a state of mind – that tries to protect against “human errors” by taking psychological and sociological research into account. Concrete examples of good CRM would be to
- encourage anyone, no matter their standing in a social hierarchy or position, to speak up when they feel uneasy about a situation or think they observed a pilot error
- keep conversation topics light during the “non-sterile” phases of the flight
- communicate clearly when executing certain actions or taking over control
- manage workload proactively
In my view, CRM shines at the tactical level. In software development, this could be during the incident handling of an outage or even during the investigation of an active security incident. Working at an IT security company, I can definitely see how this applies to our SOC and our incident response teams.
I also took a lot away from this for my volunteer work as a lifeboat helm – especially in bad weather, you want all available crew resources focused on the task at hand, and you want everyone to feel comfortable speaking up immediately when they see something. At the same time, you need to make time-critical decisions and communicate them.
Further shout-outs
I also want to give a shout-out to some one-off talks and other media based on incident investigation reports.
(Not just) for the train enthusiasts, there is Max S’s blog based on train accident investigation reports. With an impressive weekly cadence, he describes train accidents and their root causes in detail. As he covers accidents from 1882 to 2021, you really get to appreciate modern safety standards and technology.
“Who Destroyed Three Mile Island?” by Nickolas Means, given at The Lead Developer 2018 conference in Austin, is an interesting talk. It also covers how to look at an incident after the fact to really understand it, and in the end – spoiler alert – it covers CRM as well (without naming it).
Another recommendation would be Brick Immortar. I mainly watch them for their maritime incident videos, based on the U.S. NTSB (National Transportation Safety Board) or Coast Guard investigation reports.
Keep an open mind. Of course, you can become a better software developer by reading blogs about programming – but you shouldn’t stop there. There are opportunities everywhere to learn something you can apply to your professional work. And it makes for more interesting conversations.