UndoIsImportant

September 04, 2003
This op-ed piece by Henry Petroski regarding the ’03 shuttle disaster has a salient point for software design:
Much of design is thus defensive...: containing, shielding and fending off anticipated problems ... so that they cannot bring down the design when it flies. Obviously, total success can only come if every possible mode of failure is identified and defended against.

I like this statement, yet I'm somewhat anti-anticipation. I have entries about SoftwareAppreciation that detail some of the crazy ways software fails and try to give a good overall picture of how complicated even simple software is. I'm a fan of AgileDevelopment, which is built on the premise that complete anticipation is impossible. Thinking of everything that could go wrong is very difficult, and even if it were possible, the resources needed to specifically protect against everything that could go wrong make the task pragmatically impossible.

Which brings me to the importance of undo.

My current software baby is a timecard processing system that's been successfully running in production for a few years. If Dr. Petroski is correct, we must have defended against every possible mode of failure, right? Well, sort of.

We've run across many problems with this software over the years. Some quite normal, some very weird. Probably none of them were anticipated. But what's saved us in every case is the ability to start over.

If at any point along the line a timecard gets stuck because of an invalid piece of data that was not anticipated, we can analyze the data, find what's wrong, delete the timecard and start over. What's important is that we can do this at *any* point along the data pipeline, not just during initial data entry. Unfortunately, due to inherent complexities in the existing system, the data pipeline is rather long and has a few right-angle joints along the way, but we were careful to put rollbacks throughout the system.
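To make the idea concrete, here is a minimal sketch in Python. It is not the actual timecard system; the `Pipeline` class, the two stages, and the `store` dictionary are all made up for illustration. The assumption is that every stage is paired with an undo that is safe to call even if the stage only got partway through.

```python
class Pipeline:
    """Hypothetical pipeline: every stage is paired with its own undo."""

    def __init__(self, stages):
        # stages: list of (name, apply_fn, undo_fn) tuples, in pipeline order
        self.stages = stages
        self.started = {}  # timecard id -> names of stages that were started

    def run(self, card_id, data, store):
        started = self.started.setdefault(card_id, [])
        for name, apply_fn, _undo_fn in self.stages:
            # Record the stage *before* running it, so a stage that fails
            # halfway through is still rolled back later.
            started.append(name)
            apply_fn(card_id, data, store)

    def start_over(self, card_id, store):
        """Delete everything the timecard left behind, newest stage first."""
        undo_by_name = {name: undo_fn for name, _apply_fn, undo_fn in self.stages}
        for name in reversed(self.started.pop(card_id, [])):
            undo_by_name[name](card_id, store)


# Two illustrative stages; a real pipeline would have many more.
def enter(card_id, data, store):
    store["entered"][card_id] = data


def undo_enter(card_id, store):
    store["entered"].pop(card_id, None)  # no-op if nothing was written


def compute_pay(card_id, data, store):
    # float() raises ValueError on data nobody anticipated, such as "forty".
    store["pay"][card_id] = float(data["hours"]) * data["rate"]


def undo_compute_pay(card_id, store):
    store["pay"].pop(card_id, None)


STAGES = [
    ("enter", enter, undo_enter),
    ("compute_pay", compute_pay, undo_compute_pay),
]
```

If `run` blows up partway through, calling `start_over` for that timecard removes whatever the earlier stages wrote, and the corrected timecard can be fed back in from the top. Nothing in the sketch knows what went wrong; it only knows how to get back to a clean slate.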

We defended against all failures not by anticipating each one individually, but by allowing for a way to start over. Any failure can be deleted (so far :)

Obviously, as indicated by the subject matter of the initial quote in this article, starting over is not an option in some systems. But if the option exists, as it does in many software applications, it is an option that can effectively defend against the unknown.

see also RedoIsImportant


tags: ComputersAndTechnology