Eppur Si Muove

 
 
I'm going to mention a principle that is part of the long-term construction of software systems.  This principle centers on the following contradiction: fixing bugs doesn't make your software any better.  Well, that's a strange thing to say.  Of course every bug fix makes my software better.  That's one less bug hiding within the constantly changing innards of my system.

Well, maybe I should phrase it like this.  Every bug one fixes makes one's software better, but only when the bug is fixed in such a way that it provides the answer to the question, "What do I have to do to make sure this never happens again?"  This is the supervising question behind all of the child bug-fix questions that I will list below.  But before I get to that list, I'll tell a few stories about applying the supervising question.

Many years ago, I inherited the batch and reporting system of a growing financial institution.  Important, but not the most glamorous work.  Think telephone calls at 1:00 or 2:00 in the morning to the effect of "xxx batch job failed, I don't see any error messages, but the whole batch system is down and I can't get it restarted."  Those calls happened, and the problems had to be solved that night for the company to be up and processing the next day.  And then there were the nightly failures of the non-showstopper jobs and reports.  Every day I would come to work and spend the whole morning tracking down the cause of failures from the list of broken jobs that could "wait till morning".

In this kind of situation it's a good thing to keep an issue notebook and write in it every problem that occurs.  What are the symptoms?  What are the error messages?  How did I solve the issue to get the processing done?  What did I try?  What do I think the root cause is?  A record is nice to have if one is tracking problems over time.  Sometimes problems are related, or they will recur in other contexts.
Your notebook will ultimately help you over time as you add the final resolution to each entry by answering the question: what do I have to do to make sure this never happens again?  I can pretty much guarantee that if you have a system which averages 4-5 failures per day, you will have a system which averages 1 failure per month after a year.  That's not perfect, but it's about a two-orders-of-magnitude improvement, and that's not bad either.

Once you are at that point you can get past just thinking about system failures, because system failures are just brutish things that hit one over the head and say "fix me".  "What do I have to do to make sure this never happens again?" demands a different attitude to constructing systems.

Here is another example.  Shortly before I left the above-mentioned financial organization, we had a batch emergency.  It seems the batch operator killed the invoicing process at 4:00 in the morning.  He couldn't see any activity in the logs, and as far as he could tell the process was hung.  He called the admin on call, and Frank said "kill it".  A bad call?  Well, it meant that when the system came up the next morning we couldn't tell our partners how many millions of dollars they had to send us that day.  It also meant that we had no idea what financial data had been processed and what data hadn't.  Was the db in an inconsistent state?  The problems turned out to be tractable.  The program hadn't hung; it was just a heavy processing day, and after tracing each thread through the logs we figured out where we stood.

That was a bad morning for Frank, I think.  I don't know what transpired in the meetings that morning, but I remember Frank coming up to me in the afternoon, heavily affected, and saying, "I'm really sorry about screwing up batch."

"You know, Frank, it's not your fault.  The fault is in the design.  This program ran in such a way that its state of execution was indistinguishable from a program that was hung.
Not only that, but batch is so complicated that no one knows what jobs do what, which jobs can be killed safely, and which jobs can't.  The only people who can figure it out are the engineers who have to work with it all the time, and not even we can make decisions with confidence.  The management of the software group has known all this and has decided to allocate resources elsewhere.  This is a failure of the organization.  You did the best you could have done, and that's fine."

My point here is that when people fail while operating complex systems, it's the system's fault.  It's the designers' fault.  And they have to ask the question.  You know the question?

I said before that "What do I have to do to make sure this never happens again?" is the supervising question of system improvement.  But sometimes it helps to have a bit more guidance that leads one to the supervising question.  So here are some derivative questions that I have listed in the little notebook that I usually carry with me.  Here we go:

Can this type of problem be automated away?
What could have been done to catch this error right away?
Does this error belong to a class of errors?  What is that class, and how can I uncover its members?  Remove them?  Prevent them?
How could this bug have been caught automatically?
How did this bug get into the system, and how could we eliminate that entry route?
What is the root cause of this error?
Could I write a test to prevent this bug from being reintroduced?  Its variants?
Could I introduce a new type that would block errors like this from being introduced?
Is this caused by a design weakness?  What is a design that would be fail-safe?
Can I eliminate this error by eliminating human interaction with the program?

Lastly, learn from other people's failures.  Amazon S3 went down worldwide for eight hours a week ago.  It seems the ultimate cause was minor corruption in messages sent from one part of the system, which eventually cascaded throughout the whole system.
I thought of this this week when I put together a tar file for deployment later this month and added an md5 checksum file to the distribution.  That tar file has to be sent from Shanghai to Austin on deployment day, and the software is mission critical, as they say, so, yeah, we should check it once it reaches its destination, just to be sure.  What's the chance of minor corruption?  Even so, one should get into good habits; one day good habits will be good to have.  So one more question should be added to the list here:

How would I prevent something like that from happening to me?

Well, that's a lot of questions.  One can't always do everything perfectly, and I guess fixing everything perfectly is a tall order.  But those are the basics as I see them.  It's only one question: what do I have to do to make sure that never happens again?  Do your best.
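
For what it's worth, the checksum habit mentioned above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions, not the script actually used for that deployment: the function names and the file name release.tar are hypothetical, and the checksum file simply uses the "digest, two spaces, file name" layout that the md5sum tool prints.  MD5 is enough to catch accidental corruption in transit; it is not a defense against deliberate tampering, and hashlib.sha256 could be swapped in the same way.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(archive: Path) -> Path:
    """Sending side: record '<digest>  <file name>' next to the archive (hypothetical helper)."""
    checksum_path = archive.with_name(archive.name + ".md5")
    checksum_path.write_text(f"{md5_of_file(archive)}  {archive.name}\n")
    return checksum_path


def verify_checksum_file(checksum_path: Path) -> bool:
    """Receiving side: recompute the digest and compare it with the recorded one."""
    expected, name = checksum_path.read_text().split(maxsplit=1)
    # The archive is assumed to sit in the same directory as its checksum file.
    archive = checksum_path.with_name(name.strip())
    return md5_of_file(archive) == expected.lower()
```

The sending side would call write_checksum_file(Path("release.tar")) before shipping both files, and the receiving side would refuse to deploy unless verify_checksum_file(Path("release.tar.md5")) returns True.  That turns "we should check just to be sure" into a step that cannot be forgotten on deployment day.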