October 16, 2010
I had a bit of a health scare this week and a trip to A+E (ER). All’s OK now, but the trip made me realise how similar the “bug fixing” the great doctors and nurses were attempting on me is to the way a good engineer addresses a problem. Most of these concepts work in any field of engineering, but I’m going to focus on IT Operations specifically.
#1 Symptoms and Cause
It’s important to remember the difference between symptoms and cause. Treating backache with painkillers will be useful in the short term, but you’ve got to identify what’s causing the pain: posture, your desk chair, etc.
Understanding the root cause of the issue should be your ultimate goal. In the short term, though, treating the symptoms might be the best way to get your system back up and running quickly.
Both trend monitoring and threshold monitoring are amazingly important when it comes to identifying and resolving issues. This is why patients are so often hooked up to pulse, ECG, blood pressure monitors and why key readings are recorded regularly.
In engineering, perhaps the CPU usage of the server you’re working on looks high. Is it normally this high? Is it trending up or down?
Be sure to use tools like Cacti, Ganglia or Nagios, and graph everything that’s service or business critical. This could include technical data like CPU usage, connection counts and cache hit rates, as well as business data like user logins, registrations and eCommerce basket value. I’d argue that having a little too much data is far better than having too little.
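To make the difference between threshold and trend monitoring concrete, here is a minimal Python sketch. It is not how Cacti, Ganglia or Nagios work internally; the function names, the 80% limit and the CPU readings are all made up for illustration.

```python
from statistics import mean


def exceeds_threshold(samples, limit):
    """Threshold check: is the most recent reading above the limit?"""
    return samples[-1] > limit


def trend_slope(samples):
    """Trend check: least-squares slope of the readings, per sample interval."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical CPU usage readings (percent), one per polling interval.
cpu_percent = [35, 38, 41, 47, 52, 58]

print("Over threshold:", exceeds_threshold(cpu_percent, 80))
print("Change per interval:", round(trend_slope(cpu_percent), 2))
```

With these readings the threshold check stays quiet, yet the slope shows usage climbing steadily, which is exactly the kind of problem a threshold alone would miss until it is too late.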
When you’re presented with multiple problems, you’ve got to identify which of them is most critical. Allow users to assign a priority, or assign one yourself in triage. Perhaps use a defect matrix to set the priority according to how many users are affected, whether it’s on a production site, and whether there’s a workaround or not.
This way you treat the most business critical problems first and not the ones that are most interesting!
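A defect matrix like the one described above could be sketched in Python as follows. The weights, the user-count bands and the example defects are assumptions chosen for illustration; tune them to whatever your business actually cares about.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    title: str
    users_affected: int
    on_production: bool
    has_workaround: bool


def priority_score(defect):
    """Higher score means fix it sooner. Weights are illustrative only."""
    if defect.users_affected >= 100:
        score = 3
    elif defect.users_affected >= 10:
        score = 2
    else:
        score = 1
    if defect.on_production:
        score += 3
    if not defect.has_workaround:
        score += 2
    return score


def triage(defects):
    """Return the defects ordered from most to least urgent."""
    return sorted(defects, key=priority_score, reverse=True)


# Hypothetical queue of reported problems.
queue = [
    Defect("Typo on staging banner", 2, False, True),
    Defect("Checkout fails", 500, True, False),
    Defect("Slow report export", 20, True, True),
]

for defect in triage(queue):
    print(priority_score(defect), defect.title)
```

Scoring from a fixed set of questions keeps the ordering consistent from one engineer to the next, and makes it harder to justify jumping on the interesting bug first.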
#4 Case history
Doctors will talk with you about when the problem first started and ask related questions which might help with their diagnosis. Good bug reports are often critical for you to be able to fully understand and replicate the bug. It’s important that the reporter of the bug either understands this through training or is forced to give detailed info by the reporting process. PHP’s Report a Bug page is a reasonably good example of the latter.
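Forcing detailed info in the reporting process can be as simple as refusing a report until the key fields are filled in. The sketch below is a hypothetical example in Python; the field names are my own and are not taken from PHP’s bug tracker or any other real system.

```python
# Fields a reporter must fill in before a report is accepted (illustrative).
REQUIRED_FIELDS = (
    "summary",
    "steps_to_reproduce",
    "expected_result",
    "actual_result",
    "environment",
)


def missing_fields(report):
    """Return the required fields that are absent or blank in the report."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(report.get(field, "")).strip()
    ]


# Hypothetical report with too little detail to reproduce the bug.
report = {
    "summary": "Login page shows an error",
    "steps_to_reproduce": "   ",
    "actual_result": "HTTP 500",
}

gaps = missing_fields(report)
if gaps:
    print("Please complete:", ", ".join(gaps))
else:
    print("Report accepted")
```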
If you can, keep some kind of history of changes and problems relating to a device or system; it can be really valuable. A well searchable bug/ticketing system is somewhere close to self-documenting, and I’d strongly recommend version control of all server configuration files.
If you’re getting nowhere with diagnosing a problem, get a second opinion. If, after gaining a second opinion, you’re no closer to identifying the problem, then it could be worth the second engineer going through the same diagnostic steps that you did rather than just taking your word for it. Sometimes a second set of eyes will spot something subtle that was easy to miss.
Happy bug fixing!