Problem Management – Finding the Root Cause
Posted September 10th, 2008 by Dennis Powell
If you run out of gas on the way to work, you’ve discovered a problem that occurred because of poor planning. Perhaps you should have gotten gas yesterday when you had time. Maybe you should have gotten up on time this morning instead of hitting the snooze button, or spent the extra money to fix the gas gauge when you had the chance last month. Although you’ve discovered the problems, you’re no closer to a solution, are you?
If you’re an ITIL adopter in charge of Incident and Problem Management, you’re probably way ahead of me in regard to this situation. You begin to solve problems when you get to the root cause. And root cause is usually found by evaluating the most immediate, lowest-level incidents that cause a problem. For example, the incident that needs to be resolved when you run out of gas is just that – your gas tank has no gas in it. Locate gas, buy gas, put gas in your tank, and drive away. You’ve discovered and resolved the root cause of your problem. Now, you may still want to get that gas gauge fixed and ignore that snooze alarm in the future, but fixing those won’t get you on your way today.
Jay Long of Forsythe Solutions Group presented the preceding analogy on Tuesday, September 9th, at the itSMF Fusion 08 Conference as a lead-in to his session, “A Deep Dive into Problem Management Troubleshooting Techniques.” Jay focused on different methods, from the simple to the complex, for identifying the root cause of problems. Judging from the reaction of the audience, driving to root cause in Problem (and Incident) Management as defined by ITIL v3 remains a challenge, and Jay’s advice was welcomed.
Jay offered no magic wands to wave over problem management. Driving to root cause requires an understanding of the problem area (the technology, systems, users, processes, infrastructure…), requires people to collaborate without concern about finger-pointing, and above all relies on good ol’ fashioned roll-up-the-sleeves brainstorming.
Among the techniques offered to determine root cause, my personal favorite (as a father of three) is the Five Whys. If you have a child of kindergarten age, you’ll appreciate this approach, or at least recognize it. Groups responsible for finding problem root cause employ the Five Whys approach in the following fashion:
- Define the problem that has occurred.
  - Jay gave the example of a server outage that occurred when a new application was deployed
- Brainstorm the causes of the problem
- List every possible problem cause and ask why it occurred
  - For example:
    - Why did the server lock up?
      - A memory leak occurred in the application
    - Why did the memory leak occur?
      - Because the OS wasn’t patched
    - Why wasn’t the OS patched?
      - Because Operations didn’t know about the patch
    - Why didn’t Ops know about the patch?
      - Because there is no patch management system
    - Why is there no patch management system?
      - Because no one has responsibility for developing one
It would have been tempting for the group to stop after the first “Why?” was answered. “Oh, it was a memory leak. Fine, fix it and let’s get back to work.” But the risk of future outages caused by the lack of proper patch notification would remain.
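For teams that like to record this kind of analysis, the chain from Jay’s server example can be sketched as a small Python structure. This is only an illustration, not something Jay presented: the names `Why`, `root_cause`, and `render` are hypothetical, and the sketch assumes a single linear chain in which the last answer is the deepest cause found so far.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a Five Whys chain: a question and the answer the group agreed on."""
    question: str
    answer: str


def root_cause(chain):
    """Return the answer to the last "why", i.e. the deepest cause found so far."""
    if not chain:
        raise ValueError("the chain has no steps yet")
    return chain[-1].answer


def render(chain):
    """Format the chain as indented text, one level deeper per "why"."""
    lines = []
    for depth, step in enumerate(chain, start=1):
        indent = "  " * (depth - 1)
        lines.append(f"{indent}{depth}. {step.question}")
        lines.append(f"{indent}   -> {step.answer}")
    return "\n".join(lines)


# The server outage example from the session, transcribed as data.
chain = [
    Why("Why did the server lock up?",
        "A memory leak occurred in the application"),
    Why("Why did the memory leak occur?",
        "The OS wasn't patched"),
    Why("Why wasn't the OS patched?",
        "Operations didn't know about the patch"),
    Why("Why didn't Ops know about the patch?",
        "There is no patch management system"),
    Why("Why is there no patch management system?",
        "No one has responsibility for developing one"),
]

print(render(chain))
print("Root cause:", root_cause(chain))
```

Stopping after the first step would make `root_cause` report the memory leak; only the full chain surfaces the missing ownership of patch management.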
Jay also introduced the audience to other techniques, including structured and unstructured brainstorming, Kepner-Tregoe analysis, Ishikawa (or fishbone) dispersion analysis, and fault tree analysis. Each offers benefits and risks depending on the problem type and the organization. As noted above, none of these techniques will magically find problem root causes on their own, but proper application of each can help you reduce the time you spend detecting a problem’s root cause and increase the time available to resolve problems once and for all.
Filed Under: Business Continuity, Downtime, ITIL