The Toyota 5 Whys
I was reading the book “The Lean Startup” the other day and learned about a problem-solving technique called “the Toyota 5 whys”. You can google it for the detailed description. Basically, when you are trying to find the root cause of a problem, don’t settle for the first cause you identify. Instead, repeat the process to see if you can find the cause of the cause. The theory is that if you iterate this process five times, you’ll find the true root cause that you should really fix.
Here’s an example taken from Wikipedia:
The vehicle will not start. (the problem).
1. Why? – The battery is dead. (first why)
2. Why? – The alternator belt is not functioning. (second why)
3. Why? – The alternator belt has broken. (third why)
4. Why? – The alternator belt was well beyond its useful service life and not replaced (fourth why)
5. Why? – The vehicle was not maintained according to the recommended service schedule. (fifth why, a root cause)
6. Why? – Replacement parts are not available because of the extreme age of the vehicle (sixth why, optional footnote)
Therefore, instead of just replacing the battery, what should be done is maintenance of the vehicle according to the maintenance schedule. If you just replace the battery, it will die pretty soon again since the alternator belt is still broken.
I realized that this technique is very useful for fixing software defects as well, and that I had actually been doing it subconsciously. Here are a couple of real-life examples.
The first came up some time ago, when QA assigned me a bug about a NullPointerException and asked me to add a null check before dereferencing the variable. Sounds easy, right? Not so fast. Looking closer at the code where the NullPointerException was thrown, I saw two problems that a null check wouldn’t fix. The first was that the code block encrypts some sensitive data. With just a null check added, the code would skip the encryption, which is probably not what we want. The second was that the value is one of the predefined Enum constants, which shouldn’t be null to begin with. Tracing back through the call stack, I found that a new constant had been added to the Enum class, but the code that translates a string value into an Enum constant hadn’t been updated to handle it. So the translation code returns null whenever the string maps to the new Enum value. That’s the root cause of the NPE, and that’s what we should fix. If we just add a null check around the code where the NPE is thrown, more than likely we will end up with another NPE, because the null value is still there and will probably be used somewhere else. Using the 5 whys approach, the process looks something like this:
The program throws a NullPointerException (the problem)
1. Why? – Because the code dereferences an Enum variable whose value is null.
2. Why (is the value null)? – Because the code that translates a string into an Enum returns null for the given input.
3. Why? – Because a new constant was added to the Enum class but the translation code hasn’t been updated.
There are only three whys here rather than five, but you get the idea.
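To make the root-cause fix a bit more concrete, here is a minimal sketch of one way the translation code could be written so that a newly added constant can't be forgotten. This is not the actual code from the bug: the `PaymentType` enum, its string codes, and the `fromCode` method are all made up for illustration. The idea is to build the lookup from `values()` and to throw on an unknown string instead of returning null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum standing in for the one in the bug report.
enum PaymentType {
    CREDIT("credit"),
    DEBIT("debit"),
    VOUCHER("voucher"); // pretend this is the newly added constant

    private final String code;

    // The lookup table is built from values(), so every constant,
    // including any added later, is covered automatically.
    private static final Map<String, PaymentType> BY_CODE = new HashMap<>();
    static {
        for (PaymentType type : values()) {
            BY_CODE.put(type.code, type);
        }
    }

    PaymentType(String code) {
        this.code = code;
    }

    // Translates a string into a constant. An unknown string fails
    // right here instead of leaking a null to some distant caller.
    static PaymentType fromCode(String code) {
        PaymentType type = BY_CODE.get(code);
        if (type == null) {
            throw new IllegalArgumentException("Unknown payment type: " + code);
        }
        return type;
    }
}
```

With something like this, `PaymentType.fromCode("voucher")` returns `VOUCHER` without anyone having to remember to update the translation, and a truly unknown string surfaces as an `IllegalArgumentException` at the point of translation rather than as an NPE far away in the encryption code.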
In another case, I was working on a production bug. The operations team had noticed sustained 100% CPU usage on the API servers and had done the initial triage using Java thread dumps. They found that the issue was caused by multiple threads accessing a HashMap, a static member of a class, for both reads and writes. HashMap is not thread safe, so the recommendation was to either add synchronization or use ConcurrentHashMap instead. But locking/synchronization is the root of many scalability problems, so I wasn’t quite satisfied with that solution. First, I wasn’t convinced that a concurrency issue could lead to a CPU spin; more often such issues show up as thread liveness problems (deadlocks, for example). But the operations people showed me some links which point out that the implementation of HashMap DOES have a busy while loop in it (though I didn’t find one when I checked the OpenJDK 1.6 source code). Then I wondered what the HashMap was used for. After reading the code, I found that it was used in a JMS listener to track the number of times a particular message has been delivered. If the count goes over a predefined threshold, the message is ignored. So it’s essentially an in-house implementation of a duplicate message detection mechanism. Unfortunately, this implementation is broken. It only works in the case of a single JVM because it uses a static HashMap to store the count. Our deployment has multiple application servers, and the code can’t count the messages that are sent to other JVMs. My follow-up question was: why did we roll our own implementation in the first place? Doesn’t ActiveMQ, our message queue system, already have duplicate detection built in? It turned out the real fix was to remove the HashMap code completely and let ActiveMQ handle the duplicate detection. Again, applying the 5 whys technique, the process looks something like this:
API server has 100% CPU usage (problem)
1. Why? – Because the JMS listener code uses a HashMap, which isn’t thread safe.
2. Why (use a HashMap)? – To track the number of times a message is delivered and detect duplicates.
3. Why? – Because the developer wasn’t aware that ActiveMQ has this support built in.
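For comparison, here is roughly what the “just use ConcurrentHashMap” fix that stops at the first why might have looked like. This is a minimal sketch under assumptions, not the real listener code: the `DeliveryCounter` class, the `shouldProcess` method, and the threshold of 3 are all made up, and `merge` requires Java 8 or later.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the delivery-counting logic in the JMS listener.
class DeliveryCounter {
    // Assumed threshold; the real value isn't known.
    private static final int MAX_DELIVERIES = 3;

    // The original used a static HashMap; an instance field is used here
    // only to keep the sketch easy to test.
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

    // Atomically increments the delivery count for a message ID and
    // returns true while the count is still within the threshold.
    boolean shouldProcess(String messageId) {
        int seen = counts.merge(messageId, 1, Integer::sum);
        return seen <= MAX_DELIVERIES;
    }
}
```

This version is safe to call from multiple threads, so it would likely have made the CPU symptom go away. But the map still lives in a single JVM, so it counts only the deliveries that this particular server happens to see, which is exactly the deeper problem the second and third whys uncovered.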
As you can see from the examples, jumping to a solution too quickly leads to a premature fix that does nothing but hide the real problem. The real problem will probably just resurface in a different form until it’s fixed. So the next time you think you’ve found the fix, stop and think for a moment. Ask yourself (or whoever is appropriate) a few more whys to see if there’s some deeper issue lurking around.
Not everyone believes in this method though. Check out the criticism section of the 5 Whys article on Wikipedia, which I mostly agree with. But that’s not to say there’s no value in this method. People have a tendency to stop as soon as they think they’ve solved the problem. Having this 5 whys mentality helps you overcome that tendency and get closer to the true root cause of a problem.
Last but not least, I have to admit that I haven’t applied the 5 Whys methodology to its true purpose, which is to identify a process failure, or the lack of a process. As the saying goes, “people do not fail, processes do”. So in the first example, we could keep asking whys, and the real fix may be that our QA or unit test process should be improved to catch the problem. In the second example, maybe our code review process is lacking. These are bigger issues that require a lot more collaboration and team effort. It’s a work in progress as we march toward the continuous delivery model.