Troubleshooting With Battleships

Troubleshooting, synonymous with (complex) problem solving, is something that all of us do, and for some of us, multiple times each day.  It could be we need to find the root cause of a fault in some infrastructure or software, help a Customer get something done, discover why our (very expensive) satellite isn’t doing what it should, fix a car at the roadside, find the cause (or the type) of illness in a patient, identify why a team is performing particularly well (or not), track down why people aren’t using a new process, why there is water where there shouldn’t be, or just about anything else.

Given that problems are common, it’s no surprise that the World Economic Forum’s “Future of Jobs” report lists “Complex Problem Solving” as the #1 skill people need: https://www.weforum.org/agenda/2016/01/the-10-skills-you-need-to-thrive-in-the-fourth-industrial-revolution/.  Indeed, if you concentrate on this skill when hiring*, and bring in someone who you know can solve problems effectively, I can confidently predict that they will prove useful anywhere in your company.

[*I strongly recommend you test all candidates for how they think about and approach problem solving, regardless of role.  Ideally get them to do an exercise or ask hypothetical questions that encourage them to show you how they think]

Despite how often problems arise and how necessary problem-solving skill is, many people rely on potentially unreliable, unconscious habits that they’ve developed over their lifetimes.  Very few people I have met can describe the process they use to troubleshoot, and most have never been taught a repeatable method of troubleshooting.

I’m going to describe four unreliable troubleshooting habits I regularly see and then some suggestions that will help you avoid them.

Habit #1: (Playing) Battleships

This is the habit of trying things in a seemingly random way until an answer is found (i.e. trial and error).

Some things you might see:

  • Frequent attempts at fixes being made without success
  • Fix attempts unexpectedly causing further or different issues “in the blast radius”
  • Prioritisation of action above all else (rushing to do something, anything! whether wrong or right)
  • People taking conflicting actions in pursuit of a fix
  • People repeating actions that may have already been taken, with the same result

Some things you might hear:

  • Let’s try changing this!
  • Let’s restart it and see what it’s like after that!
  • I’ve tried replacing component X but it didn’t work, so now going to try Y
  • I’m going to restart components in this order to find the fault
  • Just do it! It might work / worth a try

Results:

Results vary wildly.  People do get lucky and find the cause or fix quickly, but often recovery is delayed.  Worse, charging in to apply fixes could well cause further problems, and even once we do find a fix, there’s a good chance that records of the actions taken and of the problem itself will be lost.  If the problem recurs, we won’t know why our fix worked.

Habit #2: Surfing the Availability Heuristic

Actions are determined by recent or common knowledge regardless of current facts.

Some things you might see:

  • Attempts at fixes being made without success
  • (Stubborn) attempts to link the current issue with another recent problem
  • Recovery procedures being followed that don’t fix the issue
  • Blame! (Based on knowledge that other teams are making changes)

Some things you might hear:

  • We’ve had this exact same issue before. I know what to do!
  • That Customer also uses technology X, must be related
  • Team A are doing the upgrade this week / Vendor X has been releasing patches. It’s probably that
  • I’ve read that component X is unreliable, must be it

Results:

The results are like Battleships in many ways, except that there is some perceived science in this approach.  Indeed, it could be that the issue *is* the same as last time, and if so we’ve got a fix very quickly.  Otherwise, we have the same risks as Battleships: fixes causing further problems and poor record keeping.

NB for further reading about the availability heuristic, see https://en.wikipedia.org/wiki/Availability_heuristic

Habit #3: Technology Superheroes

Relying on one or a few technology experts – superheroes – to solve issues and come through for the team.  Of course, many people have been through technical courses and have learned technologies in detail, along with tools that tell them a problem exists, but they have perhaps learned less about how to find what’s causing it.

Some things you might see:

  • Limited record and shared understanding of problems that do occur (concentrated knowledge in few)
  • Delays caused by unavailability of superhero
  • Other people fearful of getting involved in the troubleshooting
  • Limited collaboration
  • Heroes put on pedestals

Some things you might hear:

  • Has our superhero been informed there’s an issue?
  • We’re waiting for our superhero to be available to fix this!
  • Oh no! Our superhero is on holiday today…

Results:

Assuming our superhero is available, recovery time may be good, and given they hold concentrated knowledge of the technology, they may be precise.  However, these statements are loaded with risk, and in my experience sharing of information about issues/actions is at its worst in this habit.

Habit #4: It’s up to the vendor!

Reliance on vendor engagement/escalation to solve issues in technology at the expense of any local understanding at all.  This can also apply to resolver groups within organisations.

Some things you might see:

  • Little or no recovery activity in advance of instruction by a vendor
  • Delays caused by the support agreement (vendor support not available until time X)
  • Communications calling out a vendor as both cause and saviour
  • Scripted actions that are unnecessary and waste time
  • Fixes don’t work, some trial and error

Some things you might hear:

  • It’s always the same with technology X!
  • I’ve escalated the problem to the vendor, and we’re waiting on a reply
  • The vendor is waiting for their engineering team
  • I’ve sent the logs and waiting on a reply
  • It’s their issue, not mine to fix

Results:

Again, we have variability.  If the answer to the specific error is in the vendor knowledge base, we’ll be back up very quickly, but when it’s not, troubleshooting typically follows a script.  That script – and wider troubleshooting – makes assumptions about your environment that can lead to unnecessary actions and failed fixes.  Data records about the issue are likely to be good but held within the vendor’s system, so we don’t improve.

So, what could we be doing instead?

If these habits, along with a measure of luck, aren’t reliable, what is the alternative that achieves more consistent and less wasteful results?

  1. Take a facts-based approach.

When an incident occurs, focus on gathering and recording what you know.  What is happening?  How do you know there’s a problem?  When did it start?  Is it impacting everything or sub components only? What is the impact that it is having?  What don’t we know, but would really help us?  This will help you to isolate the issue and limit the scope you are working with.

Once you have some facts, use that information to guide your hand.  If you had an idea of what’s causing the problem already, or you want to apply a fix, would it make sense given what you know?  If it doesn’t, why not?

Apply your fix and maintain a detailed record of the whole incident.  If the fix works, brilliant! (And then look to see if there are other things that could be prone to that issue.)  If it doesn’t, look at other causes, or develop your facts based on what you now know.
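
One way to make this concrete is to keep the facts, the unknowns and the actions taken in a single structured record, and to check a proposed cause against the facts before acting on it.  The sketch below is only an illustration of that idea, not a prescribed tool: the `IncidentRecord` class, its fields and the example incident are all hypothetical names and data that I have made up for the purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical record of what we know, what we don't, and what we did."""

    summary: str
    facts: dict[str, str] = field(default_factory=dict)      # question -> answer
    unknowns: list[str] = field(default_factory=list)        # open questions
    actions: list[tuple[datetime, str, str]] = field(default_factory=list)

    def add_fact(self, question: str, answer: str) -> None:
        """Record an observed fact as the answer to a question."""
        self.facts[question] = answer

    def contradictions(self, expectations: dict[str, str]) -> list[str]:
        """Return the questions where a proposed cause expects a different
        answer from the one we have recorded. Unrecorded questions are ignored."""
        return [
            question
            for question, expected in expectations.items()
            if question in self.facts and self.facts[question] != expected
        ]

    def log_action(self, action: str, result: str) -> None:
        """Keep a timestamped record of every action and its result."""
        self.actions.append((datetime.now(timezone.utc), action, result))


# Illustrative data only.
incident = IncidentRecord("Checkout page returns errors")
incident.add_fact("When did it start?", "09:40 today")
incident.add_fact("Is everything affected?", "no, only card payments")
incident.unknowns.append("Was anything changed shortly before 09:40?")

# Proposed cause: a full database outage, which would affect everything.
database_outage = {"Is everything affected?": "yes"}
print(incident.contradictions(database_outage))  # ['Is everything affected?']

incident.log_action("Restarted payment gateway", "no change")
```

If a proposed cause contradicts a recorded fact, that is the prompt to ask “why not?” before applying the fix, and the action log means nobody has to repeat a step that has already been tried.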

  2. Practice and teach troubleshooting!

Whether through courses, or staged/chaos-monkey scenarios, give your teams opportunities to practice and to reflect on their results (how they solved the issue, rather than the solution itself).  If people go to Battleships mode, ask / show them how they could do it differently.  If they gather facts before action, congratulate them.

  3. Identify and support your superheroes

Identify individuals that you rely on to fix specific technology issues, and then act to back them up with other people.  This could mean training colleagues, or pairing them with the superhero, to develop knowledge in others.  If the cause is a technical one (for example, the superhero is the only person with skills on that system), look at ways to use more widely understood alternative technologies.

  4. Make troubleshooting a core skill and use a common approach

To get a meaningful improvement in consistency, MTTR or a related service measure, the change must go deeper: training in, and integration of, consistent troubleshooting methods, processes and skills.  This is not an easy thing to do, but there are organisations that provide training and coaching to help this kind of programme get started and stick.  Talk to them and see how they can help you.
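
If you want to track whether such a programme is helping, you need a baseline.  As a rough sketch, and assuming MTTR here means “mean time to recovery” measured from the start of an incident to the point service is restored (organisations define it differently), the measure and its spread could be calculated as below.  The function names and the sample incidents are invented for illustration.

```python
from datetime import datetime, timedelta
from statistics import pstdev

Incident = tuple[datetime, datetime]  # (started, recovered)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average of (recovered - started) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def recovery_spread_minutes(incidents: list[Incident]) -> float:
    """Population standard deviation of recovery times, in minutes.
    A smaller spread suggests more consistent troubleshooting."""
    if not incidents:
        raise ValueError("need at least one incident")
    minutes = [(end - start).total_seconds() / 60 for start, end in incidents]
    return pstdev(minutes)


# Illustrative data only: one 30-minute and one 90-minute incident.
sample = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
print(mean_time_to_recovery(sample))    # 1:00:00
print(recovery_spread_minutes(sample))  # 30.0
```

The spread matters as much as the mean: the habits described above can produce a respectable average while individual incidents still vary wildly.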
