Someone I was talking to recently quoted Stephen R. Covey, and it made me think about my experience of measurement in teams and businesses: “If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster.”
Measurements of people, teams and businesses are a critical ingredient of improvement and a component of success, but it’s possible that our measurements are inadvertently and unexpectedly promoting the wrong behaviours.
Let’s start with a straightforward example of contradictory measures. Imagine a Service Desk team that are measured on two KPIs: First-Time-Fix and Number of Incidents handled. What does our engineer do if the phone rings while they are troubleshooting an existing issue? No doubt the best they can with the information they have, but it’s not going to be a fair fight. The measures might even be perceived as intentionally misleading, a way to maintain control, because no matter where the engineer focuses, they are failing at something.
A more subtle example. Imagine a tech-ops team responsible for, amongst other things, the up-time of a platform. Normal measures we’d see here are:
- Systems Availability (uptime) attainment
- Incident resolution time (e.g. mean time to resolve/MTTR, mean time to diagnose/MTTD)
- A reduction in (or threshold for) repeat incidents
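To make those measures concrete, here is a minimal Python sketch of how they might be calculated from a list of incident records. The `Incident` record, the function names and the idea that a "repeat" means the same system failing for the same cause are all assumptions for illustration, not a standard definition. It also assumes outages never overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, for illustration only.
    system: str
    cause: str          # used to spot repeat incidents
    started: datetime   # when service was lost
    resolved: datetime  # when service was restored


def availability_pct(incidents, period: timedelta) -> float:
    """Percentage of the period without an outage (assumes no overlapping outages)."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 100.0 * (1 - downtime / period)


def mean_time_to_resolve(incidents) -> timedelta:
    """Average outage duration; zero when there were no incidents."""
    if not incidents:
        return timedelta()
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def repeat_incidents(incidents) -> int:
    """Count incidents whose (system, cause) pair has already been seen."""
    seen = set()
    repeats = 0
    for i in incidents:
        key = (i.system, i.cause)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats


# Made-up sample data: two outages of the same kind in a 30-day month.
period = timedelta(days=30)
start = datetime(2024, 1, 1)
incidents = [
    Incident("platform", "disk full", start, start + timedelta(minutes=30)),
    Incident("platform", "disk full", start + timedelta(days=10),
             start + timedelta(days=10, minutes=60)),
]

print(f"Availability: {availability_pct(incidents, period):.3f}%")
print(f"MTTR: {mean_time_to_resolve(incidents)}")
print(f"Repeat incidents: {repeat_incidents(incidents)}")
```

Notice that all three numbers are calculated after the fact from things that have already gone wrong.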
On the surface, this is clear: priority 1 = up-time. If there’s a failure, then priority 2 = fix time. But is our ladder leaning against the right wall? To answer that we need to look at the behaviours we could promote.
Situation 1. We hit 100% availability. Awesome! Customers are happy and teams have had some sleep this month.
What do the team learn or do as a result of this?
Depending in part on how the target was achieved, what’s the impetus in the following period to prevent issues? How quickly will attention move to the next fire or project deliverable, inevitably risking the reliability of this system? How readily can people fall into the trap of (unconsciously) marking that system ‘done’ and ‘solid’? How likely is it that a leader will wonder why prevention and maintenance are necessary when service is OK?
Situation 2. We (just) hit our availability SLA. Well, that’s OK, because according to our contracted services we’ve delivered. No doubt teams have worked hard to bring back service when it did fail, and we’re grateful.
What do the team learn or do as a result of this?
Like Situation 1, our team’s attention could wander from the availability KPI (we’ve achieved it, so what’s left to do?). We also recovered quickly, so it’s fairly likely we patted some people on the back and congratulated them for their efforts. If that’s the case, what could engineers learn? That fixing issues is more highly prized than preventing them? Could we show others in the company that heroes coming to the rescue are lauded? If overtime or bonuses are paid to the team for their work, is it a good or bad month when there aren’t any issues?
Situation 3. We miss our availability SLA. There’s no real winner here. Customers are unhappy, leaders and individuals are under pressure, people are likely tired and stressed.
What do the team learn or do as a result of this?
No doubt we’ve got some disheartened people moving around, alongside some drive to do something about the result. We have the same risk as in Situation 2: that we celebrate the heroes despite these results, and teach people that being a hero at midnight is a cool, rather than a solemn, thing to do. We might also see some panic and knee-jerk action: “Let’s roll back all the changes!”, “Buy more infrastructure now!”, or even the targeting of individuals.
What to do instead (or at least, as well)
We need something that helps people to maintain focus on the outcome, regardless of the result.
Perhaps here we could choose something like “# critical incidents prevented”. If the team prevents a high number of incidents, that’s great (and if they act to improve systems, noteworthy). If the team don’t prevent any critical incidents, it might just inspire a question of how effective the visibility is, especially knowing that no system component is 100% available.
Two points to note:
- If these measures are incentivised/top-down, and not used as team-led initiatives, there is a risk that the data could be gamed (just like measuring velocity between teams).
- Be careful to avoid measures of activity, for example ‘number of hours’, ‘number of checks’ or ‘number of lines of code’. These can be useful for changing some behaviours (normally by force), but the result doesn’t correlate with the outcome.
If it helps, think of the measure you need as a thermostat, not a thermometer. Thermometers provide status data but don’t intrinsically drive a behaviour. A thermostat measure is a leading indicator that should, by its very existence, encourage creativity and thought to deliver that result.
How to do it (a suggestion)
Ask these questions (the answers for uptime are in brackets):
- What, specifically, is the outcome we want? (High availability)
- What do your customers say about that outcome? (They expect this of us)
- How do we want teams to behave? (Proactive, prevention, rapid response)
- What is a measure that demonstrates achievement of the outcome? (% Uptime)
- How could people behave if the result is good? (motivated and able to take new work on, avoid/deprioritise system maintenance, become complacent, prioritise other work)
- How could people behave if the result is bad? (Disconnected and low, ‘Hero-mode’, knee-jerk action)
- Evaluating the results of good / bad, is it likely that you’ll get the behaviour you want to support your outcome? (No)
- What is a measure that focuses attention on the outcome, regardless of achievement? (# critical incidents prevented)
- How could people behave if the result is good? (autonomous system design changes, monitoring changes)
- How could people behave if the result is bad? (health-checks, autonomous system design changes, monitoring changes)
- Comparing the results of good / bad, is it likely that you’ll get the behaviour you want to support your outcome? (Yes)
Is your ladder leaning against the right wall?