Failure modes revisited

I ran a retrospective lately, to look at when something went wrong, rather than just looking at a sprint and asking what was good or bad.

In doing so, I went though how something had failed, but I realised that I was drilling into what had happened, not what could have happened. This is OK, since we get to the most recent cause, but it risks uncovering only the most recent cause, not the common causes that are creating multiple issues.

So I changed tactics a little and broke our workshop into two parts

Our response strategy – how we react and mitigate issues like this
Our prevention strategy – the controls and practices that help us prevent future issues

In doing so I realised that I was almost implementing “FMEA- Failure Mode Effect Analysis”.

If you have ever used FMEA, Poka-yoke, 9-questions or other approaches, you know that they can be used for risk management (what could go wrong?), User design (how might people do this differently?) or post mortems (what caused this?).

I reflected on how my workshop could have run, using FMEA to look at the causes and potential responses.

I find my pretend workshops are always more efficient and awesome than when I run them with real people facing real challenges, so I will use a pretend issue to illustrate the workshop flow.

Step 1 – Set the scene

State our purpose

Obviously being a good facilitator I clearly outlined what we wanted to achieve.

“We are here for the next hour to create a joint understanding of the “meeting room incident” last Wednesday. We will share what happened and then, if we have time, we will also cover how to precent similar crises in the future.

Agree on the problem

It seems obvious, but if we don’t agree on what we are analysing, then we can get lost in the conversation.

So I explained the incident and checked for people’s understanding.

“On Wednesday at 3pm I went to use the whiteboard in a workshop. I found 2 green pens, 1 red pen and no blue or black whiteboard markers. When I tried the red marker it was dry and useless so I put it back. This was repeated for a green pen.” Frustrated, I went to the back of the room and grabbed a black marker.

I wrote on the whiteboard and then after a while tried to erase my writing. Unfortunately this was when I discovered the black marker was a permanent one and could not be erased. Fortunately that is when Martha grabbed a bottle of champagne, opened it and sprayed it over the whiteboard. We were then able to wash the marker off with paper towels and then continue. It was expensive champagne but it was great that we were able to continue with the workshop”

OK – you might use a better problem statement, but at least mine is a description of what happened. There are several failure modes here, so we could easily get lost in any of them.

For those not familiar with a failure mode – it is a mode or a way something could fail. For example my marker might fail because it is the wrong colour, it has no ink, it is permanent and not fit for purpose, it leaks on my hand.

Each “way the pen could fail” can have multiple causes and multiple solutions. So in safety critical situations it is worth investigating all of them. For my meeting crisis, though, there might be diminishing returns in assessing the potential failure of the pens, the whiteboard, the cleaning approach, the workshop etc.

Step 2 – Ananalyse our response

Ultimately, we want to prevent these problems, but while people are focused on that they might miss some lessons in how we respond to incidents like this.

On the other hand if we can refine our ability to respond, we can pre-program our response with checklists, emergency stationery teams or other solutions before the next crisis occurs.

Of course the conclusion we want to come to is:

How can we detect things earlier?
How can we make the response easier and more predictable (since people make mistakes when stressed)?
What clean up should we do in order to get back to normal?
Should we communicate anything, educate anyone, track anything for a while?

My pretend workshop

In our workshop I asked

When did we notice something was wrong? How long before we noticed?
What did we first think was happening?
What did we do?
- What impeded our progress?
- What led to breakthroughs

Our conclusions were

We need to clean up a little to get rid of the bottle and the napkins.

In addition we can
- Change our process to throw out non-working pens, as well as coming to the workshop 5 minutes early to test the stationery and zoom connections etc
- maybe have a spare black and blue marker,
- Identify permanent markers better,
- Create instructions about how to clean the board with water or whiteboard markers (which erase permanent ones) so we don’t resort to expensive beverages

We don’t need to record the time to resolution (4 minutes) or the time to detection (8 minutes or writing, but the risk had been there for days)

In the real world

In the real world, what we want to learn is:

How long did it take to detect the problem (to track mean time to detection)?
How long until we restored normal services (to track mean time to resolution)?
How was the problem detected?
What was done?
What was the outcome – was it resolved? Did we leave an unfinished issue or task when we responded?
How did we communicate what was happening?
How bad was it really this time?
What would have made it worse?
What might have helped reduce the impact?
What can we do to identify an issue like this earlier?
How can we make it safer and easier for those impacted or responding?
- What is the worst thing someone could do?
- What is the best thing someone could do – and what has to be in place for them to do that?

Step 3 – Analyse the cause

Improving our detection of pen-related incidents is great, as is improving our response strategy is great.

Ultimately though, we want to remove the scourge of red/green empty pens being the only pens available (if that is the big issue).

So we ask the normal questions:

What happened (bad pen incident)?
What might have caused it?
- And what else contributed to it?
- And what else might be a cause?
- So we are sure that is the cause – let’s say we removed that cause for ever, but this happened again- what could the new cause have been?
How often is this likely to happen?
- Some people say how likely, but I prefer how often.
For some of the causes – what might cause them? This is known as the 5-whys.

Then we just go through the questions backwards to see if we can remove the risks/problems

How could we reduce the frequency of this kind of thing happening?
What could we do to tackle to root causes?
What can we do to disconnect the cause from the effect?

Pretend workshop

We discovered that

How can we make it less frequent?
- Check the meeting rooms monthly and update the marker supply
How can we tackle the root causes?
- We could not buy red and green markers, or we could put an end date on them and then replace them automatically
How can we disconnect the cause from the effect
- We can carry our own supply of stationery
- We can use paper instead, or use giant ipads

Step 4 – challenge our logic and assumptions

I find that we often skim the surface or jump to conclusions. For example in our pretend incident:

I saw Mark walking along with a black marker AND a blue marker and he was heading away from the room. I am pretty sure he took our good markers. But there is no evidence of this and, since this is not the first marker-disaster, there are obviously other causes to analyse
Everyone is too polite to mention that Martha chose really expensive champagne to fixt the problem, but several of us can think of better solutions

To do this the best approach (with the cause) is to put them up on a whiteboard (using “dry-erase” markers). We can then draw lines between the alleged cause and alleged effects:

If this happened (permanent mark) then this was the a cause of it (having a permanent marker in the room).
Was that the only cause – is is sufficient?
Did it really cause the effect? Was it inevitable that having a permanent marker within 5 metres of a whiteboard means that it will be used?

Step 4 Move to action

OK – we did a great job discovering the issue and moving from outrage to understanding.

However, nothing will change unless we take some action and that is an issue because

Often this process generates a lot of possible actions; and
Often the suggested improvements are pretty big, pretty obscure or unlikely to happen soon (“improve testing”, or “Certify and equip all staff in whiteboard cleaning protocols”).

So the final part of the pretend workshop is to decide IF we are going to take action or just accept the risk of this happening again. This is the “get over it” strategy.

If we take action to either prepare to respond to the next incident or, ideally, to tackle the root causes and share the learnings.

I like to break the actions into

What we will do to improve detection and response, as well as reduce the blast impact on customers; and
What we do to improve our practices, internal controls, collaboration, etc to prevent issues happening again.

Assuming there is a residual risk (ie something could still go wrong) then we might want to educate, create checklists or prepare a response in advance. Or we might just trust the team will cope. Either way it should be a deliberate choice.

Assuming some of the root causes can be addressed, we need to plan how and when to do that, or we need to admit what we are not going to do.

James King