Risks in IT Operations – some analysis tools

Not long ago an airline (Virgin Blue) had a complete meltdown of its ticketing system, and just last week a major Australian bank (NAB) had a full-on disaster in which, for several days, it couldn’t pay people the money it owed them.

Both were apparently due to failures in IT operations (IT systems and IT processes) and both must have been dreadful for those involved.

I sometimes run risk management workshops and I have a number of approaches to help your team assess and manage the risks they face. 

At the most basic level we create a risk register in line with Australian and international risk standards.  This seems straightforward, yet many project and IT managers miss some of the risk basics, like triggers (the warning signs that a risk is materialising or becoming more likely).
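If it helps to picture it, here is a minimal sketch of a risk register that treats triggers as a first-class field. The field names and the two sample risks are my own assumptions for illustration, not taken from any standard, so adapt them to whatever template you already use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RiskEntry:
    """One row of a hypothetical risk register (field names are assumptions)."""
    description: str
    likelihood: str  # e.g. "low", "medium", "high"
    impact: str      # e.g. "low", "medium", "high"
    triggers: List[str] = field(default_factory=list)  # early warning signs
    owner: str = ""


def missing_triggers(register):
    """Return the descriptions of risks that have no trigger recorded."""
    return [risk.description for risk in register if not risk.triggers]


# Made-up sample entries, purely for illustration.
register = [
    RiskEntry("Nightly backup fails", "medium", "high",
              triggers=["Backup job runs much longer than usual"]),
    RiskEntry("Software release breaks ticketing", "low", "high"),
]

for description in missing_triggers(register):
    print("No trigger recorded for:", description)
```

Running a check like this over the register is a quick way to spot the risks where nobody has yet agreed what the warning sign would be.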

Less often, but just as valuable, I work with managers on how to better communicate risks, both to manage them while they are still risks and to prepare a communication plan for dealing more effectively with stakeholders during a crisis.

Most often I run workshops on how to identify and analyse the risks the team faces.

So I would love to come and help you better understand, communicate and manage your risks, particularly over January and February, when I traditionally have a light workload.

But if you want to do it yourself, here are some of my favourite tools.  Rather than fully explain them I thought I would quickly show you some and then attach a long case study from a university assignment I did a long time ago.  If you are mildly interested, then enjoy the diagrams and if you are more interested then grab a coffee or tea and then read through the case study.  I hope you will find that the tools are quite useful on their own or as a set.

Step one in risk management is to understand the context within which risks exist.  I don’t have any examples here, but always start with a general understanding.

Then, identify what could go wrong.  I often use FMEA (Failure Mode and Effect Analysis) for processes – it sounds like fun but it is really just a set of questions (What could go wrong? What would cause that to happen? How/when would we find out? How could we find out sooner? What would we do? How could we avoid it?).
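One low-tech way to use those questions is as a worksheet per process step. The sketch below is only an illustration: the function names and the example step are assumptions of mine, and it covers just the question-asking side of FMEA as described here.

```python
# The six questions listed above, asked once per process step.
FMEA_QUESTIONS = [
    "What could go wrong?",
    "What would cause that to happen?",
    "How/when would we find out?",
    "How could we find out sooner?",
    "What would we do?",
    "How could we avoid it?",
]


def blank_worksheet(process_step):
    """Create an empty worksheet for one process step (hypothetical helper)."""
    return {"step": process_step,
            "answers": {question: "" for question in FMEA_QUESTIONS}}


def unanswered(worksheet):
    """Return the questions the team has not answered yet."""
    return [question for question, answer in worksheet["answers"].items()
            if not answer.strip()]


# Made-up example: the team has only got as far as the first question.
worksheet = blank_worksheet("Deploy weekly software release")
worksheet["answers"]["What could go wrong?"] = "Release is deployed untested"

print("Still to discuss:")
for question in unanswered(worksheet):
    print(" -", question)
```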

But for meaty problems like IT system failure, the failure of a key process (such as software releases) and so forth, it is better to run a workshop or do some analysis using a fault tree.

[Image: fault tree diagram]

A fault tree simply breaks down a risk to show what could cause it.  But the real power is that it includes the words “and” and “or”.  So you can represent “this could be caused by a backup failure and either a failure to test the backup or a crash before the next test”.
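To make the “and”/“or” idea concrete, here is a small sketch that encodes the backup example as nested gates and checks whether a given combination of causes would trigger the top-level risk. The helper names (AND, OR, occurs) and the tuple representation are my own assumptions; a real fault tree tool would do far more.

```python
def AND(*children):
    """Gate that fires only if every child occurs."""
    return ("and", children)


def OR(*children):
    """Gate that fires if at least one child occurs."""
    return ("or", children)


def occurs(node, facts):
    """Evaluate a fault tree node against a dict of cause -> True/False."""
    if isinstance(node, str):  # a leaf, i.e. a basic cause
        return facts.get(node, False)
    gate, children = node
    results = [occurs(child, facts) for child in children]
    return all(results) if gate == "and" else any(results)


# The example from the text: a backup failure AND
# (a failure to test the backup OR a crash before the next test).
data_loss = AND(
    "backup failure",
    OR("backup not tested", "crash before next test"),
)

print(occurs(data_loss, {"backup failure": True,
                         "crash before next test": True}))
print(occurs(data_loss, {"backup failure": True}))
```

The first check reports that the risk materialises; the second does not, because a backup failure on its own is not enough under this tree.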

Once you have a giant list of causes then you want to go through the impacts.  People often just list one impact or label the impact “high, medium or low”, but in fact there can often be a range of impacts from trivial to catastrophic.  So a good way to workshop and represent this is to show the potential impacts in an event tree:

[Image: event tree diagram]

This helps the team prepare for the scary scenarios and also for the less scary (often more common) ones.
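As a rough sketch of how an event tree fans out, the code below walks every yes/no path through a few follow-up questions and labels each path from trivial to catastrophic. The questions and the severity labels are invented for illustration, and counting failed barriers is a simplification I have assumed; in a real workshop the team would judge each branch on its merits and prune branches that cannot happen.

```python
from itertools import product

# Hypothetical questions asked in order after the initiating event.
QUESTIONS = [
    "Was the failure detected quickly?",
    "Did the failover system work?",
    "Did the backup restore cleanly?",
]

# Hypothetical labels: the more barriers fail, the worse the outcome.
SEVERITY = ["trivial", "minor", "major", "catastrophic"]


def event_tree(questions):
    """List every yes/no path through the questions with a severity label."""
    paths = []
    for answers in product([True, False], repeat=len(questions)):
        failures = answers.count(False)
        label = SEVERITY[min(failures, len(SEVERITY) - 1)]
        paths.append((dict(zip(questions, answers)), label))
    return paths


paths = event_tree(QUESTIONS)
for answers, severity in paths:
    summary = ", ".join("yes" if answer else "no" for answer in answers.values())
    print(f"{summary} -> {severity}")
```

Even with only three questions you get eight distinct paths, which is a good reminder of why a single “high, medium or low” label hides so much.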

Finally we want to look at the controls and barriers that protect us from the risk and the actions we would take to detect and mitigate the problem if it does materialise.

This is particularly important because people often over-protect themselves from one part of the risk but fail to notice that they are heavily exposed to another cause, or they have no plan for a particular outcome.

The overall causes, barriers, mitigations and outcomes can then be displayed in a bow tie (or butterfly) diagram:

[Image: bow tie (butterfly) diagram]
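The same information can be held in a couple of simple tables, which makes it easy to check for the gaps mentioned above: causes with no barrier and outcomes with no plan. This is a minimal sketch; the function name and every cause, barrier, outcome and mitigation in it are made up for illustration.

```python
# Left side of the bow tie: hypothetical causes and the barriers against each.
barriers_by_cause = {
    "backup failure": ["daily backup report"],
    "backup not tested": [],
    "crash before next test": ["redundant hardware"],
}

# Right side of the bow tie: hypothetical outcomes and the mitigations for each.
mitigations_by_outcome = {
    "short outage": ["restart procedure"],
    "data lost for several days": [],
}


def bow_tie_gaps(barriers_by_cause, mitigations_by_outcome):
    """Return (causes with no barrier, outcomes with no mitigation)."""
    exposed = sorted(cause for cause, barriers in barriers_by_cause.items()
                     if not barriers)
    unplanned = sorted(outcome for outcome, plans in mitigations_by_outcome.items()
                       if not plans)
    return exposed, unplanned


exposed, unplanned = bow_tie_gaps(barriers_by_cause, mitigations_by_outcome)
print("Causes with no barrier:", exposed)
print("Outcomes with no plan:", unplanned)
```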

So there you have it – a vague structure to run a workshop or do some analysis and some cool (I think) ways to capture and display the information you generate.

Please call me if you would like my help, either running some of these workshops or coaching your crew in how to do it.  But also feel free to read through the attached case study if you would like to work out how to do it yourself.

Case study:

James King Randomcorp case study on risk management
