
Measuring the impact on production support

I was running a course in agile development when I mentioned that one of the good things about agile is being able to go live with something valuable sooner.

One of the class asked whether you can measure the benefit of going live sooner.  “Of course,” I replied, “and of course you should be doing so”.

Some of the group asked if we measured value in features deployed or some other way. So we had a good discussion around measuring value.

But then one of the group told us that her project was about “simplifying IT”, and her agile project manager had told her that, since the project was not adding any value to “the business”, the only real measure of success was whether they deployed the features they were supposed to deploy.

But this seemed a bit silly. So we agreed that adding value to IT was in fact adding value to the business, since IT is part of the business.

I told the class about my lazy measure for technical debt and asked if it was sufficient but several people thought we should be doing better. So then we discussed some measures that might be useful and we found that many of them were already there in the production support team.

Our main focus ended up being reliability, so I explained some ancient measures that were used in the production support team I ran in Babylonian times.

We agreed that reliability was made up of the following components:

| Category | Component | Measure and comments |
|---|---|---|
| Availability | Availability | Uptime = 1 – (downtime due to failure + downtime due to maintenance) / (total operating hours). Note that some systems had 24 x 7 operating hours but some were only meant to be available 8am to 6pm on weekdays. |
| Availability | Downtime due to failure | This was just the time the system was unavailable during normal operating hours. Some teams can measure this automatically, while others agreed they could “ping the system” during the day to see if it is up. |
| Availability | Downtime due to maintenance | Most of the class thought it was unfair to include scheduled maintenance time as downtime. But we agreed that if the project was “improving IT reliability” then one improvement should be needing less time for scheduled maintenance during operational hours. |
| Defects | Change due to project | Change in defects = (existing defects fixed by the project) – (new defects handed over by the project at go-live). Everyone said that they were measuring defects on projects, but they were not measuring the defects handed over to production, nor whether existing defects were being removed by projects. |
| Defects | Workarounds | Workaround improvement = (workarounds removed) – (workarounds introduced). We weren’t sure how to measure workarounds. We thought of simply asking “will the team and customers spend more or less time on workarounds?”, but a better measure might just be whether there were more or fewer after going live. |
| Speed loss | Speed loss | Speed loss = (time where response time falls below a threshold level) / (total operating hours). |
| Data | Data corruption rate | Data corruption = (number of data errors found + number of times data is not available) / (number of transactions run). Or data corruption rate = (number of errors reported) / (period of time). Some of the group claimed that data errors are caused by “the business” and not the support team, but we all agreed that a more reliable system would be one where less time was lost (or fewer errors made) because the system allowed incorrect data to be entered. |
| System suckiness | Hassle factor | Hassle factor = (time spent responding to errors + time spent monitoring or measuring) / (total time available). This can be measured by timesheets (yuck) or by a “10 finger vote”. For example, if I hold up 6 fingers for “time spent responding to errors this week (or day)” then I have spent 60% of my time doing that. |
| System suckiness | Value factor | Value factor = (time spent adding enhancements or giving the stakeholders something new that they wanted). |
| System suckiness | Self service change | Self service change = (things the project enables customers to do themselves that previously required a support call) – (new things that customers call the team to do). |
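
To show how a few of these formulas hang together, here is a minimal sketch in Python. The figures, the weekly reporting period, and the function names are all hypothetical, invented for illustration rather than anything the class actually built.

```python
# Hypothetical weekly figures for a system meant to be up 8am-6pm on weekdays
# (10 hours x 5 days = 50 operating hours). All names and numbers are made up.

OPERATING_HOURS = 10 * 5          # total operating hours in the week
downtime_failure_hours = 1.5      # time unavailable due to failures
downtime_maintenance_hours = 2.0  # scheduled maintenance during operating hours

def uptime(failure_hours, maintenance_hours, operating_hours):
    """Uptime = 1 - (failure downtime + maintenance downtime) / total operating hours."""
    return 1 - (failure_hours + maintenance_hours) / operating_hours

def defect_change(existing_fixed, new_handed_over):
    """Change in defects = existing defects fixed by the project - new defects handed over."""
    return existing_fixed - new_handed_over

def data_corruption_rate(data_errors, data_unavailable, transactions):
    """Data corruption = (errors found + times data is not available) / transactions run."""
    return (data_errors + data_unavailable) / transactions

print(f"Uptime: {uptime(downtime_failure_hours, downtime_maintenance_hours, OPERATING_HOURS):.1%}")
print(f"Defect change at go-live: {defect_change(existing_fixed=12, new_handed_over=5):+d}")
print(f"Data corruption rate: {data_corruption_rate(data_errors=3, data_unavailable=1, transactions=4000):.2%}")
```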


These are probably not the best measures available in the world, but most of the class agreed that they could measure these things – and that if they did, they could measure the impact of new projects on those measures.
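
Measuring the impact of a project is then just a before/after comparison of the same measure. A small sketch, again with made-up numbers, using the hassle factor as the example:

```python
# Hypothetical before/after readings of the hassle factor (fraction of the
# team's time spent responding to errors or monitoring). Numbers are invented.

def hassle_factor(error_hours, monitoring_hours, total_hours):
    """Hassle factor = (time on errors + time monitoring) / total time available."""
    return (error_hours + monitoring_hours) / total_hours

before = hassle_factor(error_hours=12, monitoring_hours=6, total_hours=40)  # week before go-live
after = hassle_factor(error_hours=7, monitoring_hours=5, total_hours=40)    # week after go-live

print(f"Hassle factor before go-live: {before:.0%}")
print(f"Hassle factor after go-live:  {after:.0%}")
print(f"Project impact: {after - before:+.0%}")  # negative means less hassle
```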

“But we are not measuring any of this,” said one of the class.

“I guess we have learned a lot since Babylonian times,” I said. “You must have much better measures in place nowadays.”

It appears, though, that some of the teams were not measuring reliability in production support at all, and that none of them were measuring the impact of projects on that reliability. Fair enough, but these measures are available if you want them.
