Anyone who has worked in medium or large organisations will know that there are three levels of change control when it comes to code: (a) the organisation doesn’t have any, (b) the organisation has change control but does it sub-optimally, and (c) change is managed well.

Anyone who has worked under more than one of these three levels will have seen that the closer you get to (c), the less change-induced disruption you experience. And yes, this probably sounds like common sense, but as with many aspects of real life, common sense turns out not to be as common as you’d like.

Example 1

Take company A as an example. The IT department had one of the most skilled, capable teams one could ever encounter – roles were well defined and each of the senior engineers was at the top of his or her game for the fields in which they worked. Yet things kept breaking, so senior management decided to see what was going wrong.

The core problems turned out to be twofold. First, changes that were being made were not planned very well – particularly with regard to how a change would be backed out if something unexpected went wrong. So, in the event of a problem, the engineers would have to think on the hoof to get things back up and running.

Second, any unexpected impact was often not with the thing that was being changed, but something completely different: something that wasn’t in the test plan and whose issues weren’t spotted until the next morning when a user of that system called the Service Desk.

The introduction of change control – which was made easier thanks to it being driven by a senior manager who left nobody in any doubt that using the process was non-negotiable – made an instant difference. This was because people had to write test plans, they had to have workable and tested rollback plans, and they had to get the sign-off of any business units that had a reasonable chance of being affected.
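To make those three requirements concrete, here is a minimal sketch of a pre-approval check. Everything in it is an assumption made for illustration: the ChangeRequest record, its field names and the missing_items helper are hypothetical and do not describe company A's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of what a change must supply before approval."""
    title: str
    test_plan: str = ""
    rollback_plan: str = ""
    rollback_tested: bool = False
    affected_units: set = field(default_factory=set)   # business units likely to be hit
    signed_off_by: set = field(default_factory=set)    # units that have approved


def missing_items(change):
    """Return the reasons a change is not yet ready for approval."""
    problems = []
    if not change.test_plan.strip():
        problems.append("no test plan")
    if not change.rollback_plan.strip():
        problems.append("no rollback plan")
    elif not change.rollback_tested:
        problems.append("rollback plan has not been tested")
    # Every unit with a reasonable chance of being affected must sign off.
    for unit in sorted(change.affected_units - change.signed_off_by):
        problems.append(f"no sign-off from {unit}")
    return problems


change = ChangeRequest(
    title="Upgrade core switch firmware",
    test_plan="Ping sweep, then check the finance and email systems",
    rollback_plan="Reinstall the previous firmware image",
    affected_units={"Finance", "Service Desk"},
    signed_off_by={"Finance"},
)
print(missing_items(change))
# ['rollback plan has not been tested', 'no sign-off from Service Desk']
```

The check itself is trivial. Its value lies in making the author answer each question before the change reaches a meeting.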

Before we go on, it is worth noting that the existence of the test plans, rollback plans and whatnot was not in itself the cause of improvement. What made the difference was that for the first time the engineers had to think about the whole piece of work: from inception through communication, implementation and testing. Plus, if necessary, rollback and regression testing. Even if you don’t write these things down, the act of just thinking about them properly can be a revelation.

There was one small speed bump in this particular organisation and it was one that I have seen in many others over the years: the change control meetings were undemocratic and one or two of the most vocal individuals tended to be able to guide approvals and rejections to fit their personal agendas. We’ll come back to that after the next example.

Example 2

This case is very similar to the first: a senior IT manager came to the same realisation that things were failing too often, and worked with his colleagues to introduce change control. The result, again, was a dramatic fall in unexpected failures, for precisely the same reasons.

In the second company, the manager who introduced change control – and who chaired the change meetings – had a lightbulb moment after a few weeks, realising that he had a conflict of interest because a large minority of the changes discussed were from his team. The answer? He approached a member of the risk team and asked them to run the process instead, as they seldom had an axe to grind.

It’s a variation on our speed bump from above: if there is an issue with conflict, or with strong-minded individuals trying to sway the meetings, give the change regime to an owner who is impartial and fair but firm and consistent, and give them the mandate to put people back in their boxes should it be needed.

The second company had a major difference from the first, incidentally. It was a consumer-facing company rather than (in the case of the first) servicing only businesses. This brought with it the customer-impact factor: technical changes could actually stop consumers’ services working.

Your secret weapon: The marketing wonk?

The Change Advisory Board (CAB) mainly included people of a technical inclination, but its most effective member was nothing of the sort: a non-techie from the marketing department. And this person would ask variations on the same two questions.

Firstly, “what does this actually mean?” Engineers completing change documentation have a tendency, if left unchallenged, to use technical words and inadvertently assume a certain level of understanding in the reader.

Not only did the person from marketing gain a better understanding over time, but one could see the palpable relief on the faces of some people around the room who also didn’t really understand the change fully but were too shy to ask.

Failure is not a dirty word

The second question was “what will the customer impact be?” to which two answers were required: the likely impact of the change going correctly, and the potential medium-to-worst-case impact of failure.

The latter of these points was the hardest to quantify, of course, as it’s largely a “how long is a piece of string?” question. But as with the documentation point mentioned earlier, even if you can’t put a solid figure on it, the question provokes debate, makes people think seriously about failure and manages expectation.

Which brings us neatly to the other key point: failure. People don’t like failure – it makes them feel bad, and it makes them worry that others think badly of them. Yet failure is inevitable from time to time. Although many unforeseen problems were in fact foreseeable if someone had taken the time to consider them, there are such things as really unforeseeable problems.

Optimism is fine until something explodes

When we build a new system or write a new piece of software, we plan the work to take place over a number of hours, days, weeks or months. And the process is pretty straightforward: design, implementation, testing, go-live.

Change planning is really not that much different – it’s just that rather than designing and building something, you might simply be applying upgrades or swapping out end-of-service hardware for the new version of the same thing. But that process is incomplete and if you spotted the mention of regression testing earlier, you probably realise why.

The problem is simple. Why do we test things before we let them go live? It’s obviously because we want to make sure they work correctly. But if we were sure that everything was going to work, we could simply skip the testing. So, in reality, we test things because we anticipate that there will be more than zero failures and that we might have to fix something.

Engineers are an optimistic bunch, so they tend toward believing that the change will go well. As a result, they may underestimate the time window required for testing, remediation, rollback and regression testing… right up until the point something breaks spectacularly and the change busts its planned window. Or in a more positive sense, until they realise without breaking something that testing and rollback are a big deal.

The first company in our example once carried out a huge change involving engineers and project managers from a number of companies working together on a conference call.

The implementation was planned to take 10 hours, with a further hour for testing; the rollback-and-retest plan was a whopping 19 hours. Nobody on that project was under any illusion that “Reverse the implementation plan” was adequate for a rollback plan, yet it’s something one sees all the time in many organisations. A key rule of change, then, is to include testing and remediation in the plan and the timescale.
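The arithmetic is worth spelling out. The sketch below uses the three figures from that change. It assumes, purely for illustration, that the worst case is a problem found only at the end of testing, so that the full rollback-and-retest plan runs after the implementation and test phases have already been spent. The function name is hypothetical.

```python
def required_window_hours(implementation, testing, rollback_and_retest):
    """Worst-case window: implement, test, find a problem, then roll back and retest.

    Assumption: the rollback starts only after implementation and testing
    have both run to completion.
    """
    return implementation + testing + rollback_and_retest


# Figures from the example: 10 hours to implement, 1 to test, 19 to roll back and retest.
optimistic = 10 + 1
worst_case = required_window_hours(10, 1, 19)

print(optimistic)   # 11
print(worst_case)   # 30
```

Under that assumption, a window approved on the optimistic figure alone would cover barely a third of what the worst case needs.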

What does success even look like?

This leads us to our final point: success. What, in the context of change, does “success” mean? Or more to the point, what does “failure” mean?

It would be easy to define success as the change going ahead with zero impact to users or customers. That would be wrong, though, because some changes cannot happen without downtime. For example, a successful exercise to replace the wheels on your car has an inevitable outage, as it can’t be done as you are driving it.

It is perhaps better to consider “completion” rather than success. If the change has gone ahead, and it was done within the approved time window, and the impact was no more than that predicted, we can put our hands on our hearts and deem it “completed”.

We should be honest and consider whether it was completed successfully (which means it went exactly as planned) or completed with some issues (that is, we hit a couple of snags which we were able to fix without breaking anything else or busting the time window).

Turn and face the strange

The latter point gives us scope to learn from what went awry and do better next time. If we had to roll the change back but it was still within the approved criteria, then we label it “rolled back”. If none of the above applies, then it “failed”… and yes, failure includes the case where you exceeded the time window or you impacted more than you predicted.
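Those four labels can be read as a small decision procedure. The following is one possible sketch of it. The Outcome names and the classify function are hypothetical, and the order of the checks is an assumption drawn from the descriptions above: anything outside the approved window or predicted impact counts as failed, whatever else happened.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED_SUCCESSFULLY = "completed successfully"
    COMPLETED_WITH_ISSUES = "completed with issues"
    ROLLED_BACK = "rolled back"
    FAILED = "failed"


def classify(within_window, impact_within_prediction, rolled_back, snags_hit):
    """Label a finished change using the four categories described above."""
    # Busting the time window or exceeding the predicted impact is always a failure.
    if not (within_window and impact_within_prediction):
        return Outcome.FAILED
    # Backed out, but still inside the approved criteria.
    if rolled_back:
        return Outcome.ROLLED_BACK
    # Went ahead, with snags that were fixed along the way.
    if snags_hit > 0:
        return Outcome.COMPLETED_WITH_ISSUES
    return Outcome.COMPLETED_SUCCESSFULLY


print(classify(True, True, False, 0).value)   # completed successfully
print(classify(True, True, False, 2).value)   # completed with issues
print(classify(True, True, True, 1).value)    # rolled back
print(classify(False, True, False, 0).value)  # failed
```

Note that a rollback inside the window is not a failure here, which matches the distinction drawn above.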

In every organisation I have worked with that has introduced change control, or has evolved and improved an existing change regime over time, change has become more successful and less impactful.

It is incredibly easy to introduce a basic change control structure, so if you don’t have one, perhaps now is the time to do it.

Even if it only gets you to level (b) from the first sentence of this article (“the organisation has change control but does it sub-optimally”), that’s a world better than (a). ®
