Major global IT outage

The importance of criticality (not risk) in due diligence

The recent major IT outage that affected systems around the world highlights the importance of criticality and resilience. 

If you have heard Richard and me talk before, you will know that, as due diligence engineers, we continually speak about the importance of criticality rather than risk.  That is, we focus on those issues that are critical to your organisation regardless of their likelihood.

Richard and I were actually in the air, flying back from New Zealand, when the failure occurred.  We returned to see chaos at airports around the world, with flights cancelled as ticketing and check-in systems failed.  Reverting to manual check-in seemed difficult, which suggests that there wasn’t an effective back-up system in place.

Media reports apparently described this outage as an accident.

Yes, we agree that the event was not deliberate.  But an IT outage, for whatever reason, is certainly a foreseeable event with the potential to impact organisations.  We would therefore not consider it an accident, but a credible critical threat.

Many of these critical issues in organisations are still being considered and managed on a risk basis, that is, through the simultaneous appreciation of likelihood and consequence.  This is primarily driven by the financial side of the business, with organisations always trying to optimise return.  What organisations try to do is achieve the greatest system availability at the least cost.

Now the problem with that, of course, is that if you're relying on a single system, you can't gold plate that single system so that it will never fail.  You just can't stop all single points of failure. 

Gold plating can potentially provide some system resilience but it will never provide system redundancy.  Therefore, if you're down to one system and it fails, then you must have a backup. 

That means we are talking about system redundancy: making sure that you've got another system that doesn't rely on the primary system, so that if the primary system does fail, there is a genuinely independent backup.

There are lots of different ways to do it.  However, the only way we know to get high availability at low cost is to have two redundant systems in parallel.
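To put rough numbers on the parallel-redundancy point, here is a minimal sketch of the standard availability arithmetic.  It assumes the two systems fail independently of each other, which is exactly what a genuinely independent backup is meant to achieve.  The 99% figure and the function names are illustrative assumptions, not measurements from any real system.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` systems in parallel.

    Assumes the systems fail independently and that service continues
    as long as at least one of them is working.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one system")
    # Service is lost only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    one_system = 0.99  # hypothetical availability of a single system
    two_systems = parallel_availability(one_system, copies=2)
    print(f"single system: {expected_downtime_hours(one_system):.2f} h/year down")
    print(f"two in parallel: {expected_downtime_hours(two_systems):.2f} h/year down")
```

Under these assumptions, a single 99% system is down for roughly 87.6 hours a year, while two such systems in parallel are both down for under an hour a year.  The benefit disappears if the two systems share a common dependency, because a single fault can then take out both at once.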

In conclusion, for organisations that rely on IT systems to deliver their products and services, such as airlines and health services, a major outage is a critical issue and must be (seen to be) managed accordingly from a due diligence perspective.

Listen to Richard & Gaye discuss the IT Outage in this Risk! Engineers Talk Governance episode.

 

 
