18 Sunday Jun 2017
Posted in Business Continuity, Cloud, Innovation, SAAS
06 Tuesday Jan 2015
Posted in Business Continuity, Digital Transformation, Innovation
This post is the second part of the airport chaos series.
The inquiry terms explained.
The terms are listed in 5 bullet points, which require some clarification from a Service Management perspective:
1. Root cause of the incident:
This means getting to the bottom of what happened, and in particular determining the context that created the conditions which eventually led to the service interruption. A good analogy is the analysis performed by firefighters after a blaze. Let’s say that several people were killed by the smoke (the cause of death); the root cause of the fire, however, could have been a short circuit in a multi-socket adaptor caused, for example, by human error (too many power-hungry appliances plugged into the same socket).
Special attention will need to be given to recent changes (change management, including change control) which may have occurred in the 30 days or so before the incident.
This process is often called a ‘Post Mortem’, although I understand that the term is not used in this type of industry.
2. Incident management:
The CAA will need to look at how the incident was handled. From an IT management perspective, the review should cover incident occurrence, system monitoring (incident detection and incident reporting), service stable, service All Clear, and finally incident communication throughout service restoration. It should also cover whether any of these phases could have been handled more quickly.
3. Problem management (Part 1), i.e. lessons learned from the previous incident (December 2013), and in particular whether all the action items listed as part of the previous incident analysis were properly closed and whether any of the root causes of the 2013 incident played a role in the 2014 one.
4. Business continuity plan & capacity planning: “A review of the levels of resilience and service that should be expected across the air traffic network taking into account relevant international benchmarks”.
This is interesting given the communication made by NATS that the “back-up plans and procedures worked on Friday exactly as they were designed to”.
5. Problem management (Part 2) “Further measures to avoid technology or process failures in this critical national infrastructure”:
Basically putting together a plan to resolve (as opposed to mitigate) all underlying causes and root causes which led to the chaos. Note that this plan will be reviewed during the next incident investigation (see Problem Management part 1).
In addition to completing the problem management, NATS have requested that CAA put together a plan addressing the new requirements listed in the upcoming Business Continuity plan “Further measures to reduce the impact of any unavoidable disruption”.
In this case, they will be considering not only the actions required to prevent a similar incident from occurring again but also the actions required to address the other risks they identify during the root cause analysis. Furthermore, they will put together a plan to increase resilience and business continuity (see capacity management) during major outages, taking the Recovery Point Objective into account.
The CAA is expecting to publish the report by 29 March 2015.
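As an illustration of the change review mentioned under item 1, here is a minimal sketch of how recent changes could be shortlisted for a root cause analysis. The incident date comes from the post; the change records, their identifiers and the `changes_in_window` helper are hypothetical and do not describe any real NATS change log.

```python
from datetime import date, timedelta

# Incident date taken from the post; the window mirrors "circa 30 days".
INCIDENT_DATE = date(2014, 12, 12)
REVIEW_WINDOW = timedelta(days=30)

# Hypothetical change records, invented purely for illustration.
changes = [
    {"id": "CHG-001", "deployed": date(2014, 10, 20), "summary": "placeholder change A"},
    {"id": "CHG-002", "deployed": date(2014, 11, 25), "summary": "placeholder change B"},
    {"id": "CHG-003", "deployed": date(2014, 12, 10), "summary": "placeholder change C"},
]


def changes_in_window(records, incident_date, window):
    """Return the changes deployed within `window` before the incident, newest first."""
    start = incident_date - window
    recent = [c for c in records if start <= c["deployed"] <= incident_date]
    return sorted(recent, key=lambda c: c["deployed"], reverse=True)


candidates = changes_in_window(changes, INCIDENT_DATE, REVIEW_WINDOW)
for change in candidates:
    print(change["id"], change["deployed"].isoformat(), change["summary"])
```

Listing the newest changes first reflects the common assumption that the most recent change is the first suspect, but every change in the window would still need to be reviewed.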
15 Monday Dec 2014
On the 12th of December at 3:24pm, NATS, the UK-based global air traffic management company, confirmed that a system outage occurred at Swanwick air traffic control centre. “The outage made impossible for the controllers to access data regarding individual flight plans and UK airspace has not been closed, but airspace capacity has been restricted in order to manage the situation”.
On the 14th of December, NATS declared in a statement that the “back-up plans and procedures worked on Friday exactly as they were designed to and the NATS system was back up and running 45 minutes after the event”.
OK, so why has the Transport Secretary, Patrick McLoughlin, described the shutdown of much of southern England’s airspace as “simply unacceptable”?
According to the NATS annual report and accounts 2014, NATS “operates under licence from the Secretary of State for Transport”.
On page 35 of the same report, in the chapter “Principal risks and uncertainties”, the very first consequence mentioned under “Loss of service from an air traffic control centre” is that it would “result in a loss of revenues”. The report also states that “NATS has invested in developing contingency arrangements which enables the recovery of its service capacity”.
Later on, under the “Operational systems’ resilience” section, we can read that in order “to mitigate the risk of service disruption, NATS regularly reviews the resilience of its operational systems”.
It is very surprising that the loss of revenues is mentioned whereas nothing is included on Service Level Objectives/Agreements (SLO, SLA) and Recovery Point Objectives (RPO).
In their statement, NATS mentions that their systems were back up and running 45 minutes after the event.
According to the FT, “Air passengers faced continued disruption into the weekend even after Nats declared its systems were back to full operational capacity. A spokesman for Heathrow said 38 flights were cancelled before 9.30am on Saturday “as a knock on from yesterday””.
From a service impact perspective, the service couldn’t be called ALL CLEAR until at least the 13th of December at 9:30am, when 38 flights had still been cancelled. The service impact therefore lasted at least 18 hours, but we will need the full investigation to find out the full extent of the problem.
What does it mean that “back-up plans and procedures worked on Friday exactly as they were designed to”? It means that the investment decisions, risk assessments and business recovery plans were designed, validated, tested and approved for a system outage of at least 45 minutes with reduced capacity.
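The 18-hour figure can be checked with a few lines of arithmetic. This is a rough sketch only: it assumes the clock starts at 3:24pm, the time NATS confirmed the outage (the actual start of the failure may have been earlier), and it treats 9:30am on Saturday as the last known point of impact. The variable names and the `hours_minutes` helper are mine.

```python
from datetime import datetime, timedelta

# Times taken from the post. Assumption: the outage is measured from the
# moment NATS confirmed it, not from the (unknown) moment it actually began.
outage_confirmed = datetime(2014, 12, 12, 15, 24)
system_restored = outage_confirmed + timedelta(minutes=45)
last_known_impact = datetime(2014, 12, 13, 9, 30)

system_outage = system_restored - outage_confirmed
service_impact = last_known_impact - outage_confirmed


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


print("System outage: %dh %02dm" % hours_minutes(system_outage))
print("Service impact (lower bound): %dh %02dm" % hours_minutes(service_impact))
```

The gap between the two durations is the difference between "system restored" and "service All Clear": the system was reported back after 45 minutes, while passengers were still affected more than 18 hours later.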
According to Telegraph.co.uk, “Britain is now the only country that is using the 1960s software. It is due to be replaced with a new system from a Spanish company in about two years, but until then they will just have to manage.”
At least we know!