Tags
Airport Chaos, Business Transformation, Capacity Management, capacity planning, NATS, post mortem, service delivery, service design, Service Management, service strategy
I’m following up on my last post (read it for full details) about the airport chaos that occurred on 12 December 2014 at the Swanwick air traffic control centre.
What happened?
NATS, the UK-based global air traffic management company, declared that the system was back up and running 45 minutes after the failure.
An independent inquiry.
On 15 December 2014, NATS announced that the UK Civil Aviation Authority (CAA) would carry out “an independent inquiry following the disruption caused by the failure in air traffic management systems”.
On the NATS website, there is only a mention of the high-level plan for the independent inquiry (I’ll explain it in a new post tomorrow). However, drilling down into the CAA’s website, I was able to find the inquiry’s terms of reference.
Timelines of events and service impacts.
As provided by the CAA, this is where it gets interesting:
1. Service outage started at approximately 1515 GMT, when “the fault in a primary and back-up system led to a failure in the flight data server in the Area Control (AC) Operations Room”.
2. Service restoration started: “Restrictions were gradually lifted from approximately 1605 GMT with a rapid recovery to full capacity by the middle of the Friday evening”.
These timings are not precise; however, the CAA provides more insight into the true service impact.
The CAA confirms that “Delays and cancellations were incurred totalling some 16,000 MINUTES”. The 45 minutes initially reported represented only system downtime, not service impact measured from a business perspective.
“Airlines cancelled around 80 flights: estimated to be 2,000 minutes as a consequence of the restrictions put in place to manage traffic”.
A further 14,000 minutes came as a “result of the phased recovery to prevent overloads and takes account of ground congestion at the major airports”.
“Overall around 450 aircraft were delayed of the 6,000 handled on the 12 December and the average delay to the 450 flights was approximately 45 minutes”.
The CAA also reminds us of a previous “failure affecting the same operations room at Swanwick on 7 December 2013, which resulted in total delay amounting to 126,000 minutes and which impacted 1,412 flights and the cancellation of 300”.
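As a sanity check, the CAA’s breakdown can be reproduced with simple arithmetic. This is a sketch using only the figures quoted above:

```python
# Delay figures quoted by the CAA, all in minutes.
cancellation_delay = 2_000      # ~80 cancelled flights
phased_recovery_delay = 14_000  # phased recovery plus ground congestion
total_delay = cancellation_delay + phased_recovery_delay
print(total_delay)  # 16000, the CAA's headline figure

# The 45 minutes of system downtime understates the business impact
# by a factor of several hundred.
system_downtime = 45
print(round(total_delay / system_downtime))  # ~356x
```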
NATS made a mistake by not communicating its progress over the full course of the service recovery, and ultimately the total impact on service uptime.
Service Performance.
Remember that no one cares whether your servers are up and running when none, or only some, of your customers can access your IT services: SAP, email, document management, the corporate internet site or an ecommerce site. The same applies here: the systems were back up, but aircraft could not take off and passengers were badly impacted.
Your service performance should always be measured as perceived from your customer’s point of view, not from the perspective of a piece of infrastructure being up.
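A minimal sketch of the difference, combining the CAA’s flight figures with the reported 45-minute downtime: infrastructure uptime and customer-perceived service delivery are two different numbers measuring two different things.

```python
MINUTES_IN_DAY = 24 * 60

# Infrastructure view: the system was "down" for only 45 minutes.
system_downtime = 45
infrastructure_availability = 1 - system_downtime / MINUTES_IN_DAY

# Customer view: 450 of the 6,000 flights handled were delayed (CAA figures).
flights_handled = 6_000
flights_delayed = 450
on_time_rate = 1 - flights_delayed / flights_handled

print(f"Infrastructure uptime:    {infrastructure_availability:.1%}")  # 96.9%
print(f"On-time service delivery: {on_time_rate:.1%}")                 # 92.5%
```

The gap widens further once knock-on effects (cancellations, missed connections) are counted, which is exactly why uptime alone is a misleading service metric.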
A service very rarely operates in isolation; it operates within an ecosystem made up of your own capabilities and what your suppliers and partners deliver. Did the hotels and restaurants have enough vacancies to welcome the stranded passengers?
Once services have been restored, everyone should remain concerned about customers still suffering from the consequences, such as holidays cancelled (and potentially not reimbursed), or, as after Black Friday (see my post), products delivered after Christmas, long after services were actually restored and stable. Will the whole actual cost of the outage at the Swanwick air traffic control centre ever be known? I doubt it.
However, NATS has already announced that “there will be a financial consequence for the company from the delay caused. Under the company’s regulatory performance regime, customers will receive a rebate on charges in the future”.
Capacity management.
The NATS Managing Director of Operations declared during the incident resolution: “These things are relatively rare. We are a very busy island for air traffic control, so we’re always going to be operating near capacity”.
This is a very concerning statement. Having a service impacted by a shortage of capacity is not uncommon (I’m not saying it is satisfactory) when either capacity requirements aren’t properly expressed by the business, or when those requirements aren’t adequately translated into an efficient technical design. However, it is the responsibility of the CIO to properly and efficiently document and communicate the risks incurred by a potential shortage of capacity.
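The danger of “always operating near capacity” can be illustrated with a basic queueing result: as utilisation approaches 100%, delay grows non-linearly, so even a small loss of capacity produces a disproportionate impact. A minimal sketch using an M/M/1 approximation with made-up numbers (not NATS data):

```python
def avg_time_in_system(demand: float, capacity: float) -> float:
    """M/M/1 queue: average time in system W = 1 / (mu - lambda)."""
    assert demand < capacity, "demand must stay below capacity"
    return 1.0 / (capacity - demand)

capacity = 100.0  # hypothetical: 100 aircraft movements per hour
for demand in (50, 80, 90, 95, 99):
    wait_min = avg_time_in_system(demand, capacity) * 60
    print(f"utilisation {demand / capacity:.0%}: average delay {wait_min:.1f} min")
    # At 99% utilisation the average delay is 50x that at 50% utilisation.
```

This is why “we’re always going to be operating near capacity” is not a reassurance: it means the system has almost no headroom to absorb a fault before delays explode.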
Vince Cable declared that the incident was caused by a lack of IT investment. Well, the question now is whether the investments were submitted and refused. The inquiry will need to determine whether the risks of running “at capacity” were properly communicated to the board.
The CAA expects to publish the report by 29 March 2015.