
Ed Boris

~ Expert in digital transformation


Tag Archives: Service Management

British Airways Technology Chaos, really caused by a power surge?

18 Sunday Jun 2017

Posted by Edouard Boris in Business Continuity, Cloud, Innovation, SAAS


Tags

Capacity Management, Incident Management, Service Management

Most people are questioning BA’s version of how their entire information system went down on 27 May 2017, impacting 75,000 passengers for up to 48 hours and at a cost expected to reach £80m.

British Airways states that human error, an engineer disconnecting a power supply cable, triggered the major outage by causing a power surge.
The question is how such an outage lasted so long. The term “power surge” is misleading, because most people think of power in terms of electricity, as opposed to the information ecosystem.
In terms of service outage communication, the challenge is to inform without revealing embarrassing facts: to tell part of the truth without lying. In this instance, I must admit that BA is doing a good job.
My theory is that BA’s systems crashed as a result of the power outage, but that BA’s team did not restart the ecosystem in sequence. My assumption is that BA’s systems were all restarted simultaneously, causing what they have called the “power surge”. The question is whether BA had a datacentre restart runbook and, if that documentation existed, whether it was ever tested.
Complex ecosystems require key infrastructure components to be restarted in a pre-established sequence: for example, core storage first, then the database caching infrastructure, followed by the database systems. This is even more true with architectures based on microservices.
In other words, backend systems should be restarted first, followed by frontends. If you do not follow a pre-established sequence, the different components of the ecosystem will randomly resume their operations, start “talking” and expect answers. An unsynchronised datacentre restart is likely to end in data corruption. Furthermore, as the front-end caching infrastructure is not warm, the backend will crash under the load, preventing the reopening of services.
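To make this concrete, below is a minimal sketch of how a restart runbook could encode such a dependency order. The component names and dependencies are invented for illustration; Python’s standard graphlib provides the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up
# and healthy before it is allowed to restart.
dependencies = {
    "core-storage": [],
    "db-cache": ["core-storage"],
    "database": ["db-cache"],
    "booking-backend": ["database"],
    "booking-frontend": ["booking-backend"],
}

# A safe restart order: storage first, frontends last.
for component in TopologicalSorter(dependencies).static_order():
    # A real runbook would wait for a health check to pass here
    # before moving on to the next component.
    print(f"restart {component}")
```

A runbook like this only helps if it is rehearsed: the sequencing, and the health checks between steps, are exactly what an untested document fails to capture.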
If this scenario happened at BA, the databases storing flight reservations, flight plans and customer details got corrupted to the point where it became impossible to resume operations from the second datacentre, itself now partially corrupted as a result of the active-active synchronisation between the two datacentres.

British Airways then had no option but to restore backups, replay the system logs of the unsynchronised systems, and only then resume synchronisation with the second datacentre.

Obviously, this is a much more difficult reality to explain, but I have talked to several IT experts and no one, absolutely nobody, is buying the power surge story.
I’m looking forward to the findings of the internal investigation that BA’s chief executive has already launched.

Airport chaos – The inquiry terms explained

06 Tuesday Jan 2015

Posted by Edouard Boris in Business Continuity, Digital Transformation, Innovation


Tags

Capacity Management, Incident Management, post mortem, service delivery, Service Management

This post is the second part of my airport chaos series.

The inquiry terms explained.

The terms are listed as five bullet points, each requiring some clarification from a Service Management perspective:

1. Root cause of the incident:

Basically, getting to the bottom of what happened, and in particular determining the context that created the conditions eventually leading to the service interruption. A good analogy is the analysis performed by firefighters after a blaze. Say several people were killed by the smoke (the cause of death); the root cause of the fire could nevertheless have been a short-circuit in a multi-socket adaptor caused, for example, by human error (too many power-hungry appliances plugged into the same socket).

Special attention will need to be given to recent changes (change management, including change control) which may have occurred during the circa 30 days prior to the incident.

This process is often called a ‘post mortem’, although I understand that in this industry the term is not used.

2. Incident management:

The CAA will need to look at how the incident was handled. From an IT management perspective, this should cover incident occurrence, system monitoring (incident detection and incident reporting), service stabilisation, the service all-clear, and incident communication throughout service restoration. It should also establish whether any of these phases could have been handled quicker.
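As an illustration only, here is a sketch of how the duration of each of these phases could be measured from an incident timeline. The timestamps are loosely based on the figures quoted in my other posts on this incident; the detection and reporting times are invented.

```python
from datetime import datetime

# Hypothetical incident milestones.
timeline = {
    "occurrence": datetime(2014, 12, 12, 15, 15),
    "detection":  datetime(2014, 12, 12, 15, 18),   # invented
    "reporting":  datetime(2014, 12, 12, 15, 24),
    "stable":     datetime(2014, 12, 12, 16, 5),
    "all clear":  datetime(2014, 12, 13, 9, 30),
}

# Duration of each phase: the gap between consecutive milestones.
names = list(timeline)
for earlier, later in zip(names, names[1:]):
    print(f"{earlier} -> {later}: {timeline[later] - timeline[earlier]}")
```

Laying the phases out this way makes “could it have been handled quicker?” a question about specific gaps rather than about the incident as a whole.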

3. Problem management (Part 1), i.e. lessons learned from the previous incident (December 2013): in particular, whether all the action items listed in the previous incident analysis were properly closed, and whether any of the root cause(s) of the 2013 incident played a role in the 2014 one.

4. Business continuity plan & capacity planning: “A review of the levels of resilience and service that should be expected across the air traffic network taking into account relevant international benchmarks”.

This is interesting given NATS’ statement that the “back-up plans and procedures worked on Friday exactly as they were designed to”.

5. Problem management (Part 2) “Further measures to avoid technology or process failures in this critical national infrastructure”:

Basically, putting together a plan to resolve (as opposed to mitigate) all the underlying causes and root causes which led to the chaos. Note that this plan will be reviewed during the next incident investigation (see Problem management, Part 1).

In addition to completing the problem management, the inquiry calls for a plan addressing the new requirement of the upcoming business continuity plan: “Further measures to reduce the impact of any unavoidable disruption”.

In this case, they will consider not only the actions required to prevent a similar incident from occurring again, but also the actions required to address the other risks identified during the root cause analysis. Furthermore, they will put together a plan to increase resilience and business continuity (see capacity management) during major outages, taking the Recovery Point Objective (RPO) into account.

The CAA expects to publish the report by 29 March 2015.

Airports chaos: Why the service impact lasted 16,000 minutes rather than 45 minutes as initially reported.

05 Monday Jan 2015

Posted by Edouard Boris in Business Continuity, Cloud, Cyber, Digital Transformation, RightSourcing


Tags

Airport Chaos, Business Transformation, Capacity Management, capacity planning, NATS, post mortem, service delivery, service design, Service Management, service strategy

I’m following up on my last post (read it for full details) about the airport chaos which occurred on 12 December 2014 at the Swanwick air traffic control centre.

What happened?

NATS, the UK-based global air traffic management company, declared that the system was back up and running 45 minutes after the failure.

An independent inquiry.

On 15 December 2014, NATS declared that the UK Civil Aviation Authority (CAA) would carry out “an independent inquiry following the disruption caused by the failure in air traffic management systems”.

On the NATS website, there is only a mention of the high-level plan for the independent inquiry (I’ll explain it in a new post tomorrow). However, drilling down into the CAA’s website, I was able to find the inquiry’s terms of reference.

Timelines of events and service impacts.

As provided by the CAA, this is where it gets interesting:

1. The service outage started at approximately 1515 GMT, when “the fault in a primary and back-up system led to a failure in the flight data server in the Area Control (AC) Operations Room”.

2. Service restoration started: “Restrictions were gradually lifted from approximately 1605 GMT with a rapid recovery to full capacity by the middle of the Friday evening”.

This is not precise timing; however, the CAA provides more insight into the true service impact.

The CAA confirms that “Delays and cancellations were incurred totalling some 16,000 MINUTES”. The 45 minutes initially reported represented only the system downtime, not the service impact measured from a business perspective.


“Airlines cancelled around 80 flights: estimated to be 2,000 minutes as a consequence of the restrictions put in place to manage traffic”.

A further 14,000 minutes were lost as a “result of the phased recovery to prevent overloads and takes account of ground congestion at the major airports”.

“Overall around 450 aircraft were delayed of the 6,000 handled on the 12 December and the average delay to the 450 flights was approximately 45 minutes”.

The CAA reminds us of a “failure affecting the same operations room at Swanwick on 7 December 2013, which resulted in total delay amounting to 126,000 minutes and which impacted 1,412 flights and the cancellation of 300”.

NATS made a mistake by not communicating progress throughout the full service recovery, and ultimately the total impact on service uptime.

Service Performance.

Remember that no one cares whether your servers are up and running when none, or only some, of your customers can access your IT services, for example SAP, email, document management, the corporate internet site or the ecommerce site. It is the same here: the systems were up, but aircraft could not take off and passengers were badly impacted.

Your service performance should always be measured as perceived from your customer’s point of view, not from the perspective of a piece of infrastructure being up.
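A toy sketch of the difference, with invented monitoring samples: the same period can show healthy infrastructure uptime while customer-perceived availability, weighted by the customers actually able to use the service, tells another story.

```python
# Invented monitoring samples: were the servers "up", and what
# fraction of customers could actually use the service?
samples = [
    {"server_up": True,  "customers_served": 1.0},
    {"server_up": True,  "customers_served": 0.1},  # up, yet barely usable
    {"server_up": False, "customers_served": 0.0},
    {"server_up": True,  "customers_served": 0.9},
]

infra_uptime = sum(s["server_up"] for s in samples) / len(samples)
perceived = sum(s["customers_served"] for s in samples) / len(samples)

print(f"infrastructure uptime:           {infra_uptime:.0%}")  # 75%
print(f"customer-perceived availability: {perceived:.0%}")     # 50%
```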

A service very rarely operates in isolation; it operates within an ecosystem made of your own capabilities and of what your suppliers and partners deliver within that ecosystem. Did the hotels and restaurants have enough vacancies to welcome the stranded passengers?

Once services have been restored, everyone should remain concerned about the customers still suffering from the consequences, such as holidays cancelled (and potentially not reimbursed), or, as after Black Friday (see my post), products delivered after Christmas, long after services were actually restored and stable. Will the whole actual cost of the outage at the Swanwick air traffic control centre ever be known? I doubt it.

However, NATS has already announced that “there will be a financial consequence for the company from the delay caused. Under the company’s regulatory performance regime, customers will receive a rebate on charges in the future”.

Capacity management.

The NATS managing director of operations declared during the incident resolution: “These things are relatively rare. We are a very busy island for air traffic control, so we’re always going to be operating near capacity”.

This is a very concerning statement. Having a service impacted by a shortage of capacity is not uncommon (I’m not saying it is satisfactory) when either the capacity requirements aren’t properly expressed by the business, or those requirements aren’t adequately translated into an efficient technical design. However, it is the responsibility of the CIO to properly and efficiently document and communicate the risks incurred by a potential shortage of capacity.

Vince Cable declared that the incident was caused by a lack of IT investment. The question now is whether the investments were submitted and refused. The inquiry will need to determine whether the risks of running “at capacity” were properly communicated to the board.

The CAA expects to publish the report by 29 March 2015.

Airports chaos: “Back-up plans worked exactly as they were designed to”

15 Monday Dec 2014

Posted by Edouard Boris in Business Continuity, Digital Transformation, Financing Decision, Innovation, Risk management


Tags

NATS, Service Management

On 12 December at 3:24pm, NATS, the UK-based global air traffic management company, confirmed that a system outage had occurred at the Swanwick air traffic control centre. “The outage made [it] impossible for the controllers to access data regarding individual flight plans and UK airspace has not been closed, but airspace capacity has been restricted in order to manage the situation”.

On 14 December, NATS declared in a statement that the “back-up plans and procedures worked on Friday exactly as they were designed to and the NATS system was back up and running 45 minutes after the failure”.

OK, so why has the Transport Secretary, Patrick McLoughlin, described the shutdown of much of southern England’s airspace as “simply unacceptable”?

According to the NATS  annual report and accounts 2014, NATS “operates under licence from the Secretary of State for Transport“.

On page 35 of the same report, in the chapter “Principal risks and uncertainties”, the very first consequence mentioned under “Loss of service from an air traffic control centre” is a “loss of revenues”; the report also notes that “NATS has invested in developing contingency arrangements which enables the recovery of its service capacity”.

Later on, under the “Operational systems’ resilience” section, we can read that in order “to mitigate the risk of service disruption, NATS regularly reviews the resilience of its operational systems“.

It is very surprising that the loss of revenues is mentioned whereas nothing is included on Service Level Objectives/Agreements (SLOs, SLAs) or Recovery Point Objectives (RPOs).

In their statement,  NATS mentions that their systems were back up and running 45 minutes after the event.

According to the FT, “Air passengers faced continued disruption into the weekend even after Nats declared its systems were back to full operational capacity. A spokesman for Heathrow said 38 flights were cancelled before 9.30am on Saturday ‘as a knock on from yesterday’”.

From a service impact perspective, the service couldn’t be called ALL CLEAR until at least 9:30am on 13 December, when 38 flights had still been cancelled. The service impact therefore lasted at least 18 hours, but we will need the full investigation to find out the full extent of the problem.
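A quick back-of-the-envelope check of that figure, using the two timestamps quoted above (outage confirmed at 3:24pm on 12 December, knock-on cancellations still being reported at 9:30am on 13 December):

```python
from datetime import datetime

outage_confirmed = datetime(2014, 12, 12, 15, 24)
last_knock_on = datetime(2014, 12, 13, 9, 30)

impact = last_knock_on - outage_confirmed
print(impact)  # 18:06:00, i.e. just over 18 hours of service impact
```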

What does it mean that the “back-up plans and procedures worked on Friday exactly as they were designed to”? It means that the investment decisions, risk assessments and business recovery plans were designed, validated, tested and approved for a system outage of at least 45 minutes with reduced capacity.

According to Telegraph.co.uk, “Britain is now the only country that is using the 1960s software. It is due to be replaced with a new system from a Spanish company in about two years, but until then they will just have to manage.”

At least we know!

Black Friday

01 Monday Dec 2014

Posted by Edouard Boris in Black Friday 2014


Tags

Black Friday, Business Continuity, Capacity Management, Service Management


According to the BBC, Amazon’s “website recorded orders for more than 5.5 million goods, with about 64 items sold per second.”

Meanwhile, Tesco Direct, Argos, Currys and John Lewis implemented queuing on their sites.

On Tesco Direct, the queuing technique involved a rolling 30-second period during which your browser would try to access the site. If unsuccessful, the browser would start yet another 30-second period.
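A minimal sketch of what such a client-side rolling retry amounts to; the URL is a placeholder, and the mechanism is my reading of the observed behaviour, not Tesco’s actual implementation.

```python
import time
import urllib.request

def wait_for_admission(url, period=30, max_attempts=20):
    """Keep retrying in rolling 30-second periods until the site lets us in."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=period) as response:
                return response.read()  # admitted to the shop
        except OSError:
            pass  # busy, refused or timed out; queue for another period
        time.sleep(period)  # start the next 30-second period
    raise RuntimeError("still queuing after all attempts")

# page = wait_for_admission("https://example.com/shop")  # placeholder URL
```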


Argos had simply put a holding page up, informing their customers of the massive traffic and asking them to try again.


In the case of Tesco Direct and Argos, the tactic was aimed at protecting the experience of the customers already on the site.

Basically, the retailers implemented an e-bouncer at the entrance to ensure that, if you were lucky enough to be in the online shop, you could still move around and buy stuff. And if you were kept outside the online shop, you had to wait for someone to leave the shop before being allowed in.

The objective is to prevent the entire online shop from collapsing, at which point no sales could be processed.

This tactic is required when demand exceeds the technical capacity (software, infrastructure, network) of the online shop.
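Conceptually, the e-bouncer is plain admission control: a fixed number of slots, and a new visitor is admitted only when a current one leaves. A minimal sketch with a semaphore; the capacity figure and the session handler are invented.

```python
import threading

MAX_CONCURRENT_SHOPPERS = 5000  # invented capacity figure
slots = threading.BoundedSemaphore(MAX_CONCURRENT_SHOPPERS)

def serve_shopping_session(session):
    """Placeholder for the real shop logic."""
    return "shopping"

def handle_visitor(session):
    # Non-blocking acquire: if every slot is taken, the visitor is
    # "e-bounced" to a holding page instead of degrading the shop.
    if not slots.acquire(blocking=False):
        return "holding page"
    try:
        return serve_shopping_session(session)
    finally:
        slots.release()  # one shopper out, one more may come in
```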

Meanwhile, Amazon had put in place smart demand management. The deals were open for an openly communicated period of time, customers were informed of the remaining stock level, and a wait list was offered once the stock had been claimed.
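As a rough sketch of that demand-shaping idea (all names, times and stock figures are invented): a deal has an advertised window, a visible stock counter, and a wait list once the stock is gone.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TimedDeal:
    opens: datetime
    closes: datetime
    stock: int                            # remaining units, shown to customers
    waitlist: list = field(default_factory=list)

    def claim(self, customer, now):
        if not (self.opens <= now < self.closes):
            return "deal closed"
        if self.stock > 0:
            self.stock -= 1
            return "claimed"
        self.waitlist.append(customer)    # queue for cancellations
        return "waitlisted"

deal = TimedDeal(datetime(2014, 11, 28, 9), datetime(2014, 11, 28, 17), stock=2)
print(deal.claim("alice", datetime(2014, 11, 28, 10)))  # claimed
```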


Throwing more tin into the mix or e-bouncing shoppers is not the only answer, and Amazon got it. They developed a demand-control strategy coupled with great marketing. Well done, Amazon.

The question is whether all the orders will be delivered on time. Some retailers have already extended the lead time when you buy online and collect in store.

In my opinion, all these technical glitches are far less terrible than the events in stores. As Barbara Ellen wrote in The Observer, Sunday 30 November 2014: “The Black Friday shopping scrums are so shaming”.
