Most people are questioning BA's version of how its entire information system went down on 27 May 2017, stranding 75,000 passengers for up to 48 hours at an estimated cost of up to £80m.
British Airways states that human error, an engineer disconnecting a power supply cable, triggered the major outage through a power surge.
The question is how such an outage lasted so long. The term "power surge" is misleading, because most people think of power in terms of electricity rather than the information ecosystem.
In terms of service outage communication, the challenge is to inform without revealing embarrassing facts, to tell part of the truth without lying. In this instance, I must admit that BA is doing a good job.
My theory is that BA's systems crashed as a result of the power outage, but BA's team did not restart the entire ecosystem in sequence. My assumption is that the systems were all restarted simultaneously, causing what they have called the "power surge". The question is whether BA had a datacentre restart runbook at all and, if the documentation existed, whether it was ever tested.
Complex ecosystems require key infrastructure components to be restarted in a pre-established sequence: core storage first, then the database caching infrastructure, followed by the database systems. This is even more true of architectures based on microservices.
In other words, backend systems should be restarted first, followed by the frontends. If you do not follow a pre-established sequence, the different components of the ecosystem resume their operations in random order, start "talking" to each other and expect answers. An unsynchronised datacentre restart like this is likely to end in data corruption. Furthermore, because the front-end caching infrastructure is not warm, the backend will crash under the load, preventing services from reopening.
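To make the idea concrete, here is a minimal sketch of what a sequenced restart script could look like. This is not BA's runbook; the tier names, hosts and health-check URLs are purely illustrative assumptions, and the actual service start-up is left abstract.

```python
# Illustrative sketch of a sequenced datacentre restart runbook.
# Tier names, hosts and health-check endpoints are hypothetical examples,
# not a description of BA's actual infrastructure.

import time
import urllib.request

# Tiers listed in the order they must come back: storage first,
# then caches, then databases, then backends, and frontends last.
RESTART_SEQUENCE = [
    ("core-storage", ["http://storage-01/health", "http://storage-02/health"]),
    ("db-cache",     ["http://cache-01/health"]),
    ("databases",    ["http://db-01/health", "http://db-02/health"]),
    ("backend-apps", ["http://app-01/health"]),
    ("frontends",    ["http://web-01/health"]),
]

def wait_until_healthy(url: str, timeout: int = 600, interval: int = 10) -> None:
    """Poll a health endpoint until it answers 200 OK or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # service not up yet, keep polling
        time.sleep(interval)
    raise RuntimeError(f"{url} did not become healthy within {timeout}s")

def restart_datacentre() -> None:
    for tier, health_urls in RESTART_SEQUENCE:
        print(f"Starting tier: {tier}")
        # The actual start of services for this tier (power-on, service start)
        # is deliberately left abstract here.
        for url in health_urls:
            wait_until_healthy(url)
        print(f"Tier {tier} is healthy, moving on")
    # Only once the backends are stable should caches be warmed and the
    # frontends opened to real traffic, to avoid a thundering herd.

if __name__ == "__main__":
    restart_datacentre()
```

The point of the sketch is simply that each tier is verified healthy before the next one is started, which is exactly what a simultaneous restart skips.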
If this scenario happened at BA, the databases storing flight reservations, flight plans and customer details were corrupted to the point where it became impossible to resume operations from the second datacentre, which was by then also partially corrupted as a result of the active-active synchronisation between the two sites.
British Airways then had no option other than to restore backups, replay the transaction logs of the unsynchronised systems, and only then resume synchronisation with the second datacentre.
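For illustration only, here is a rough sketch of that recovery order. Every function is a hypothetical stand-in for real backup, log-shipping and replication tooling, and the recovery point is an assumed timestamp, not a known fact about BA's outage.

```python
# Illustrative sketch of the restore-then-replay recovery order described above.
# All functions are hypothetical stand-ins that only print the step they represent.

from datetime import datetime, timezone

# Assumed point in time just before the outage, up to which logs are replayed.
RECOVERY_POINT = datetime(2017, 5, 27, 9, 30, tzinfo=timezone.utc)

def restore_latest_backup(db: str) -> None:
    print(f"[{db}] restoring last known-good backup")

def replay_transaction_logs(db: str, until: datetime) -> None:
    print(f"[{db}] replaying transaction logs up to {until.isoformat()}")

def verify_consistency(db: str) -> None:
    print(f"[{db}] running consistency checks")

def set_replication(enabled: bool) -> None:
    state = "resuming" if enabled else "suspending"
    print(f"{state} active-active replication with the second datacentre")

def recover_site(databases: list[str]) -> None:
    # Keep replication suspended so corrupted data is not copied across again.
    set_replication(False)
    for db in databases:
        restore_latest_backup(db)                     # 1. restore the last clean backup
        replay_transaction_logs(db, RECOVERY_POINT)   # 2. roll forward from the logs
        verify_consistency(db)                        # 3. validate before exposing the data
    # Only once every database is consistent is synchronisation resumed.
    set_replication(True)

if __name__ == "__main__":
    recover_site(["reservations", "flight_plans", "customers"])
```

The ordering is the whole story: replication stays off until every database has been restored, rolled forward and checked, which is slow, and would explain an outage measured in days rather than hours.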
Obviously, this is a much more difficult reality to explain, but I have talked to several IT experts and no one, absolutely nobody, is buying the power surge story.
I am looking forward to hearing the findings of the internal investigation that BA's chief executive has already launched.