Ed Boris

~ Expert in digital transformation

Tag Archives: Incident Management

British Airways Technology Chaos: really caused by a power surge?

Sunday 18 June 2017

Posted by Edouard Boris in Business Continuity, Cloud, Innovation, SAAS


Tags

Capacity Management, Incident Management, Service Management

Most people are questioning BA’s version of how their entire information system went down on 27 May 2017, impacting 75,000 passengers for up to 48 hours and costing up to £80m.

British Airways states that a human error, an engineer disconnecting a power supply cable, triggered the major outage through a power surge.
The question is how such an outage lasted so long. The term “power surge” is misleading, because most people will think of power in terms of electricity rather than the information ecosystem.
In terms of service outage communication, the challenge is to inform without revealing embarrassing facts: to tell part of the truth without lying. In this instance, I must admit that BA is doing a good job.
My theory is that BA’s systems crashed as a result of the power outage, but BA’s team did not restart the entire ecosystem in sequence. My assumption is that BA’s systems were all restarted simultaneously, causing what they have called the “power surge”. The question is whether BA had a datacentre restart runbook and, if the required documentation existed, whether it was ever tested.
Complex ecosystems require key infrastructure components to be restarted in a pre-established sequence: for example, core storage first, then the database caching infrastructure, followed by the database systems. This is even more true with architectures based on microservices.
In other words, backend systems should be restarted first, followed by frontends. If you do not follow a pre-established sequence, the different components of the ecosystem will randomly resume their operations, start “talking” to each other and expect answers. When an unsynchronised datacentre restart is performed, it is likely to end in data corruption. Furthermore, as the front-end caching infrastructure is not warm, the backend will crash under the load, preventing the reopening of services.
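To make that concrete, here is a minimal sketch of what such a restart runbook could look like. The tier names and the start_tier() and is_healthy() helpers are hypothetical placeholders, not BA’s actual tooling; the point is simply that each tier is started and verified before the next one is allowed to come up.

```python
import time

# Backend tiers first, frontends last; each tier must be healthy
# before the next one is started.
RESTART_SEQUENCE = [
    "core-storage",
    "database-caching",
    "database",
    "application-backend",
    "frontend-caching",
    "frontend-web",
]

def start_tier(tier: str) -> None:
    # Placeholder: in reality this would call the orchestration tooling.
    print(f"starting {tier}")

def is_healthy(tier: str) -> bool:
    # Placeholder: in reality this would probe the tier's health checks.
    return True

def restart_datacentre(timeout_s: int = 600) -> None:
    for tier in RESTART_SEQUENCE:
        start_tier(tier)
        deadline = time.time() + timeout_s
        # Wait until the tier reports healthy, so downstream components
        # never come up against a cold or absent dependency.
        while not is_healthy(tier):
            if time.time() > deadline:
                raise RuntimeError(f"{tier} did not become healthy; stop and investigate")
            time.sleep(5)

restart_datacentre()
```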
If this scenario happened at BA, the databases storing flight reservations, flight plans and customer details got corrupted to the point where it became impossible to resume operations from the second datacentre, which was by then also partially corrupted as a result of the active-active synchronisation between the two datacentres.

British Airways then had no option other than to restore backups, replay the system logs of the unsynchronised systems, and only then resume synchronisation with the second datacentre.
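For readers less familiar with this kind of recovery, the ordering matters: restore, roll forward from the logs, verify, and only then re-enable replication. A minimal sketch, with hypothetical helper names rather than BA’s tooling:

```python
def restore_last_backup(db: str) -> None:
    # Placeholder: bring the data back to the last known-good backup.
    print(f"restoring backup for {db}")

def replay_transaction_logs(db: str) -> None:
    # Placeholder: roll forward the transactions recorded since that backup.
    print(f"replaying logs for {db}")

def verify_consistency(db: str) -> None:
    # Placeholder: integrity checks before exposing the data again.
    print(f"verifying {db}")

def resume_replication(db: str) -> None:
    # Placeholder: only now re-enable active-active sync with the second site.
    print(f"resuming replication for {db}")

def recover_database(db: str) -> None:
    restore_last_backup(db)
    replay_transaction_logs(db)
    verify_consistency(db)
    resume_replication(db)

recover_database("reservations")
```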

Obviously, this is a much more difficult reality to explain, but I have talked to several IT experts and no one, absolutely nobody, is buying the power surge story.
I’m looking forward to hearing from the internal investigation that BA’s chief executive has already launched.

Airport chaos – The inquiry terms explained

Tuesday 6 January 2015

Posted by Edouard Boris in Business Continuity, Digital Transformation, Innovation


Tags

Capacity Management, Incident Management, post mortem, service delivery, Service Management

This post is the second part of the airport chaos series.

The inquiry terms explained.

The terms are listed as five bullet points that require some clarification from a service management perspective:

1. Root cause of the incident:

Basically, getting to the bottom of what happened and, in particular, determining the context that created the conditions which eventually led to the service interruption. A good analogy would be the analysis performed by firefighters after a blaze. Let’s say several people were killed by the smoke (the cause of death); the root cause of the fire, however, could have been a short circuit in a multi-socket adaptor caused, for example, by human error (too many power-hungry appliances plugged into the same socket).

Special attention will need to be given to recent changes (change management, including change control) that may have occurred in roughly the 30 days prior to the incident.

This process is often called a ‘post mortem’, although I understand that in this industry the term wasn’t used.
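To make that 30-day change review concrete, here is a minimal sketch of filtering a change log down to the window of interest. The record layout, timestamps and entries are hypothetical, not any real change-management tool’s schema:

```python
from datetime import datetime, timedelta

def changes_in_window(changes, incident_time, days=30):
    """Return the changes applied in the `days` leading up to the incident."""
    window_start = incident_time - timedelta(days=days)
    return [c for c in changes if window_start <= c["applied_at"] <= incident_time]

# Hypothetical incident timestamp and change records, for illustration only.
incident_time = datetime(2014, 12, 12, 15, 30)
change_log = [
    {"id": "CHG-101", "applied_at": datetime(2014, 11, 2, 9, 0), "summary": "Routing table update"},
    {"id": "CHG-117", "applied_at": datetime(2014, 12, 10, 22, 0), "summary": "Capacity parameter change"},
]

# Only the change inside the 30-day window is printed.
for change in changes_in_window(change_log, incident_time):
    print(change["id"], change["applied_at"].date(), change["summary"])
```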

2. Incident management:

The CAA will need to look at how the incident was handled. From an IT management perspective, this should cover incident occurrence, system monitoring (incident detection and incident reporting), service stabilisation, the service all-clear, and finally incident communication throughout the service restoration. It should also cover whether any of these phases could have been handled more quickly.
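One way to support that review is to lay the phases out as a timeline and measure how long each transition took. A minimal sketch with hypothetical phase timestamps:

```python
from datetime import datetime

# Hypothetical timeline of the phases listed above.
timeline = {
    "incident_occurrence": datetime(2014, 12, 12, 14, 44),
    "incident_detection": datetime(2014, 12, 12, 14, 46),
    "incident_reporting": datetime(2014, 12, 12, 14, 55),
    "service_stable": datetime(2014, 12, 12, 16, 10),
    "service_all_clear": datetime(2014, 12, 12, 19, 0),
}

# Print the duration of each phase transition in minutes.
events = sorted(timeline.items(), key=lambda item: item[1])
for (phase, start), (next_phase, end) in zip(events, events[1:]):
    minutes = (end - start).total_seconds() / 60
    print(f"{phase} -> {next_phase}: {minutes:.0f} min")
```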

3. Problem management (part 1), i.e. lessons learned from the previous incident (December 2013): in particular, whether all the action items listed as part of the previous incident analysis were properly closed, and whether any of the root causes of the 2013 incident played a role in the 2014 one.

4. Business continuity plan & capacity planning: “A review of the levels of resilience and service that should be expected across the air traffic network taking into account relevant international benchmarks”.

This is interesting given the communication made by NATS that the “back-up plans and procedures worked on Friday exactly as they were designed to”. 

5. Problem management (part 2), “Further measures to avoid technology or process failures in this critical national infrastructure”:

Basically putting together a plan to resolve (as opposed to mitigate) all underlying causes and root causes which led to the chaos. Note that this plan will be reviewed during the next incident investigation (see Problem Management part 1).

In addition to completing the problem management work, NATS have requested that the CAA put together a plan addressing the new requirements listed under the upcoming business continuity item, “Further measures to reduce the impact of any unavoidable disruption”.

In this case, they will be considering not only the actions required to prevent a similar incident from occurring again, but also the actions required to address the other risks identified during the root cause analysis. Furthermore, they will put together a plan to increase resilience and business continuity (see capacity management) during major outages, taking the Recovery Point Objective (RPO) into account.
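As a quick illustration of the Recovery Point Objective, the check is simply whether the gap between the last good backup (or replica checkpoint) and the start of the outage stays within the agreed target. The figures below are hypothetical, not NATS data:

```python
from datetime import datetime, timedelta

rpo_target = timedelta(minutes=15)                  # maximum tolerable data loss
last_good_checkpoint = datetime(2014, 12, 12, 14, 30)
outage_start = datetime(2014, 12, 12, 14, 44)

potential_data_loss = outage_start - last_good_checkpoint
print(f"Potential data loss: {potential_data_loss}, RPO target: {rpo_target}")
print("RPO met" if potential_data_loss <= rpo_target else "RPO breached")
```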

The CAA expects to publish the report by 29 March 2015.

How Doctor House can help your team improve critical thinking and problem-solving competencies

Wednesday 14 May 2014

Posted by Edouard Boris in Business Continuity


Tags

Incident Management

As I wrote in a previous post, I have noticed during my career that the engineers who make the biggest contribution aren’t necessarily the ones with the strongest software, architecture or mathematical skills. They need to develop a new mindset: critical thinking and problem-solving competencies.

Do not confuse this with ‘continuous improvement’, initially developed back in 1880 and also known as the suggestion system: http://www.answers.com/topic/employee-suggestion-systems

Toyota still uses it, but this is not a problem-solving approach. The suggestion system requires an individual to make suggestions on whatever improvement they have an idea about, whereas problem-solving requires a team.

Now, do you remember Doctor House and his team writing down the patient’s previous conditions and symptoms (known previous problems, known previous changes)? Do you remember House’s whiteboard?

If you have ever managed large, complex real-time information systems, you know how difficult it is to determine the actual root cause of a problem. You know how frustrating it can be to confuse a symptom with a cause, and how difficult it is to determine with certainty the sequence of root causes and contributing causes that eventually triggered the problem.

How does the problem-solving requirement translate into real life? One option is to follow the principle of the differential diagnostic procedure (DDP) for incident management.

Put together a DDP team composed of experts in different technical and non-technical fields, and set up the right DDP structure and culture, in particular a no-blame policy to ensure all opinions can be expressed (do not operate like House!). Over time, the DDP principles and discipline will increase the troubleshooting and analytical skills of the team; it will become a mindset. Ultimately, DDP will reduce the MTTR, i.e. the Mean Time To Restore/Resolve.
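For clarity, MTTR here is simply the average time from detection to restoration across incidents. A small illustration with hypothetical timestamps:

```python
from datetime import datetime

# Hypothetical (detected, restored) pairs for three incidents.
incidents = [
    (datetime(2014, 5, 1, 9, 0), datetime(2014, 5, 1, 10, 30)),
    (datetime(2014, 5, 7, 22, 15), datetime(2014, 5, 8, 0, 45)),
    (datetime(2014, 5, 20, 13, 5), datetime(2014, 5, 20, 13, 50)),
]

# Mean restoration time in minutes across the incident set.
durations = [(restored - detected).total_seconds() / 60 for detected, restored in incidents]
mttr_minutes = sum(durations) / len(durations)
print(f"MTTR: {mttr_minutes:.0f} minutes")
```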

DDP team members will vary from phase to phase, but could include Application Development, the Operations Centre, the Performance team, Infrastructure, and the business function (e.g. the customer-facing team).

A typical script used during a DDP session has four steps:

Step 1: Information gathering. The chairperson will gather information on the impacted service, such as the time of occurrence, the symptoms of the issue, logs, and the list of recent changes.

Step 2: Candidate conditions (listing the potential root causes).

Step 3: Sequencing. The candidate conditions are sorted from most to least likely. Specific analysis is requested for the most likely causes in order to rule each candidate condition in or out.

Step 4: Fixing. The confirmed candidate condition should be mitigated, and proper monitoring implemented to measure the effectiveness of the solution.

At each step, the analysis, options, solutions, and validation should be discussed by the DDP team so that assumptions and solutions are cross-challenged and validated.
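To show how lightweight the structure can be, here is a minimal sketch of the four steps as data and a simple loop. The candidate conditions, likelihood scores and symptoms are invented for illustration; in practice the ruling in or out is the team’s analysis, not code:

```python
from dataclasses import dataclass, field

@dataclass
class CandidateCondition:
    description: str
    likelihood: float            # initial estimate agreed by the DDP team (0..1)
    ruled_out: bool = False
    evidence: list = field(default_factory=list)

# Step 1: information gathering (symptoms, timings, logs, recent changes).
facts = {
    "symptom": "checkout latency above 5 seconds",
    "started_at": "02:14 UTC",
    "recent_changes": ["cache cluster resize", "payment gateway update"],
}

# Step 2: candidate conditions (potential root causes).
candidates = [
    CandidateCondition("cache cluster resize left nodes cold", 0.6),
    CandidateCondition("payment gateway update added latency", 0.3),
    CandidateCondition("database storage saturation", 0.1),
]

# Step 3: sequencing - investigate the most likely causes first, and
# attach the targeted analysis that will rule each one in or out.
for candidate in sorted(candidates, key=lambda c: c.likelihood, reverse=True):
    candidate.evidence.append(f"targeted analysis against: {facts['symptom']}")

# Step 4: fixing - mitigate the conditions still ruled in, then monitor.
for candidate in (c for c in candidates if not c.ruled_out):
    print(f"Mitigate: {candidate.description}; add monitoring to verify the fix")
```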

IT under pressure: McKinsey Global Survey results

Thursday 27 March 2014

Posted by Edouard Boris in Talent Management


Tags

Business Continuity, Incident Management

Read the McKinsey report.

I’m interested in your views – Please contribute.

Mine is that the report is spot on in a number of areas, but does not develop any recommendations.

I have observed that technical and analytical skills are different and, unfortunately, with the increasing complexity of information systems, IT staff need to connect the dots and ‘read the matrix’; otherwise they cannot address the most complex situations, because they remain in their own expertise silos and miss the big picture. Analytical training, use-case scenarios, cross-functional training, workshops, post mortem analysis and coaching are the keys. It takes time and energy, but it certainly contributes to reducing the MTTR.

In addition, the report does point out the lack of a clear career path. In the past I have designed separate management and ‘individual contributor’ career paths, formalizing the expectations and, more importantly, ensuring that an engineer who becomes an expert is recognized, compensated and valued at comparable levels both within the company and on the external market.

Why Twitter Can’t Keep Crashing | Gadget Lab | Wired.com

Thursday 13 March 2014

Posted by Edouard Boris in Social


Tags

Incident Management


Well, I’m not sure that I should tweet this one 😉 Joking aside, I would really be interested in seeing the post mortem report.

Wired.com: “Twitter was down for quite some time yesterday. Once upon a time, that was an annoyance. But not anymore. Now, when Twitter goes down, it’s a full-on problem. Twitter is no longer simply a place where people come to make jokes and…”
