Ed Boris

~ Expert in digital transformation

Monthly Archives: January 2015

Facebook down: “It’s us! no! our engineers caused it”!

27 Tuesday Jan 2015

Posted by Edouard Boris in Business Continuity, New Trends, Open Compute Project

Tags

agile, Business Continuity, change

You probably know by now that Facebook went down this morning for 50 minutes.

Some hackers claimed to have caused it, whereas FB itself reported that the outage followed a change it had introduced in its configuration management system.

As I wrote previously, and tested with so many interviewees:

“What is the first cause of incidents in the industry?”

Forget people, software, hardware, your grandmother: the first cause of incidents is CHANGE. I’m sure you have heard the idiom “If it ain’t broke, don’t fix it”.

So FB introduced a change in their configuration management system, which triggered an outage for around a billion people. You can work it out: 1 billion x 50 minutes = time given back to people to actually socialize with humans they could see! Great news.
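As a back-of-the-envelope check of that multiplication, here is a tiny sketch. The round figure of 1 billion affected users and the 365-day year are simplifying assumptions, not reported numbers.

```python
# Rough person-time "given back" by the outage.
# Assumptions: 1 billion affected users, 50 minutes each, 365-day years.
affected_users = 1_000_000_000
outage_minutes = 50

person_minutes = affected_users * outage_minutes
person_years = person_minutes / (60 * 24 * 365)

print(f"{person_minutes:,} person-minutes")
print(f"about {person_years:,.0f} person-years")
```

Under those assumptions the total is 50 billion person-minutes, or roughly 95,000 person-years.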

In recent years FB have pushed an initiative called “Facebook’s Open Compute Project”, designed to drive standardization and automation right through the datacenter.

It is very surprising that, despite the resiliency and multiple datacenters, one single change was able to take down such a service for 50 minutes.

I successfully changed the battery of my iPhone 4s but I’m angry.

26 Monday Jan 2015

Posted by Edouard Boris in Agile

Tags

Architecture Design, Planned Obsolescence, Software Design

So why the anger?

Everything went pretty smoothly, i.e. I did not break anything. I ordered a kit online for £15 that included a battery, several screwdrivers, some plastic tools and more.

Opening the iPhone was easy; however, taking the battery out was a challenge as it was glued to the bottom of the iPhone. There is a real risk of breaking electronic components as they are minuscule, and nothing is designed to be replaced by the user.

Changing the battery took maybe 15 minutes and some cold sweat. However, it also required a factory reset of the device and a backup restore, which in total took about 2 hours.

A friend of mine told me that it is “impressive” that I managed to change the battery of the iPhone.

How have we come to a point where changing a battery is an impressive job?

Do you also get a round of applause when you replace a light bulb at home? Or when you last replaced the battery of your digital camera?

After changing a light bulb, does it take you 2 hours of your day and a complete reset of your home electricity?

The battery of my iPhone lasted about 3 hours; most people would have purchased a new phone. This is why it is all wrong.

I remember my Ericsson GH337: changing the battery was a piece of cake, and that was 20 years ago!

Planned obsolescence!

Too many devices have a scheduled end of life, such as “when the printer decides, somewhat arbitrarily, that the pads are worn out, that puts the whole device out of commission”, which is what happened to my Epson printer.

Smartphones are great devices, and they are expensive, very expensive. As consumers, we have a voice to be heard.

Christmas period was a nightmare for M&S general merchandise.

09 Friday Jan 2015

Posted by Edouard Boris in Black Friday 2014

Tags

Black Friday, Business Continuity, Business Transformation, Capacity Management, Retail

M&S have published their quarter 3 2014/2015 management statement, and the news isn’t good.

Marc Bolland, Chief Executive, said: “We had a difficult quarter in General Merchandise, dominated by unseasonal conditions and an unsatisfactory performance in our e-commerce distribution centre.”

This is the second year in a row that M&S is blaming the weather conditions. (see Q3 2013/2014 press release).

It is no surprise that Mr Bolland named their distribution centre as a cause of disruption, as the CEO of John Lewis alluded to it on Monday.

Blaming the e-commerce distribution centre is not a solution in itself. I recommend that M&S review their entire ecosystem and capacity planning.

Should the retailers adapt their online stock levels to their actual capacity to deliver on time and on quality? Or the other way around, should they adapt their delivery capacity to meet the demand? In other words, does it make sense to sell 100 items when they know that they can deliver only 50 on time and on quality?
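The first option (capping what is sold online at what can actually be delivered) can be sketched in a few lines of Python. This is only an illustration: the function name and the numbers are made up for the example, and a real retailer would work per day, per carrier and per product category.

```python
def sellable_quantity(stock_on_hand: int, on_time_delivery_capacity: int) -> int:
    """Cap what the site offers at what can be delivered on time."""
    if stock_on_hand < 0 or on_time_delivery_capacity < 0:
        raise ValueError("quantities must be non-negative")
    return min(stock_on_hand, on_time_delivery_capacity)


stock = 100     # items in the warehouse (hypothetical)
capacity = 50   # items that can be delivered on time and on quality (hypothetical)

offered = sellable_quantity(stock, capacity)
held_back = stock - offered  # stock not offered until capacity catches up

print(f"offer {offered} items online, hold back {held_back}")
```

The second option is the mirror image: keep the 100 items on sale and treat the gap of 50 as the extra delivery capacity to be secured before the peak.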

More information in my post “Taking orders is great but how about delivering on time and on quality?”

Xmas John Lewis’ sales tumbled and Black Friday is to be blamed!

07 Wednesday Jan 2015

Posted by Edouard Boris in Digital Transformation, Retail

Tags

Black Friday, Business Continuity, Capacity Management

We are now starting to get more data on the real impact of the Black Friday retail frenzy.

On 10 December 2014, I wrote an article about the delivery issues that some of the largest retailers faced as a result of the Black Friday craziness. I also questioned the real impact Black Friday would have on the bottom line over the entire peak trading period.

John Lewis confirmed to the BBC that Black Friday was “more challenging profitability-wise”.

According to the Guardian, John Lewis saw sales fall back 1.4% in Christmas week as sales of electrical goods were pulled forward by the Black Friday promotional weekend.

Furthermore, John Lewis said to the BBC that “online purchases were behind a rise in Christmas sales, despite shop purchases falling over the festive period and like-for-like sales were up 4.8% in the five weeks to December 27 – as store sales dropped around 1%”.

John Lewis boss added “Black Friday is a blessing in the sky as we can achieve record sales online and that our customers can have confidence in the delivery”.

Andy Street made a direct comment about his competitors who weren’t able to deliver as a result of the massive surge in sales:

“it is quite challenging for the rest of the industry as it is pushing all this trading in one day”. The proportion of sales taken online in a single day, compared with the entire festive period, is a new thing in the UK, as is the “dependence on fulfillment”.

Mr Street does not think that retailers “can put the genie back in the bottle”; however, he hopes that the next Black Friday won’t be bigger.

Street declared in the Guardian that he believed “John Lewis had outperformed rivals because of its investment in IT and delivery facilities, which meant it was able to meet online orders without any hitches during peak periods, including Black Friday”.

Well done to their Online team, analysts and architects, who were able to get the investments and capacity solutions correctly sized and approved. This is pretty impressive given that single-day web visits were up 300% YoY.

It is fair to say that the Black Friday ‘effect’ is not as productive as retailers expected. Basically, it sounds as though they are selling similar volumes but at a discounted price, hence hurting their profitability.

We are now waiting for Christmas sales data from John Lewis’s competitors.

Airport chaos – The inquiry terms explained

06 Tuesday Jan 2015

Posted by Edouard Boris in Business Continuity, Digital Transformation, Innovation

Tags

Capacity Management, Incident Management, post mortem, service delivery, Service Management

This post is the second part of the airport chaos series.

The inquiry terms explained.

The terms are listed in 5 bullet points, which require some clarification from a Service Management perspective:

1. Root cause of the incident:

Basically getting to the bottom of what happened, and in particular determining the context which created the conditions that eventually led to the service interruption. A good analogy would be the analysis performed by firefighters after a blaze. Let’s say that several people were killed by the smoke (cause of death); however, the root cause of the fire could have been a short-circuit in a multi-socket adaptor caused, for example, by human error (too many power-hungry appliances plugged into the same socket).

Special attention will need to be given to recent changes (change management including change control) which may have occurred over the last circa 30 days prior to the incident.

This process is often called a ‘Post Mortem’; I understand that in this type of industry the term is not used.

2. Incident management:

The CAA will need to look at how the incident was handled. From an IT management perspective, it should cover Incident occurrence, System monitoring (Incident detection & Incident reporting), Service stable, service All Clear, and finally Incident communication throughout service restoration. It should also cover whether any of the different phases could have been handled quicker.

3. Problem management (Part 1):

Lessons learned from the previous incident (December 2013), and in particular whether all the action items listed as part of the previous incident analysis were properly closed and whether any of the root cause(s) of the 2013 incident played any role in the 2014 one.

4. Business continuity plan & capacity planning: “A review of the levels of resilience and service that should be expected across the air traffic network taking into account relevant international benchmarks”.

This is interesting given the communication made by NATS that the “back-up plans and procedures worked on Friday exactly as they were designed to”. 

5. Problem management (Part 2) “Further measures to avoid technology or process failures in this critical national infrastructure”:

Basically putting together a plan to resolve (as opposed to mitigate) all underlying causes and root causes which led to the chaos. Note that this plan will be reviewed during the next incident investigation (see Problem Management part 1).

In addition to completing the problem management, NATS have requested that CAA put together a plan addressing the new requirements listed in the upcoming Business Continuity plan “Further measures to reduce the impact of any unavoidable disruption”.

In this case, they will be considering not only the actions required to prevent a similar incident from occurring again but also the actions required to address the other risks they will identify during the root cause analysis. Furthermore, they will put together a plan to increase resiliency and business continuity (see capacity management) during major outages, considering the Recovery Point Objective.
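The review of recent changes mentioned under point 1 (roughly the 30 days before the incident) is easy to picture as a filter over a change log. The sketch below is purely illustrative: the record layout, the change IDs and the dates of the changes are hypothetical, and only the incident date (12 December 2014) comes from the inquiry.

```python
from datetime import date, timedelta


def changes_to_review(changes, incident_day, window_days=30):
    """Return changes deployed in the window before the incident, newest first."""
    start = incident_day - timedelta(days=window_days)
    recent = [c for c in changes if start <= c["deployed"] <= incident_day]
    return sorted(recent, key=lambda c: c["deployed"], reverse=True)


incident = date(2014, 12, 12)
change_log = [  # hypothetical change records
    {"id": "CHG-001", "deployed": date(2014, 10, 30)},
    {"id": "CHG-002", "deployed": date(2014, 11, 20)},
    {"id": "CHG-003", "deployed": date(2014, 12, 11)},
]

for change in changes_to_review(change_log, incident):
    print(change["id"], change["deployed"])
```

Listing the newest changes first reflects the usual starting point of a root cause analysis: the closer a change is to the incident, the earlier it gets examined.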

The CAA is expecting to publish the report by 29 March 2015.

Airports chaos: Why the service impact lasted 16,000 minutes rather than 45 minutes as initially reported.

05 Monday Jan 2015

Posted by Edouard Boris in Business Continuity, Cloud, Cyber, Digital Transformation, RightSourcing

Tags

Airport Chaos, Business Transformation, Capacity Management, capacity planning, NATS, post mortem, service delivery, service design, Service Management, service strategy

I’m following up on my last post (read for full details) about the Airport chaos which occurred on the 12th of December 2014 at Swanwick air traffic control centre.

What happened?

NATS, the UK-based global air traffic management company, declared that the system was back up and running 45 minutes after the failure.

An independent inquiry.

On 15 December 2014, NATS declared that the UK Civil Aviation Authority (CAA) will carry out “an independent inquiry following the disruption caused by the failure in air traffic management systems”.

On the NATS website, there is only a mention of the high-level plan of the independent inquiry (I’ll explain it in a new post tomorrow). However, drilling down into the CAA’s website, I was able to find the inquiry terms of reference.

Timelines of events and service impacts.

As provided by the CAA, this is where it gets interesting:

1. Service outage started at approximately 1515 GMT: “the fault in a primary and back-up system led to a failure in the flight data server in the Area Control (AC) Operations Room”.

2. Service restoration started: “Restrictions were gradually lifted from approximately 1605 GMT with a rapid recovery to full capacity by the middle of the Friday evening”.

This is not precise timing; however, the CAA provides more insight into the true service impact.

The CAA confirms that “Delays and cancellations were incurred totalling some 16,000 MINUTES”. The 45 minutes initially reported represented only system downtime, not service impact measured from a business perspective.

“Airlines cancelled around 80 flights: estimated to be 2,000 minutes as a consequence of the restrictions put in place to manage traffic”.

Furthermore, 14,000 minutes were incurred as a “result of the phased recovery to prevent overloads and takes account of ground congestion at the major airports”.

“Overall around 450 aircraft were delayed of the 6,000 handled on the 12 December and the average delay to the 450 flights was approximately 45 minutes”.
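To make the gap between the two measures concrete, the CAA figures quoted above can be added up as follows. The variable names are mine, and the final ratio is only indicative, since it compares cumulative flight-delay minutes with wall-clock system downtime.

```python
# Figures quoted by the CAA for 12 December 2014, in minutes.
cancellation_minutes = 2_000       # around 80 cancelled flights
phased_recovery_minutes = 14_000   # delays during the phased recovery
reported_downtime_minutes = 45     # system downtime initially reported

service_impact_minutes = cancellation_minutes + phased_recovery_minutes
ratio = service_impact_minutes / reported_downtime_minutes

print(f"service impact: {service_impact_minutes:,} minutes")
print(f"roughly {ratio:.0f} times the reported downtime")
```

The two components add up to the 16,000 minutes confirmed by the CAA, which is in the order of 350 times the 45 minutes first communicated.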

The CAA reminds us that a “failure affecting the same operations room at Swanwick on 7 December 2013, which resulted in total delay amounting to 126,000 minutes and which impacted 1,412 flights and the cancellation of 300”.

NATS made a mistake by not communicating the progress made throughout the full service recovery and, eventually, the total impact on the service.

Service Performance.

Remember that no one cares whether your servers are up and running when none, or only some, of your customers can access your IT services, for example SAP, email, document management, the corporate internet site or the ecommerce site. The same applies here: systems were up, but aircraft could not take off and passengers were badly impacted.

Your service performance should always be measured as perceived from your customer’s point of view, not from the perspective of a piece of infrastructure being up.
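A minimal sketch of the difference between the two views, using entirely hypothetical monitoring numbers: the servers answer every health check, yet one customer request in ten fails.

```python
def availability(successes: int, attempts: int) -> float:
    """Share of attempts that succeeded, as a percentage."""
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return 100.0 * successes / attempts


# Hypothetical monitoring data for one day.
server_checks_ok, server_checks = 1_440, 1_440           # one health check per minute
customer_requests_ok, customer_requests = 9_000, 10_000  # end-to-end transactions

infrastructure_view = availability(server_checks_ok, server_checks)
customer_view = availability(customer_requests_ok, customer_requests)

print(f"infrastructure view: {infrastructure_view:.1f}%")
print(f"customer view: {customer_view:.1f}%")
```

With these made-up numbers the infrastructure report shows 100.0% while customers experienced 90.0%, and it is the second figure that belongs in the service report.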

A service very rarely operates in isolation; it operates within an ecosystem made up of your own capabilities and what your suppliers and partners deliver. Did the hotels and restaurants have enough vacancies to welcome the passengers?

Once services have been restored, everyone should be concerned about customers still suffering from the consequences, such as holidays cancelled (and potentially not reimbursed) or, as after Black Friday (see my post), products delivered after Christmas, long after services were actually restored and stable. Will the whole actual cost of the outage at the Swanwick air traffic control centre ever be known? I doubt it.

However, NATS has already announced that “there will be a financial consequence for the company from the delay caused. Under the company’s regulatory performance regime, customers will receive a rebate on charges in the future”.

Capacity management.

The NATS managing director of Operations declared during the incident resolution: “These things are relatively rare. We are a very busy island for air traffic control, so we’re always going to be operating near capacity”.

This is a very concerning statement. Having a service impacted by a shortage of capacity is not uncommon (I’m not saying it is satisfactory) when either capacity requirements aren’t properly expressed by the business or those requirements aren’t adequately translated into an efficient technical design. However, it is the responsibility of the CIO to properly and efficiently document and communicate the risks incurred by a potential shortage of capacity.

Vince Cable declared that the incident had been caused by a lack of IT investment. Well, the question now is whether the investments were submitted and refused. The inquiry will need to determine whether the risks of running “at capacity” were properly communicated to the board.

The CAA is expecting to publish the report by 29 March 2015.
