Ed Boris

~ Expert in digital transformation

Category Archives: Cloud

What most CIOs and CMOs miss when they negotiate their SaaS SLA.

21 Thursday Jan 2021

Posted by Edouard Boris in Business Continuity, Cloud, Data Science, Digital Transformation, SAAS

Tags

Integrations, Retail

When negotiating a SaaS SLA (Software As A Service Service Level Agreement), CIOs and CMOs often fail to consider the integrations between SaaS and on-premise applications, such as ERP, stock management or order fulfilment. The business logic and data required for these integrations are crucial aspects of the SaaS model.

Failing to ensure the integrity of these integrations can lead to several service degradations, including:

  • Consumer dissatisfaction, leading to revenue loss through service credits and lost sales
  • Delayed or incomplete product deliveries
  • Negative publicity on social media
  • Inaccurate stock levels, leading to an inability to fulfil orders
  • Revenue loss due to technical capacity shortages, especially during traffic spikes, where even if the core capabilities can handle the demand, integration failures can negatively impact the consumer experience
  • Time-consuming incident resolution

SaaS integrations are often treated as mere technical tasks rather than being designed as part of the service. This approach often neglects crucial factors such as:

  • Product & Market strategy
  • Service design
  • Pricing strategy
  • SLA
  • Technical architecture

A comprehensive Integration SLA should encompass the following elements:

  • Product strategy
  • Market strategy
  • Detailed SLA for each class of integration, including monitoring – Integrations should be viewed as transactional information
  • Pricing strategy

It is essential to recognise that no service provider operates in isolation. Understanding the operating ecosystem is crucial for designing services and catalogues and structuring the SLA.

Listing integrations with their associated service levels in the SLA helps structure the relationship between the customer and the service provider.
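
As a rough illustration of listing integrations with their service levels, the sketch below models a small integration catalogue in Python. Every name, class and threshold in it (the `IntegrationSla` type, the `stock-sync` and `order-export` entries, the targets) is a hypothetical assumption made for the example, not something taken from a real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationSla:
    """One line of a hypothetical integration catalogue in the SLA."""
    name: str                   # illustrative integration name
    integration_class: str      # illustrative class, e.g. "real-time" or "batch"
    min_success_rate: float     # agreed share of successful transactions (0..1)
    max_latency_seconds: float  # agreed worst-case end-to-end latency


def is_breached(sla, succeeded, attempted, worst_latency_seconds):
    """Return True when the monitored transactions miss the agreed levels."""
    if attempted == 0:
        return False  # nothing was attempted, so there is nothing to measure
    success_rate = succeeded / attempted
    return (success_rate < sla.min_success_rate
            or worst_latency_seconds > sla.max_latency_seconds)


# Assumed example entries; a real catalogue would come out of the negotiation.
CATALOGUE = {
    "stock-sync": IntegrationSla("stock-sync", "real-time", 0.999, 5.0),
    "order-export": IntegrationSla("order-export", "batch", 0.99, 900.0),
}

# Monitoring treats each integration call as a transaction.
print(is_breached(CATALOGUE["stock-sync"], succeeded=9_980, attempted=10_000,
                  worst_latency_seconds=2.1))
```

In this made-up case the stock synchronisation succeeds 99.8% of the time against an assumed 99.9% target, so the check reports a breach even though latency is within its limit.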

British Airways Technology Chaos, really caused by a power surge?

18 Sunday Jun 2017

Posted by Edouard Boris in Business Continuity, Cloud, Innovation, SAAS

Tags

Capacity Management, Incident Management, Service Management

Most people are questioning BA’s version of how its entire information system went down on May 27th 2017, an outage that impacted 75,000 passengers for up to 48 hours and will cost up to £80m.

British Airways states that human error was the cause: an engineer disconnected a power supply cable, which triggered a major outage due to a power surge.
The question is how such an outage could last so long. The term “power surge” is misleading, because most people will think of power in terms of electricity as opposed to the information ecosystem.
In terms of service outage communication, the challenge is to inform without revealing embarrassing facts: to tell part of the truth without lying. In this instance, I must admit that BA is doing a good job.
My theory is that BA’s systems crashed as a result of the power outage, but BA’s team did not restart the entire ecosystem in sequence. My assumption is that BA’s systems were all restarted simultaneously, causing what they have called the “power surge”. The question is whether BA had a datacentre restart runbook and, if it existed, whether it was ever tested.
Complex ecosystems require key infrastructure components to be restarted in a pre-established sequence: for example, the core storage first, then the database caching infrastructure, followed by the database systems. This is even more true of architectures based on microservices.
In other words, backend systems should be restarted first, followed by frontends. If you do not follow a pre-established sequence, the different components of the ecosystem will resume their operations in random order, start “talking” to each other and expect answers. An unsynchronised datacentre restart is likely to end in data corruption. Furthermore, as the front-end caching infrastructure is not warm, the backend will crash under the load, preventing the reopening of services.
If this scenario happened at BA, the databases storing flight reservations, flight plans and customer details were corrupted to the point where it became impossible to resume operations from the second datacentre, which was by then also partially corrupted as a result of the active-active synchronisation between the two datacentres.

British Airways then had no option other than to restore backups, replay the logs of the unsynchronised systems, and only then resume synchronisation with the second datacentre.
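
To make the idea of a sequenced restart concrete, here is a minimal Python sketch that derives restart waves from a dependency map using the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The component names and dependencies are assumptions made up for illustration; nothing here reflects BA's actual architecture or runbook.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
DEPENDENCIES = {
    "core-storage": set(),
    "database-cache": {"core-storage"},
    "databases": {"database-cache"},
    "backend-services": {"databases"},
    "frontend-cache": {"backend-services"},
    "frontends": {"frontend-cache"},
}


def restart_waves(dependencies):
    """Group components into waves; a wave starts only after the previous one is up."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the map contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are up
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as restarted and healthy
    return waves


for number, wave in enumerate(restart_waves(DEPENDENCIES), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A real runbook would also wait for health checks and cache warm-up between waves; the sketch only captures the ordering, with the storage and backends first and the frontends last.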

Obviously, this is a much more difficult reality to explain, but I talked to several IT experts and no one, absolutely nobody, is buying the power surge story.
I’m looking forward to hearing from the internal investigation that BA’s chief executive has already launched.

Get Ready because Black Friday this year is going to be Bloody Friday.

11 Wednesday Nov 2015

Posted by Edouard Boris in Business Continuity, Cloud, Digital Transformation, SAAS, SmartSourcing

Tags

Black Friday, Capacity Management, Retail, saas

All the major retailers in the UK are prepared and are announcing their Black Friday super productions:

The Award for the Best Comedy goes to Tesco!

Last year I wrote a post on “Taking orders is great but how about delivering on time and on quality?”. It is pretty hilarious but Tesco have already given up.

They announce that “Due to unexpected high demand all deliveries will take 5-7 days” and “Express delivery is currently unavailable”.

No, you are not dreaming: today, 11th November, 16 days before Black Friday, Tesco are informing us that “Due to unexpected demand” they can’t deliver on time. Tesco is officially inventing a new concept: The Unexpected Expected High Demand, LoL. For a company breathing Customer Satisfaction, this is very interesting. Basically, Tesco has made the decision not to invest in sufficient capacity in the front and back office. They will be, knowingly, selling beyond the capacity of their value chain.

The award for the Best Customer engagement goes to Argos.co.uk.

They are offering consumers the option to register in order to get “quicker access to our biggest deals and faster in-store collection from our Fast Track counter in-store when you buy online. Plus we’ll hold your item for 7 days, so you can pick it up when convenient.” Very good, Argos. “Quicker access” means that you’ll get a link to the page and a VIP pass to the site when it is blocked because it is too busy. That’s the e-version of the stamp on the back of your hand giving you access to the night club. You remember? The queue outside, you have your stamp and you get in and out as many times as you want. Argos will make you a VIP. Well done Argos; however, it would be better to get your capacity planning right so that you don’t need to implement it. Argos have already given up on delivery though.

Remember that last year the carriers complained that the retailers did not inform them of planned demand. It will be interesting to see whether the carriers and the retailers blame each other again this year.

The award for the best “we are mastering it” goes to Amazon and John Lewis.

The two retailers are simply informing their consumers of the dates of Black Friday and also that they already have ongoing promotions. Last year, both retailers delivered both on performance and customer service, with an advantage to John Lewis because they pay all their taxes in the UK and because it is good for the UK economy. Amazon is knowingly applying a tax avoidance strategy, even though things should get better.

Stay tuned…

Airports chaos: Why the service impact lasted 16,000 minutes rather than 45 minutes as initially reported.

05 Monday Jan 2015

Posted by Edouard Boris in Business Continuity, Cloud, Cyber, Digital Transformation, RightSourcing

Tags

Airport Chaos, Business Transformation, Capacity Management, capacity planning, NATS, post mortem, service delivery, service design, Service Management, service strategy

I’m following up on my last post (read it for full details) about the airport chaos which occurred on the 12th of December 2014 at Swanwick air traffic control centre.

What happened?

NATS, the UK-based global air traffic management company, declared that its system was back up and running 45 minutes after the failure.

An independent inquiry.

On 15.12.14, NATS declared that the UK Civil Aviation Authority (CAA) would carry out “an independent inquiry following the disruption caused by the failure in air traffic management systems”.

On the NATS web site, there is only a mention of the high-level plan of the independent inquiry (I’ll explain it in a new post tomorrow). However, drilling down into the CAA’s web site, I was able to find the inquiry terms of reference.

Timelines of events and service impacts.

As provided by the CAA, this is where it gets interesting:

1. The service outage started at approximately 1515 GMT, when “the fault in a primary and back-up system led to a failure in the flight data server in the Area Control (AC) Operations Room”.

2. Service restoration started: “Restrictions were gradually lifted from approximately 1605 GMT with a rapid recovery to full capacity by the middle of the Friday evening”.

This is not a precise timing; however, the CAA provides more insight into the true service impact.

The CAA confirms that “Delays and cancellations were incurred totalling some 16,000 MINUTES”. The 45 minutes initially reported represented only system downtime, not service impact measured from a business perspective.

“Airlines cancelled around 80 flights: estimated to be 2,000 minutes as a consequence of the restrictions put in place to manage traffic”.

Furthermore, 14,000 minutes were incurred as a “result of the phased recovery to prevent overloads and takes account of ground congestion at the major airports”.

“Overall around 450 aircraft were delayed of the 6,000 handled on the 12 December and the average delay to the 450 flights was approximately 45 minutes”.
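
The arithmetic behind these figures can be checked with a few lines of Python. The numbers are the ones quoted from the CAA above; the function and constant names are only illustrative. Note that delay minutes are summed across flights while downtime is a single elapsed duration, so the two are set side by side here rather than treated as like-for-like.

```python
# Figures quoted from the CAA in this post.
SYSTEM_DOWNTIME_MINUTES = 45
CANCELLATION_MINUTES = 2_000
PHASED_RECOVERY_MINUTES = 14_000
AIRCRAFT_DELAYED = 450
AIRCRAFT_HANDLED = 6_000


def total_service_impact(cancellation_minutes, recovery_minutes):
    """Total delay and cancellation minutes, as seen by airlines and passengers."""
    return cancellation_minutes + recovery_minutes


def share_delayed(delayed, handled):
    """Fraction of the day's aircraft that were delayed."""
    return delayed / handled


impact = total_service_impact(CANCELLATION_MINUTES, PHASED_RECOVERY_MINUTES)
print(f"Service impact: {impact:,} minutes")            # 16,000 minutes
print(f"System downtime: {SYSTEM_DOWNTIME_MINUTES} minutes")
print(f"Aircraft delayed: {share_delayed(AIRCRAFT_DELAYED, AIRCRAFT_HANDLED):.1%}")  # 7.5%
```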

The CAA reminds us of a “failure affecting the same operations room at Swanwick on 7 December 2013, which resulted in total delay amounting to 126,000 minutes and which impacted 1,412 flights and the cancellation of 300”.

NATS made a mistake by not communicating on the progress of the full service recovery and, eventually, on the total service impact.

Service Performance.

Remember that no one cares whether your servers are up and running when none, or only some, of your customers can access your IT services, for example SAP, email, document management, the corporate internet site or the ecommerce site. It is the same here: systems were up but aircraft could not take off and passengers were badly impacted.

Your service performance should always be measured as perceived from your customer’s point of view, not from the perspective of a piece of infrastructure being up.

A service very rarely operates in isolation; it operates within an ecosystem made up of your own capabilities and what your suppliers and partners deliver. Did the hotels and restaurants have enough vacancies to welcome the passengers?

Once services have been restored, everyone should be concerned about customers still suffering from the consequences, such as holidays cancelled (and potentially not reimbursed) or, as after Black Friday (see my post), products delivered after Christmas, long after services were actually restored and stable. Will the actual whole cost of the outage at the Swanwick air traffic control centre ever be known? I doubt it.

However, NATS has already announced that “there will be a financial consequence for the company from the delay caused. Under the company’s regulatory performance regime, customers will receive a rebate on charges in the future”.

Capacity management.

The NATS managing director of Operations declared during the incident resolution: “These things are relatively rare. We are a very busy island for air traffic control, so we’re always going to be operating near capacity”.

This is a very concerning statement. Having a service impacted by a shortage of capacity is not uncommon (I’m not saying it is satisfactory), either when capacity requirements aren’t properly expressed by the business or when those requirements aren’t adequately translated into an efficient technical design. However, it is the responsibility of the CIO to properly and efficiently document and communicate the risks incurred by a potential shortage of capacity.

Vince Cable declared that the incident had been caused by a lack of IT investment. Well, the question is now whether the investments were submitted and refused. The inquiry will need to determine whether the risks of running “at capacity” were properly communicated to the board.

The CAA is expecting to publish the report by 29 March 2015.

DROPbox?

22 Wednesday Oct 2014

Posted by Edouard Boris in Cloud, Security

Tags

cloud, security

Terrible publicity for Dropbox. It started with an article on 11/10 in which Edward Snowden said “Get Rid Of Dropbox,” Avoid Facebook And Google.

Today, yet more bad news for Dropbox: “An email with the subject “important” tells recipients that they must sign into Dropbox in order to view a document too big to be sent via regular email, but clicking on the link included in the message brings people to a fake Dropbox login page that is actually hosted on Dropbox.”

Consider using SpiderOak

Is your online retailer or Service provider keeping their Payment Card Industry certification up to date?

11 Friday Jul 2014

Posted by Edouard Boris in Cloud, Payment, Security

Tags

cloud, payment, PCI, saas, security

Several weeks ago, I wrote about PCI certification. A certification is valid for one year and therefore needs to be renewed.

Visa keeps track of the registry @ http://www.visa.com/splisting/searchGrsp.do

According to Visa:

“For service providers published on the Registry, if Visa does not receive the appropriate revalidation documents:

  • Within 1 – 60 days upon expiry of the validation documents, the service provider will be highlighted in Yellow on the Registry.
  • Within 61 – 90 days upon expiry of the validation documents, the service provider will be highlighted in Red on the Registry.
  • After 90 days, the service provider will be removed from the Registry.”
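
The Visa rule quoted above maps naturally onto a small function. The Python sketch below is an illustration only: the function name `registry_status` and the `"listed"` label for a provider whose validation has not yet expired are assumptions for the example, and the code does not query the Visa registry.

```python
from datetime import date


def registry_status(expiry: date, today: date) -> str:
    """Classify a service provider by days elapsed since its validation expired."""
    days_overdue = (today - expiry).days
    if days_overdue <= 0:
        return "listed"   # assumed label: validation documents still current
    if days_overdue <= 60:
        return "yellow"   # 1 to 60 days after expiry
    if days_overdue <= 90:
        return "red"      # 61 to 90 days after expiry
    return "removed"      # more than 90 days after expiry


# Hypothetical provider whose validation expired on 1 June 2014.
print(registry_status(date(2014, 6, 1), date(2014, 7, 11)))
```

With these example dates the provider is 40 days past expiry, so it would be highlighted in yellow.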

Whether you shop online or have outsourced your online payments to an external provider, it might be worth periodically checking the online status of their PCI certification.
