

 

Horizon Platform Outage 14th November 2018

 

Gamma Q&A

 

 

  1. How did this happen?

A bug was discovered on the Horizon platform in the early hours of the morning of the 14th November.  This bug prevented a large proportion of Horizon phones from registering to part of the platform.  Although the bug was resolved relatively quickly (by 10am), the volume of re-registration attempts, coupled with peak call traffic, created instability on the wider platform.  Our engineering teams were engaged with the technology partners that support the various parts of the Horizon platform for the remainder of the day, introducing a series of measures to stabilise the platform.
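
For illustration only, the sketch below shows the general technique of jittered exponential backoff, which spreads re-registration attempts out over time so that a large population of devices does not retry in lockstep.  The function and parameter names are hypothetical and this is not a description of how Horizon handsets actually behave.

```python
# Illustrative sketch only: jittered exponential backoff for device
# re-registration, so a large fleet of phones does not retry in lockstep.
# Function and parameter names are hypothetical, not Horizon behaviour.
import random
import time


def reregister_with_backoff(register, base=2.0, cap=300.0, max_attempts=10):
    """Call register() until it succeeds, backing off with random jitter."""
    for attempt in range(max_attempts):
        if register():
            return True
        # Exponential delay, capped, with full jitter to desynchronise devices.
        delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)
    return False
```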

 

  2. Why was the Horizon platform not resilient?

The Horizon platform is fully resilient across four geographically diverse network data centres.  This includes multiple SBCs (session border controllers), which are the parts of the network that manage device authentication and call traffic, as well as application servers, again with geographic resiliency.  The nature of the bug and its wider impact on the platform meant there was no clean failover scenario that could be invoked to restore stability to the platform.
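
As a generic, hedged illustration of what a "clean failover" normally looks like (not the Horizon platform's actual mechanism), the sketch below has a client work through a geographically ordered list of endpoints and use the first one that answers.  The hostnames and ports are hypothetical.

```python
# Illustrative sketch only: a "clean failover" pattern in which a client works
# through a geographically ordered list of endpoints and uses the first one
# that answers. Hostnames and ports are hypothetical.
import socket

ENDPOINTS = [
    ("sbc-dc1.example.net", 5061),  # primary data centre (hypothetical)
    ("sbc-dc2.example.net", 5061),  # geographically diverse secondary
]


def connect_with_failover(endpoints=ENDPOINTS, timeout=3.0):
    """Return a connection to the first reachable endpoint, else raise."""
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # endpoint unreachable; try the next data centre
    raise ConnectionError("no endpoint reachable")
```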

 

  3. Why was no contingency possible to mitigate this?

Due to the wider impact of the Horizon devices reattempting registration, combined with peak call traffic, the SBCs entered a protective state, which allows them to continue operating at a level that prevents them from becoming overloaded.  The work with our technology partners was focussed on the measures we could take to allow the SBCs to recover fully and to enable handset authentication and successful calls.  The network condition was very complex and not a typical contingency scenario.  However, we now have a much better understanding of the factors that contributed to the instability of the platform and of how our network controls affected its behaviour, and we have firm plans for the steps we need to take to significantly reduce the time to resolve should a similar scenario arise.
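
As an illustration of the kind of protective state described above, the sketch below implements simple token-bucket admission control: requests beyond a configured rate are shed rather than queued, so the component keeps operating within its capacity.  The class, names and rates are hypothetical and do not reflect the SBC vendor's actual implementation.

```python
# Illustrative sketch only: token-bucket admission control, the general idea
# behind a protective state that sheds excess load so a component keeps
# operating within its capacity. Names and rates are hypothetical.
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # sustained requests admitted per second
        self.capacity = burst         # short-term burst allowance
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True               # admit the request
        return False                  # overloaded: reject or defer the request


# Example: admit at most ~500 registration attempts per second, burst of 100.
limiter = TokenBucket(rate_per_sec=500, burst=100)
```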

 

  4. Why did the Disaster Recovery built into Horizon fail? (Predefined diverts)

From our analysis, all pre-configured and active DR plans would have continued functioning as expected, provided they were routing to off-net numbers (i.e. non-Gamma).  Users may have experienced issues logging onto the Gamma Portal throughout this outage to activate or amend any of the following services:

 

Call Diverts/Forwards

Twinning

Remote Office

Sequential Ring

 

Horizon Connect calls were unaffected, both inbound and outbound.

 

 

  5. Why did the Horizon Portal fail during this incident?

We received around three times the number of concurrent users/active sessions on the Horizon portal that we would expect on a busy weekday.  As a result, the Horizon portal failed to deal with all requests in a timely manner, meaning users saw page timeouts and were presented with generic server error messages ("error 500").  To remedy this, the Horizon development team applied a configuration change to increase the processing capability available to each of our Horizon web server applications, effectively granting them access to additional server resources to process the web page requests.
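
The configuration change described above amounts to giving each web application more concurrent processing capacity.  As a generic illustration (not the Horizon portal's actual code or settings), the sketch below shows a worker pool whose size is raised so that more page requests can be handled in parallel; the worker count and request handler are hypothetical.

```python
# Illustrative sketch only: raising the worker pool available to a web
# application so more page requests are processed concurrently. The worker
# count and request handler are hypothetical, not the Horizon portal's code.
from concurrent.futures import ThreadPoolExecutor

WORKER_COUNT = 32  # raised from a lower default to absorb ~3x normal load


def handle_request(request_id):
    # Placeholder for rendering a portal page.
    return f"rendered page for request {request_id}"


pool = ThreadPoolExecutor(max_workers=WORKER_COUNT)
results = [f.result() for f in [pool.submit(handle_request, i) for i in range(100)]]
```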

 

  6. If this was due to a software patch, why was this not tested before release / rolled back?

The cause of this incident was not a software patch, but a bug on the Horizon platform and the onward impact of the load placed on the SBCs once the patch was applied.  We have a standard testing process for all configuration updates.  The decision to patch the bug was taken within an incident environment to resolve the initial root cause, and this was successful.

 

  7. Was the patch not tested in advance?

The patch was in the process of being tested by our Engineering team; however, it wasn't classed as 'critical' and therefore wasn't subject to an expedited deployment.

 

  8. Why were the updates so vague, and why were there sometimes none at all?

During the main part of the day (from 10am onwards) we were working closely with our technology partners on a series of steps to stabilise the platform.  The incident was very complex in nature, which meant we were unable to communicate with any certainty on the time to resolve.  Our network monitoring provides visibility of registered handsets and of call traffic volumes and profiles, but the impact of this incident was very difficult to articulate because the behaviours experienced by our Horizon users were so varied.  We could see progress being made at a network level, but this wasn't always consistent with what customers were reporting.  We will be conducting a full review of the communication methods and content used throughout the incident and will set out plans to dramatically improve this over the coming weeks.

 

  9. The lack of meaningful updates tells me you didn’t know what the issue was!

We understood the root cause very early in the day.  The updates we provided reflected the visibility we had at a network level of the change in behaviour following the remedial steps we were taking throughout the day.

 

  10. Convince me Gamma will prevent this from happening in the future.

The Horizon incident on the 14th November was unprecedented for us in terms of the scale of impact.  We have previously seen good stability across this platform: the only other outage we’ve had on Horizon this year was on the 21st March, which impacted the supplementary services (Client, Integrator etc.); prior to that, the last outage was in November 2017 (a BT issue impacting 831 users), and before that March 2016.  The platform has been running at 99.99% availability since 2014 and 100% availability in 2018 prior to the 14th.  All our efforts are directed toward bringing our service back to and beyond the levels you expect over the long term, and restoring faith in our platform and the trust of our customers.
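
For context on the availability figures quoted above, the short calculation below shows what 99.99% availability corresponds to in downtime per year; it is a generic worked example, not a statement of Horizon's measured downtime.

```python
# Illustrative arithmetic only: what the quoted availability figure means in
# downtime per year (a generic calculation, not measured Horizon downtime).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

downtime_at_9999 = MINUTES_PER_YEAR * (1 - 0.9999)
print(f"99.99% availability allows about {downtime_at_9999:.0f} minutes of downtime per year")
# -> about 53 minutes per year
```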