25th Nov 2016 by Kurt Garloff

Service Interruption on OTC

Executive Summary

The Open Telekom Cloud (OTC) will undergo major changes in December.

During this time, the Management/Control Plane (Web Interface and API) will be unavailable several times for a number of minutes. The User/Data Plane (Virtual Resources) will be unreachable for a few minutes as well. In addition, due to the nature of the changes, the Virtual Machines (ECS) will not be live-migrated during this round of updates; instead all VMs will be terminated and restarted a few minutes later (cold migration). The User Plane changes will be performed in AZ1 (eu-de-01) before changes are started in AZ2 (eu-de-02), so cloud native applications should not see a service disruption.

See below for the exact schedule of the changes.

OpenStack upgrade - control plane interruption

OTC is currently based on a hardened and enhanced version of OpenStack Juno. To deliver additional features and enhancements from newer OpenStack versions, the OTC infrastructure will receive major upgrades in December. After the upgrade, OTC will be based on a hardened and enhanced version of OpenStack Mitaka, an open source platform that is 18 months newer. While customers will immediately benefit from some enhanced capabilities (e.g. API compatibility with newer tools), some of the features that then become possible will only be enabled in the subsequent months.

OpenStack is based on many collaborating services; while many of the services can be run and updated independently and without service interruption, there are dependencies between the core identity, compute, storage and image services. A major update of these from Juno to Mitaka in an uninterrupted rolling fashion has been deemed too risky by T-Systems engineering and operations teams.

The core services will thus be upgraded all together -- this will result in control functions (querying or changing virtual resources, e.g. starting new VMs or attaching storage to a running VM) being unavailable during the upgrade of the OpenStack services. Existing virtual machines, however, will continue to work and have access to their assigned networking and storage resources.

T-Systems is working with its main technology partner Huawei to follow the upstream OpenStack release cycle more quickly and in smaller steps in the future, so that rolling upgrades can be performed without any control function interruption.

Cold migration -- user plane interruption

The nature of the changes requires that each compute host be restarted, which results in all VMs on the host being terminated. While in general VMs on OTC are live migrated to other hosts to avoid customer impact for host maintenance (such as the installation of security updates to the hypervisor), a mass live migration in the heterogeneous environment caused by the major OpenStack update has been rejected by the engineering team based on a risk and performance analysis.

This means that the VMs will be shut down and restarted again a few minutes later, after the hosts have rebooted (cold migration).

The shutdown of VMs will only affect the VMs of one availability zone at a time. A cloud-native application setup that has been structured to survive an outage of one availability zone will thus survive the cold migration without any service unavailability, though the application might see performance degradation due to resource constraints when only one AZ is available.

T-Systems is aware that not all applications on OTC are cloud-native and that not all of them handle cold migration well. We are currently working with Huawei to improve the performance and reduce the risks of massive live migrations, and on processes to identify VM-specific risks and determine eligibility for live migration, so that cold migrations can be largely avoided in the future. Please note that special flavors using passthrough (SR-IOV) network acceleration and local disk access (SAP HANA, HPF and DiskIntensive flavors) will still undergo cold migrations in the future due to limitations in live migrating hardware state. There are also limitations caused by custom OS kernels, which can result in failed migrations.

Customer effects

The effects on customer workloads depend on the nature of the workload. Cloud-native workloads will face different challenges from classical workloads.

Effect on cloud-native workloads

Cloud-native applications are built with redundancy at the software level. In particular, the application is expected to be aware of the availability zone concept and have distributed resources allowing the application to survive the unavailability of a complete AZ.

We recommend that customers test this scenario to ensure that the application does indeed handle an AZ outage well.
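As a first step before an actual failover test, the distribution of resources across availability zones can be checked on paper. The following is a minimal sketch, assuming you maintain your own inventory as a list of (role, availability zone) pairs; the function name, the role names and the inventory format are hypothetical and not part of any OTC tooling.

```python
from collections import defaultdict


def roles_lost_per_az(instances):
    """Report which application roles would have no instance left per AZ outage.

    `instances` is a list of (role, availability_zone) pairs taken from
    your own inventory (hypothetical format, for illustration only).
    Returns a dict mapping each AZ to the sorted list of roles that run
    exclusively in that AZ.
    """
    zones_by_role = defaultdict(set)
    for role, az in instances:
        zones_by_role[role].add(az)

    all_zones = sorted({az for _, az in instances})
    return {
        az: sorted(role for role, zones in zones_by_role.items() if zones == {az})
        for az in all_zones
    }


# Example inventory: "db" only runs in eu-de-01 and is therefore at risk.
inventory = [
    ("web", "eu-de-01"),
    ("web", "eu-de-02"),
    ("db", "eu-de-01"),
]
for az, roles in roles_lost_per_az(inventory).items():
    print(az, "outage would take down:", roles or "nothing")
```

A role listed for an AZ has no redundancy outside that zone. Such a check does not replace a real test, since it says nothing about whether the remaining instances can carry the load.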

A cloud-native application tends to use the APIs (or cloud management or orchestration tools that use the APIs) to adjust its virtual footprint to the load situation and to the availability of resources. During the control plane downtime (due to the major OpenStack upgrade), API endpoints may be unavailable or control operations might fail. The application can thus not scale up or down ("breathe") during those periods; while this may result in suboptimal performance, the application is expected to handle this gracefully. Customers expecting load increases during this time may compensate by manually triggering scaling operations in advance.
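One common way to handle a temporarily unavailable control plane gracefully is to retry control operations with exponential backoff instead of failing on the first error. The following is a minimal sketch of that pattern; the exception class `ControlPlaneUnavailable`, the function `call_with_backoff` and its parameters are hypothetical, and a real application would map the errors raised by its own API client onto this exception.

```python
import time


class ControlPlaneUnavailable(Exception):
    """Hypothetical error raised when an API endpoint cannot be reached."""


def call_with_backoff(operation, attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call `operation()` and retry with exponential backoff on failure.

    The delay doubles after each failed attempt, capped at `max_delay`.
    After `attempts` failures the last error is re-raised, so the caller
    can decide to keep running with the current footprint.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ControlPlaneUnavailable:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


def scale_out():
    # Placeholder for a real control operation, e.g. starting a new VM.
    raise ControlPlaneUnavailable("API endpoint not reachable")


try:
    call_with_backoff(scale_out, attempts=3, base_delay=0.1)
except ControlPlaneUnavailable:
    print("control plane still unavailable, continuing with current capacity")
```

The attempt count and delays are arbitrary here and should be chosen to match the announced length of the control plane interruption.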

Effect on classical workloads

Classical workloads will be affected by the unavailability of resources in one availability zone. Customers should thus expect the application to be unavailable during the cold migration.

We recommend that customers prepare for the unavailability; customers can manually move resources to the currently available zone and thus ensure continuous availability of their service.

In any case, we recommend ensuring that all VMs of the application handle reboots well and come up again without manual intervention.

Schedule

.... to be written ....

Updates and further information

SDMs Live Blog?