Cloud Applications: The Cost of Resilience

If you want to run cloud applications reliably, you have to think about their architecture. A study compares different scenarios.

Companies that move their applications to the cloud expect not only lower costs but above all greater reliability. Cloud-native applications, at least in theory, run on some virtual machine somewhere in the cloud.

If the VM dies, the application restarts on another VM, or several instances of the application run on different VMs so that one is always available. That things do not always run so smoothly in practice was shown, for example, by the major AWS outage in December last year.

The Uptime Institute, which specializes in consulting, research and certification in the field of data centers, took a closer look at the reliability of a distributed, cloud-native application. The question was: how much effort does it take to achieve a given level of fault tolerance, and how much does that cost? Amazon’s AWS served as the example cloud, but the data should be roughly transferable to other large hyperscalers.

Failsafe at all costs?

The study is based on the division into zones and regions, as used by all major cloud providers: regions such as us-east-1 or eu-central-1 are completely separate from other regions, and the failure of one region should not affect other regions.

Resources are not automatically replicated across different regions. Zones (e.g. us-east-1a) are isolated locations within a region and are designed to take over workloads from one another in the event of a failure.

The Uptime Institute considers three scenarios: the failure of a virtual machine, of a zone, and of an entire region. A distinction is then made between a failover scenario (if one instance fails, a new instance starts on another VM, which can mean downtime of up to 15 minutes) and an active-active scenario in which a second, already running instance takes over immediately without downtime. A simple WordPress website that delivers static data served as the stateless test application. Stateful applications that process data dynamically are to be examined in a further study.

Multiple zones at no extra charge

Compared to a single WordPress VM, a scenario with two active VMs and a load balancer (active-active) causes additional costs of 43 percent. In terms of costs, it doesn’t matter whether the second VM runs in the same zone or in a different zone, but it does when it comes to availability: AWS promises 99.95 percent availability for two VMs in the same zone and 99.99 percent for two VMs in different zones. In the failover scenario, the additional costs are 14 percent in both cases.
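To make the two availability figures more tangible, they can be converted into the maximum downtime per year that each one permits. The following is only an illustrative sketch: the function name and the use of a 365-day year are assumptions made here, not part of the study.

```python
# Assumption for illustration: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR / 60.0


# Availability figures quoted in the article for two VMs behind a load balancer.
for label, percent in [("same zone", 99.95), ("different zones", 99.99)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label}: {percent}% allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.95 percent leaves room for a little over four hours of downtime per year, while 99.99 percent leaves just under one hour, at the same additional cost.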

It becomes more expensive if the instances are distributed across two regions: depending on whether the instance in the second region is already running or has yet to be started, the additional costs are between 51 and 111 percent in the active-active scenario. The calculated availability is 99.999999 percent, which corresponds to less than one second of downtime per year.
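One plausible way to arrive at a figure of this size is to treat the two regions as independent deployments, each with the 99.99 percent availability quoted for a multi-zone setup, and to require only one of them to be up. Whether the study calculates it exactly this way is an assumption; the sketch below, with a hypothetical helper name, merely shows that the arithmetic is consistent with the numbers in the text.

```python
# Assumption for illustration: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent deployments when one surviving copy suffices.

    The service is down only if every copy is down at the same time,
    so the individual unavailabilities are multiplied.
    """
    return 1.0 - (1.0 - single) ** copies


# Assumption: each region is available 99.99% of the time and fails independently.
two_regions = combined_availability(0.9999, 2)
downtime_seconds = (1.0 - two_regions) * SECONDS_PER_YEAR

print(f"combined availability: {two_regions * 100:.6f}%")
print(f"implied downtime: about {downtime_seconds:.2f} seconds per year")
```

The independence assumption is the weak point of such a calculation: an outage that affects several regions at once, such as a fault in a shared control plane, would make the real figure lower.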

The study by the Uptime Institute is available for download in exchange for contact details. It details the scenarios and cost calculations and discusses both the cloud providers’ service level agreements and the compensation they pay in the event of an outage.