In this series of posts, I will walk you through architecting, building and deploying a large scale, multi-region, active-active architecture, all while trying to break it. My initial idea is to split the series into the following structure:
- The Quest for Availability. (this post)
- Why and how do we build a Multi-Region, Active-Active Architecture?
- Building a Multi-Region, Active-Active Serverless Backend.
- Breaking things with Chaos Engineering.
Of course, it might and probably will change as I start writing, so feel free to steer the course of (t)his (s)tory :)
System Failure.
One of my favourite quotes, and one that has shaped my thinking about software engineering, comes from Werner Vogels, CTO at Amazon.com:
“Failures are a given and everything will eventually fail over time.”
Indeed, we live in a chaotic world, where failure is a first-class citizen. Failure usually comes in three flavours: early failures, wear-out (or late) failures and random failures, each appearing at a different stage in the life of any given system.
The “bathtub” curve of failure.
Early failures are essentially related to programming and configuration bugs (typos, variable mutations, networking issues like port and IP routing misconfiguration, security, etc.). Over time, as the product (or version) matures and as automation kicks in, those failures tend to naturally diminish.
Note: I just mentioned “automation kicks in”! This really means that you have to be using automation to experience this natural decline in early failures. Doing things manually won’t allow for that luxury.
Wear-out (or late) failures: you often read online that software systems, unlike physical components, are not subject to wear-out failures. Well, software runs on hardware, right? Even in the cloud, software is subject to hardware failure and should therefore be accounted for. But that’s not all: wear-out failures are also, and most often, related to configuration drift. Indeed, configuration drift accounts for the majority of reasons why disaster recovery and high availability systems fail.
Random failures are basically, well, random. A squirrel eating your cables. A shark brushing its teeth on transatlantic cables. A drunk truck driver aiming at the data centre. Zeus playing with lightning. Don’t be a fool; over time, you too will eventually fall victim to ridiculous, unexpected failures.
BUT
we live in a world where velocity is critical, and by that I mean being able to deliver software continuously. To give you an idea of velocity at scale, Amazon.com was doing approximately 50 million deployments a year in 2014, which is roughly 1.6 deployments per second. Of course, not everyone needs to do that, but the velocity of software delivery, even at a smaller scale, has a big impact on customer satisfaction and retention.
So how does velocity impact our “bathtub” failure rate curve? Well, it now looks more like the mouth of a shark ready to eat you raw. And indeed, for each new deployment, new early failures will be thrown at you, hoping to take your system down.
As you can easily notice, this creates a tension between the pursuit of high availability and the speed of innovation. If you develop and ship new features slowly, you will have better availability, but your customers will probably seek innovation from someone else. On the other hand, if you go fast and innovate constantly on behalf of your customers, you risk failures and downtime, which they will not like.
To help you grasp what you are fighting against, I included the table of “The Infamous Nines” of availability. Let that table sink in for a minute.
If you want to have 5-nines of availability, you can only afford 5 minutes of downtime a year!!
“The Infamous Nines” of Availability
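If you want to sanity-check these numbers yourself, a quick back-of-the-envelope calculation does the trick (sketched in Python, purely for illustration):

```python
# Downtime budget per year for the usual "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability * 100:.3f}% available -> {downtime:,.1f} minutes of downtime per year")
```

Five nines really does leave you only about five minutes a year.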
A few years ago, I experienced first-hand a complete system meltdown. It took our team a few minutes just to realise what was happening, another few minutes to get our sh*t together and slow our heart rate down, and another couple of hours to complete a full system restore.
Lesson learned: If any humans are involved in restoring your system, you can say bye-bye to the Infamous Nines.
So how can you reconcile both availability and velocity for the greater good of your customers?
There are three important things, namely:
- Architecting highly reliable and available systems.
- Tooling, automation and continuous delivery.
- Culture.
Simply put, what you should aim for is having everyone in the team confident enough to push things into production without being scared of failure. And the best way to do so is by first having highly available and reliable systems, having the right tooling in place and nurturing a culture where failure is accepted and cherished. In what follows, I will focus more on the availability and reliability aspect of things.
It is worth remembering that, generally speaking, a reliable system has high availability, but an available system may or may not be very reliable.
Understanding Availability.
Consider you have two components, X and Y, with 99% and 99.99% availability respectively. If you put those two components in series, the overall availability of the system gets worse: 99% × 99.99% ≈ 98.99%.
It is worth noting that the common wisdom “a chain is only as strong as its weakest link” does not quite hold here: the chain in series is actually weaker than its weakest link.
On the other hand, if you take the weaker of these components, in this case X with 99% availability, and put two of them in parallel, you increase your overall system availability dramatically: 1 − (1 − 99%)² = 99.99%. The beauty of math at work, my friends!
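To make the arithmetic concrete, here is a small illustrative sketch of the series and parallel cases (the function names are my own; the values are the X and Y from above):

```python
def series(*availabilities):
    """Components in series: availability is the product of the individual availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Redundant components in parallel: 1 minus the product of the individual unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1 - a)
    return 1 - unavailable

x, y = 0.99, 0.9999
print(f"X and Y in series:   {series(x, y):.4%}")    # ~98.99%, worse than either alone
print(f"Two X's in parallel: {parallel(x, x):.4%}")  # 99.99%, far better than X alone
```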
What is the take away from that?
Component redundancy increases availability significantly!
Note: you can also calculate availability with the following equation:
Calculating System Availability
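The exact form varies from source to source, but a commonly used version, expressed in terms of mean time between failures (MTBF) and mean time to repair (MTTR), looks like this:

```latex
\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```

In other words, the less often you fail and the faster you recover, the higher your availability.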
Alright, now that we understand that part, let’s take a look at how AWS Regions are designed.
AWS Regions.
From the AWS website, you can read the following:
The AWS Cloud infrastructure is built around Regions and Availability Zones (“AZs”). A Region is a physical location in the world where we have multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking and connectivity, housed in separate facilities.
Since a picture is worth 48 words, an AWS Region looks something like this.
An example AWS Region with 3 AZs.
Now you probably understand why AWS is always, always talking about and advising its customers to deploy their applications across multiple AZs, preferably three of them. Just because of this equation, my friends.
By deploying your application across multiple AZs, you magically, and with minimal effort, increase its availability.
Application deployed across multiple AZs using an Elastic Load Balancer (ELB).
This is also the reason why using AWS regional services like S3, DynamoDB, SQS, Kinesis, Lambda or ELBs, just to name a few, is a good idea: they use multiple AZs under the hood by default. And this is also why using RDS configured for multi-AZ deployment is neat!
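As a back-of-the-envelope sketch of why three AZs is the usual recommendation, assume (purely for illustration, the number is hypothetical) that your application in a single AZ is 99% available and that AZ failures are independent:

```python
# Hypothetical single-AZ availability; real numbers depend on your application.
single_az = 0.99

for n_azs in (1, 2, 3):
    availability = 1 - (1 - single_az) ** n_azs
    print(f"{n_azs} AZ(s): {availability:.6%} available")
```

Each additional AZ multiplies the unavailability by another factor of (1 − 0.99), which is exactly the parallel-availability effect from earlier.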
The price of Availability
One thing to remember though is that availability does have a cost associated with it. The more available your application needs to be, the more complexity is required and therefore the more expensive it becomes.
Indeed, highly available applications have stringent requirements for development, testing and validation. Above all, they must be reliable, and by that I mean fully automated and self-healing, which is the capability of a system to auto-magically recover from failure. They must dynamically acquire computing resources to meet demand, but they should also be able to mitigate disruptions such as misconfigurations or transient network issues. Finally, all aspects of this automation and self-healing capability must be developed, tested and validated to the same high standards as the application itself. This takes time, money and the right people, and thus it costs more.
Taking it up a notch
While there are tens, or even hundreds, of techniques used to increase application reliability and availability, I want to mention two that, in my opinion, stand out.
Exponential backoff
Typical components in a software system include multiple (service) servers, load balancers, databases, DNS servers, etc. In operation, and subject to the potential failures discussed earlier, any of these can start generating errors. The default technique for dealing with these errors is to implement retries on the requester side. This simple technique increases the reliability of the application and reduces operational costs for the developer.
However, at scale, if requesters attempt to retry the failed operation as soon as an error occurs, the network can quickly become saturated with new and retried requests, each competing for network bandwidth, and the pattern continues until a full system meltdown occurs.
To avoid such scenarios, exponential backoff algorithms must be used. Exponential backoff gradually increases the waiting time between retries (that is, it reduces the retry rate), thus avoiding network congestion scenarios.
In its most simple form, a pseudo exponential backoff algorithm looks like this:
Simple exponential backoff algorithm
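Here is a minimal illustrative sketch in Python; the operation callable and the default delays are placeholders, not a prescription:

```python
import time

def do_with_retries(operation, max_retries=5, base_delay=0.1, max_delay=10.0):
    """Call `operation`, retrying with exponentially growing waits between attempts."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, propagate the last error
            # Double the wait on every attempt, capped at max_delay.
            time.sleep(min(base_delay * (2 ** attempt), max_delay))
```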
Note: If you use concurrent clients, you can add jitter to the wait function to help your requests succeed faster. See here.
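A common way to do this is the “full jitter” approach described on the AWS Architecture Blog, where each wait is drawn uniformly between zero and the exponential cap. A sketch of just the wait function:

```python
import random

def full_jitter_delay(attempt, base_delay=0.1, max_delay=10.0):
    """Return how long to sleep before retry number `attempt`, using full jitter."""
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)  # spreads concurrent clients apart in time
```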
Luckily, many SDKs and software libraries, including the AWS ones, implement a version (often more sophisticated) of this algorithm. However, don’t assume it: always verify and test for it.
Queues
Another important pattern to increase your application’s reliability is using queues, in what is often called a message-passing architecture. The queue sits between the API and the workers, allowing for the decoupling of components.
Message-passing pattern with queues.
Queues give clients the ability to fire-and-forget requests, letting the task, now in the queue, be handled by the workers when the right time comes. This asynchronous pattern is incredibly powerful at increasing the reliability of complex distributed applications, but it is unfortunately not as straightforward to put in place as exponential backoff, since it requires re-designing the client side. Indeed, requests no longer return the result, but a JobID, which can be used to retrieve the result when it is ready.
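As a toy, in-process sketch of the pattern (a real system would put something like SQS between the API and the workers, and the function names here are made up for illustration, but the shape is the same):

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for a managed queue such as SQS
results = {}           # stands in for a result store the client can query later

def submit(payload):
    """API side: enqueue the task and immediately return a JobID (fire-and-forget)."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def worker():
    """Worker side: pick up tasks whenever they arrive and store the results."""
    while True:
        job_id, payload = jobs.get()
        results[job_id] = payload.upper()  # pretend this is the real work
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit("hello")   # the client only gets a JobID back
jobs.join()                # in practice the client would poll with the JobID
print(job_id, "->", results[job_id])
```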
Cherry on the cake
Combining message-passing patterns with exponential backoff will take you a long way in your journey to minimise the effect of failures on your availability, and it is in the top 10 of the most important things I have learned to architect for.
That’s it for this part. I hope you have enjoyed it. Please do not hesitate to give feedback, share your own opinion or simply clap your hands. The next part will hopefully be published next week. Stay tuned!
-Adrian