Runscope API Monitoring

What Developers Can Do to Help Achieve High Availability

By Theo Despoudis.

Photo by Daniel Hansen on Unsplash

Theo Despoudis is a guest contributor for the Runscope blog. If you're interested in sharing your knowledge with our readers, we would love to have you! Please fill out this short form and we'll get in touch with you.

Whose job is it to achieve high availability? Your first thought may go to the IT Ops folks. After all, they are the ones who are responsible for provisioning infrastructure, monitoring for problems and resolving incidents quickly in order to minimize service disruptions.

Yet the burden of high availability should not be on IT Ops alone. Developers play an important role too in making sure that applications and the infrastructure that hosts them are highly available. Let’s explore how in this article.

Defining Availability

For the uninitiated, availability denotes the degree to which a system or a subsystem is in a functional state. Typically this is expressed mathematically as the following percentage calculation:

Availability (%) = ((total time − downtime) / total time) × 100

where both times are measured over the same period, typically a year.

It is common for end users to expect availability in the range of 99% ("two nines") to 99.999% ("five nines").
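To make those "nines" concrete, here is a quick sketch (in JavaScript, to match the example later in this article) that converts an availability target into an annual downtime budget. The function name is my own, just for illustration:

```javascript
// Convert an availability target (as a percentage) into the
// maximum allowed downtime per year, in minutes.
function downtimeMinutesPerYear(availabilityPercent) {
  const minutesPerYear = 365 * 24 * 60; // 525,600 minutes in a year
  return minutesPerYear * (1 - availabilityPercent / 100);
}

console.log(downtimeMinutesPerYear(99).toFixed(1));     // "two nines": ~5256 minutes (~3.65 days) per year
console.log(downtimeMinutesPerYear(99.999).toFixed(1)); // "five nines": ~5.3 minutes per year
```

The jump from two nines to five nines shrinks the yearly downtime budget from days to minutes, which is why every extra nine gets dramatically harder to deliver.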

Why Do Systems Become Unavailable?

At the risk of sounding like a pessimist, I could sum up my philosophy about designing software systems as a form of Murphy’s Law. I assume that if my software can fail, it will fail.

Why will my software fail? The potential reasons are infinite, of course, but below are some of the most common causes of software failures that interrupt availability:

  • No effective monitoring: Without proper visibility at runtime, it takes longer to diagnose and resolve issues when they occur. Developers are responsible for making their applications easy to monitor and for providing actionable metrics in their logs.

  • Software defects: A bad algorithm, a misunderstood requirement or a lack of understanding of the underlying system can turn software from functional to dysfunctional in a split second. It’s the responsibility of developers to do proper code reviews, write tests and stay on top of their solutions at all times.

  • Hardware failures: Hard drives or chips may burn or cease to work.

  • No redundancy: That includes network, storage or physical redundancy.

  • Cascading failures: A cascading failure is a process in a system of interconnected parts in which the failure of one or a few parts triggers the failure of others. For example, if a service goes down and the other services are not designed to handle that failure, their requests can propagate exponentially and cause an outage. One real-world example of this is the Amazon Elastic Block Store incident. Developers need to write applications that handle errors and timeouts gracefully and remain resilient to failure.
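The graceful handling of errors and timeouts mentioned above can be sketched in a few lines. This is a minimal illustration, not a production pattern; `fetchProfile` and the fallback value are assumptions introduced for the example:

```javascript
// Wrap any promise so that a hung downstream call fails fast
// instead of holding the caller's request open and letting the
// stall cascade upward through the system.
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise(function (_, reject) {
      setTimeout(function () { reject(new Error('timeout')); }, ms);
    }),
  ]);
}

// Degrade gracefully: return a cached fallback when the dependency
// times out or errors, instead of propagating the failure to callers.
async function getProfile(userId, fetchProfile, cachedProfile, timeoutMs) {
  try {
    return await withTimeout(fetchProfile(userId), timeoutMs);
  } catch (err) {
    return cachedProfile; // stale data beats an outage
  }
}
```

The key idea is that a caller always gets an answer within a bounded time, so one slow or dead dependency cannot exhaust the caller's own connection pool and take it down too.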

Again, there are many other reasons why software could become unavailable. But for the rest of this article, we’ll be focusing on the above, because they are all problems that developers can help to address.

The CAP Theorem and High Availability

Sooner or later, when developers and technical architects are designing their systems, they need to decide what sort of consistency model they want to apply to their application state when operational problems occur. There is a well-known theorem that helps guide this process: the CAP theorem.

The CAP theorem states that a distributed data store cannot guarantee more than two of the following three properties at the same time: consistency, availability and partition tolerance.

Network partitioning is unavoidable (engineers have to factor it into their solutions) and can happen at any time, so the real choice is between consistency and availability. If we choose availability, the system will always process the request and try to return the most recent available version of the information, rather than shutting down because of stale data or a network error. This choice has little effect on normal day-to-day operations, but it determines the outcome when a network partition does occur, and it can affect the overall service-level agreement of the system.

In terms of achieving high availability, developers must design their applications so that every request is handled by at least one working service. This discussion is primarily scoped to software, but there are equivalent concepts for hardware availability.

How can we achieve that in practice when developing APIs? In two main ways. First, while serving a client request, the API service endpoints can detect whether the backend servers are down or the network request has timed out, using service discovery or periodic heartbeats. They can then respond from their local cache if necessary.
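A minimal sketch of that first approach might look like the following. All of the names here (`pingBackend`, `callBackend`, `localCache`) are assumptions for illustration, not a real library's API:

```javascript
// Periodic heartbeat: mark the backend healthy or unhealthy by
// pinging it at a fixed interval and recording the result.
function startHeartbeat(pingBackend, health, intervalMs) {
  return setInterval(async function () {
    try {
      await pingBackend();
      health.healthy = true;
    } catch (err) {
      health.healthy = false;
    }
  }, intervalMs);
}

// Serve the request from the backend when it is healthy, and
// degrade to the local cache when it is known to be down or errors out.
async function handleRequest(req, callBackend, localCache, health) {
  if (!health.healthy) {
    return localCache.get(req.url); // backend known down: serve stale data
  }
  try {
    const res = await callBackend(req);
    localCache.set(req.url, res); // keep the cache warm for future failures
    return res;
  } catch (err) {
    return localCache.get(req.url); // timeout or network error: degrade to cache
  }
}
```

The heartbeat lets the endpoint skip doomed backend calls entirely, while the per-request fallback covers failures that happen between heartbeats.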

Second, the API consumers (a.k.a. clients) can adopt the same caching strategy in their own applications by utilizing newer technologies like service workers to implement a cache, update and refresh strategy.

For example, see the following snippet written in JavaScript:

const CACHE_KEY = 'api-cache-v1';

self.addEventListener('fetch', function (evt) {
  // Cache: respond immediately with the cached copy
  evt.respondWith(fromCache(evt.request));
  // Update and Refresh: fetch a fresh copy in the background
  evt.waitUntil(update(evt.request).then(refresh));
});

// Cache: return the cached response, falling back to the network on a miss
function fromCache(request) {
  return caches.open(CACHE_KEY).then(function (cache) {
    return cache.match(request).then(function (response) {
      return response || fetch(request);
    });
  });
}

// Update: fetch from the network and store the fresh response in the cache
function update(request) {
  return caches.open(CACHE_KEY).then(function (cache) {
    return fetch(request).then(function (response) {
      return cache.put(request, response.clone()).then(function () {
        return response;
      });
    }).catch(function (error) {
      // In case of an error from the request we want to return the cached response
      return cache.match(request);
    });
  });
}

// Refresh: notify any open pages that fresh data is available
function refresh(response) {
  return self.clients.matchAll().then(function (clients) {
    clients.forEach(function (client) {
      client.postMessage({ type: 'refresh', url: response && response.url });
    });
  });
}

This code runs in the service worker. For every request, we serve from the cache while simultaneously requesting from the network. Even if the network is down and no servers are reachable, we can still serve cached results to clients, thus maintaining the availability of the service. When the network becomes available again, or through background sync or push notifications (with some extra handlers to sort out any pending writes), the cache is refreshed with the latest results. (Check this page for more information about the waitUntil Web API.)

These types of high-level architectural decisions affect the behavior of the system and should be made as early in the design process as possible. When developers are working on their apps, they need to be aware of the capabilities and limitations of each approach and apply best practices to stay on track.

Security Controls and High Availability

In addition to making sure their systems can meet uptime requirements, companies that move their business operations to the cloud need to manage them securely. Because cloud applications are always connected, security should not be left as an afterthought; a lack of it can have serious negative outcomes.

The mantra of security is confidentiality, integrity and availability, and there is a reason for that. Availability, in security terms, refers to the practice of ensuring that authorized parties are able to access information when needed. This information needs to be accessed by the right people (authentication and authorization), a proper record of access should be kept (accounting), and the right usage limits should be enforced (throttling and quotas).
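To illustrate the throttling and quotas piece, here is a minimal sketch of a fixed-window rate limiter. The name `createThrottle` and the structure are assumptions for this example, not any particular product's API:

```javascript
// Fixed-window throttle: each client may make at most `limit`
// requests per `windowMs` milliseconds.
function createThrottle(limit, windowMs) {
  const counts = new Map(); // clientId -> { count, windowStart }
  return function allow(clientId, now = Date.now()) {
    const entry = counts.get(clientId);
    if (!entry || now - entry.windowStart >= windowMs) {
      counts.set(clientId, { count: 1, windowStart: now }); // start a new window
      return true;
    }
    if (entry.count < limit) {
      entry.count += 1;
      return true;
    }
    return false; // over quota: reject (e.g. with HTTP 429) instead of overloading
  };
}
```

By rejecting excess requests cheaply at the edge, a quota like this keeps one noisy or abusive client from degrading availability for everyone else.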

The inverse effect is denial of service, a common attack today. The primary aim of these attacks (DoS, DDoS) is to deny users the ability to access websites and consume APIs, thus affecting the availability of the system.

Beyond protecting those systems with web application firewalls and proper backup strategies, developers are responsible for following security best practices, performing regular code reviews and keeping their libraries up to date. In other words, proper security controls must be built into the development workflow and rigorously assessed by an external vendor through penetration testing.

Conclusion

To meet challenging end goals such as high availability, all parties are responsible. On the one side, system admins and operations teams must be on their toes and proactively watch their systems; on the other side, software developers and architects must make their best efforts to minimize software issues and thoroughly test their APIs against all possible negative scenarios.

We recognize that failures are inevitable, but if we try to do our best to contain the damage, then we will prevent failing services from bringing down our applications. As Ben Franklin said, “an ounce of prevention is worth a pound of cure.”

Categories: apis, monitoring
