Thursday, November 21, 2013

Preparing for the Cloud

Summary
In this post I'm going to discuss three common pitfalls that you should be wary of when you are planning to put your first application into the cloud. I'll be focusing on Windows Azure, but the points I make apply equally to other public cloud providers.
On-Premises Apps vs Cloud Apps
If you're like many people who have not yet put a production application into the cloud, you may have a lot of questions about what that means. Developing apps for the cloud often uses tools and languages that you're already familiar with. However, the approach that you take to designing your app should be significantly different.
Probably the most significant way in which these two environments differ is the volatility of the services and infrastructure that you can depend on. Keep in mind that when you are running in a public cloud environment, you are running in a distributed, virtualized data center, and are sharing your resources with many other tenants. The effect of this can be quite severe in some cases.
Transient Errors
The cloud is a highly distributed environment. The resources that you depend on will likely not reside on the same host, or even in the same rack or data center, as your app. Much of the infrastructure that supports a cloud provider's resources is highly redundant and configured for high availability. But that doesn't mean that blips in connectivity and availability don't occur. Sometimes those blips can extend long enough to become bleeps - usually from you at 2 a.m. when you get that support call.
An important part of your application design when building for the cloud is transient error handling. What exactly does this mean? At its simplest, it means that you need to add retry logic to your code that interacts with services that are external to your app. Which ones? ALL OF THEM. Candidates for this are caching, database, storage, and other services that are probably critical to your app. This can even mean your own services. Consider your web site's service agents when they call out to that SOAP or REST service to get or update data.
Take database connectivity as an example. Typical on-premises applications assume that the database connection will work when you ask for it. It's often considered a hard failure when you are unable to connect to the database. However, in the cloud, this is a situation that can happen regularly when the environment is under stress. Those connections are valuable, and in order to maintain as much availability of resources as possible, the environment will periodically go through and close connections in the pool. Transient error handling in this case means adding retry logic around opening your connection. A good practice to follow here is to employ sliding retry logic. Rather than retrying immediately after each failed attempt, increase the amount of time you wait between retries. A Fibonacci or exponential scale can help. Set a reasonable maximum number of times to retry before your code gives up.
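To make the sliding retry idea concrete, here is a minimal sketch in Python. It is not tied to any Azure SDK: TransientError, fibonacci_delays, and retry are hypothetical names I made up for illustration, and the default attempt count and delays are placeholder values you would tune for your own app. The operation you pass in could be the code that opens your database connection.

```python
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def fibonacci_delays(base_seconds: float, max_seconds: float) -> Iterator[float]:
    """Yield waits of base * 1, 1, 2, 3, 5, 8, ... capped at max_seconds."""
    a, b = 1, 1
    while True:
        yield min(base_seconds * a, max_seconds)
        a, b = b, a + b


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_seconds: float = 0.5,
    max_seconds: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, waiting a little longer after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = fibonacci_delays(base_seconds, max_seconds)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(next(delays))
    raise AssertionError("unreachable")
```

Only the exception types listed in retry_on are retried, so a genuine bug such as a bad connection string still fails fast. The sleep parameter exists so that tests can run without actually waiting.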
Service Outages
In spite of the best efforts of cloud providers to provide five 9's of service to you, the reality is something much less. Windows Azure currently runs at a rate of around three 9's of availability. What does this mean to you as the consumer of those services? It means you need to think redundancy. For instance, if your app has a requirement for high availability, you should consider building out your deployments to be in more than one data center. Windows Azure provides tools to help with this. Whether you are using Platform as a Service or Infrastructure as a Service VMs, it is very straightforward to deploy and configure multiple instances of your application functionality in more than one data center. Using a service such as Windows Azure Traffic Manager will allow you to stand up a highly available load balancer in front of your web application or web services quickly and easily. Depending on your needs, you can configure WATM to run in a fail over or round robin mode. The latter allows you to get some benefit out of your backup deployments. Don't forget to factor into your budget the increased cost for additional hosted services that you stand up for redundancy.
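To put those 9's in perspective, here is a small back-of-the-envelope calculation in Python. The function names are my own, and the combined figure assumes that deployments fail independently of each other, which is optimistic: a shared dependency (including the load balancer itself) can take out both at once.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Three 9's -> 0.999, five 9's -> 0.99999."""
    return 1 - 10 ** -nines


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year that the service is unavailable."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, deployments: int) -> float:
    """Chance that at least one deployment is up.

    Assumes failures are independent, which real data centers only approximate.
    """
    return 1 - (1 - availability) ** deployments


three_nines = availability_from_nines(3)
five_nines = availability_from_nines(5)
print(f"three 9's: {downtime_hours_per_year(three_nines):.2f} hours down per year")
print(f"five 9's: {downtime_hours_per_year(five_nines) * 60:.2f} minutes down per year")
print(f"two three-9's deployments: {combined_availability(three_nines, 2):.6f}")
```

Three 9's works out to roughly 8.76 hours of downtime a year, versus a little over 5 minutes for five 9's. Under the independence assumption, a second deployment at three 9's closes much of that gap on paper, which is the arithmetic behind thinking redundancy.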
Oftentimes, failures in services in the cloud will only affect a portion of your application. This makes it extremely important to be able to fail over that portion of the app. Consider a traditional 3-tier application that has a web UI, services for business logic, and a SQL database. If your services tier experiences issues, it is critical for your web application users that your UI be able to switch to another set of services. Just like using WATM to handle fail over for your web UI, you can use it to fail over for your services as well. Optionally, your application can handle the fail over to a redundant set of resources depending on your implementation. One layer in an application is particularly difficult to fail over due to its statefulness: the resource layer, which is often a SQL database but can also involve file or other persistence mechanisms. Failing it over properly often involves multi-write/read logic to handle near real time redundancy, with logic to identify and manage a master and slave relationship between data providers. This is not a trivial exercise. Microsoft provides Data Sync for SQL Azure. Its window for replication is 5 minutes. This may or may not serve your needs. If you need something more granular than that, be prepared to design for it.
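If you choose to handle fail over for the services tier in the application itself rather than through WATM, the core of it can be quite small. The following Python sketch is one possible shape, not a prescribed design: call_with_failover and AllEndpointsFailed are hypothetical names, the example.com URLs are placeholders, and the caller supplies the function that actually talks to the service. It only covers stateless calls; it does nothing for the data layer problem described above.

```python
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every redundant endpoint has been tried without success."""

    def __init__(self, errors: List[Tuple[str, BaseException]]) -> None:
        super().__init__(f"all {len(errors)} endpoints failed")
        self.errors = errors


def call_with_failover(
    endpoints: Sequence[str],
    call: Callable[[str], T],
    fail_over_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Try each endpoint in priority order and return the first success."""
    errors: List[Tuple[str, BaseException]] = []
    for endpoint in endpoints:
        try:
            return call(endpoint)
        except fail_over_on as exc:
            # Remember why this endpoint was skipped, then try the next one.
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)


# Placeholder URLs for two deployments of the same services tier.
SERVICE_ENDPOINTS = [
    "https://services-primary.example.com",
    "https://services-secondary.example.com",
]
```

In practice you would probably combine this with retry logic per endpoint, and remember which endpoint last worked so that every request does not pay for a failed call to the primary first.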
Throttling
Windows Azure and the services you have access to are multi-tenant. Throttling is a way of life in the cloud and is used to ensure that the environment is not overrun with load. IT administrators are very aware of the effects that a VM instance can have on its host when it is allowed to consume too many resources. It usually means resource starvation for other VMs on that host. Throttling limits are set by the cloud provider in their infrastructure in order to limit the effects of resource starvation. They are out of your control. The single most important thing to remember about throttling is that it is meant to protect the cloud provider's resources - not your app. The effects of throttling can be insignificant or they can be debilitating, depending on the way your application is designed. In severe cases, resources that you depend on may be temporarily unavailable due to throttling. Just like a service outage, throttling can make it appear to your app that a critical service is unavailable or experiencing transient errors. One way to mitigate this is to use resources as efficiently as possible and to employ caching wherever you can, minimizing the number of trips to expensive resources such as the database.
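As a rough illustration of the caching advice, here is a tiny in-process cache with a time-to-live, written in Python. TtlCache is a hypothetical class of my own, not part of any Azure library, and in a real multi-instance deployment you would more likely use a shared cache service. The point is simply that repeated reads within the time-to-live never reach the expensive resource.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class TtlCache(Generic[K, V]):
    """Cache loader results per key for ttl_seconds before loading again."""

    def __init__(
        self,
        loader: Callable[[K], V],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[K, Tuple[float, V]] = {}

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no trip to the expensive resource
        value = self._loader(key)  # the expensive call, e.g. a database query
        self._entries[key] = (now + self._ttl, value)
        return value
```

The time-to-live is the trade-off to think about: the longer it is, the fewer calls count against your throttling limits, but the staler the data your users may see.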
Summary
This post provides a high-level view of some of the pitfalls that application developers can run into when deploying their application into the cloud. Follow-on posts will tackle these in detail and look at specific ways to mitigate them.
