The rants of a certifiable geek: cloud


Monday, August 18, 2014

Microsoft Azure Stumbles - Again

I recently wrote about being “In Search of a Highly Available Persistence Solution”. However, it seems that Azure has been experiencing other, more sweeping outages and service degradations. PCWorld recently wrote about Microsoft’s woes in “Azure cloud services have a rough week”.

Today, we are nervously watching as the service issues continue. Luckily the outages have not affected our product … yet.

Microsoft is trying very hard to gain ground against rival Amazon. This certainly doesn’t help. To be fair, cloud outages are not uncommon and when they do occur, a lot of people get pissed off. Last year’s InfoWorld article shows that it happens to the best of them.

Monday, August 11, 2014

In Search of a Highly Available Persistence Solution

Our cloud-based SaaS offerings are hosted in Azure. Like many applications, some of our apps began life in the era before cloud computing became mainstream. As a result, many design decisions had to be rethought. Some involved rather extreme makeovers just to be able to run there – like removal of CLR stored procedures (gah!) since SQL Azure didn’t support them. Others were more fundamental to multi-tenant apps, but required change nonetheless.

Along the way, we have made numerous changes that brought higher performance, stability, scalability and reliability to the products. Frankly, the Azure compute infrastructure for PaaS is excellent. Combined with Windows Azure Traffic Manager, our compute has performed admirably. Scaling is a snap. Deployment is beyond easy. Then we hit the snag. Persistence. State sucks. Compute is easy because it can be treated statelessly, but the persistence layer is another story. If your database goes belly up, you’re dead. Having an offline copy of your database could help, right? But then you think, “How recently did I make that copy?” Or, “What about data loss between backups? My user really wants to see that last transaction!” And once your original database is back online, you ask yourself, “How do I sync up the changes?” Wouldn’t it be great if that copy were a transactionally consistent copy?

SQL Azure touts high availability in the datacenter through its replicas. Every database is actually implemented as a master and two replicas. The stated goal is 99.9% availability. On paper, you might think that having two replicas of your data would be sufficient to give you high availability. Our experience is that it is not. SQL Azure does not offer a cross-datacenter high availability option. Surprisingly, in recent conversations, Microsoft has been pointing people towards SQL Server in IaaS for high availability. Their Tutorial: Database Mirroring for High Availability in Azure further supports this view. C’mon guys – you can do better than this! This is a cop-out. I want a scalable, highly available, cloud-ready repository.

Edit: The new database SKUs for Azure SQL Database (basic, standard, and premium) maintain “at least three replicas”. Please refer to this overview of the new SKUs for more information. I want to point out that our current product deployments are using the GA SKUs (Web and Business) and not the new SKUs outlined in this overview.

Time for another design check. In reality, what your application must do to get closer to the magical 5 9’s of availability is to move replication logic into the application. If you’re contemplating more than one form of persistence (we are) and if you want high availability (we do), then be prepared to roll your own. A number of architectural design sessions and a review of the options available in the Azure ecosystem have made it clear that Microsoft doesn’t have a complete answer here. It’s not entirely surprising. They give you the primitive infrastructure and framework bits and you do the rest. In this case, the platform affords you nothing, leaving you to do the rest, a.k.a. everything else…

Moreover, if you intend to be a persistence polyglot, your options are complicated. How many of you put everything in the database? Does that image really belong in there? How about those PDFs? How about your application configuration and set up information? Couldn’t blob storage do a much better job of saving that binary data? Couldn’t you use a document database to store that configuration data? Modern enterprise applications have a very diverse set of data and using a relational database for all of it means you’re fitting square pegs in round holes.

A number of design patterns can help you to abstract your application business logic from how you persist your data. You *are* using design patterns, aren’t you? A simple pattern to use is the Repository pattern. Done properly, it hides the complexities of how you persist your data from the rest of your application. That leaves you the option of doing synchronous replication to one or more replica copies of your data, which can help to ensure you have a transactionally consistent copy. Or you may choose to use asynchronous replication for higher performance at the risk of some small amount of data loss. Both approaches require your application to detect failures and to switch from the master to the replica copy.
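To make the idea concrete, here is a minimal sketch of a repository that replicates synchronously and fails over on reads. Everything in it is hypothetical and for illustration only: `InMemoryStore` stands in for a real database or blob store, and `StoreUnavailable` and `ReplicatedRepository` are names made up for this example, not part of any Azure SDK.

```python
from typing import Dict, List, Optional


class StoreUnavailable(Exception):
    """Raised by a backing store that cannot be reached (hypothetical)."""


class InMemoryStore:
    """Stand-in for a real database or blob store, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._rows: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if not self.available:
            raise StoreUnavailable(self.name)
        self._rows[key] = value

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._rows.get(key)


class ReplicatedRepository:
    """Repository that hides replication and failover from its callers."""

    def __init__(self, master: InMemoryStore, replicas: List[InMemoryStore]) -> None:
        # The master is always first, so reads prefer it.
        self._stores = [master] + list(replicas)

    def save(self, key: str, value: str) -> None:
        # Synchronous replication: save only returns once every copy has the value.
        for store in self._stores:
            store.put(key, value)

    def load(self, key: str) -> Optional[str]:
        # Try the master first, then fail over to each replica in turn.
        last_error: Optional[Exception] = None
        for store in self._stores:
            try:
                return store.get(key)
            except StoreUnavailable as exc:
                last_error = exc
        raise StoreUnavailable("no copy reachable") from last_error
```

The rest of the application only ever calls `save` and `load`. Note what the sketch leaves out: if a replica fails halfway through `save`, the master already holds the new value, so a real implementation would need some way to roll back or repair partial writes. An asynchronous variant would queue the replica writes instead of making them inline, trading that consistency for speed.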

Other, more complex design patterns can help you with this problem at the expense of complexity. Simplicity is king in software design. Embarking on an implementation of CQRS may leave you wondering why you started in a career of software development. But it can also leave you with a superior solution to a complex problem. Moreover, by implementing this in your application, you can decouple yourself from a platform and its requirements for high availability. Like everything in software development, it’s a balance of choices: flexibility, simplicity, maintainability, speed to market, performance, etc.

As some on my team are fond of saying, “If it were easy, everyone would play the game.”

Friday, November 22, 2013

Higher Education Adoption of the Cloud

Business Cloud News writes on the slow move to the cloud by higher education in "Ovum: Higher education lagging in cloud adoption". The focus in the article is on the use of Learning Management Systems for these institutions.

While I agree with the general premise of the article regarding the rate of adoption of the cloud by higher ed, I would argue that the use of cloud platforms will not just be driven by LMS usage, but will extend much further across their enterprises. Student information systems, housing systems, CRM, etc. will also drive them into the cloud - particularly where commodity platforms are much more cost effectively operated there.

Our customers are tech-savvy folks, but they are also pragmatic. The single biggest impediment to adoption, in my opinion, isn't the desire to move to the cloud, but is instead its maturity. As the next generation of services is made available and stability and reliability go up, that's where you'll see the growth of adoption take off.

Bottom line: Until public cloud providers dramatically improve their product stability and make it a true value add, you won't see the education sector move to it at large scale.

Thursday, November 21, 2013

Preparing for the Cloud

Summary
In this post I'm going to discuss three common pitfalls that you should be wary of when you are planning to put your first application into the cloud. I'll be focusing on Windows Azure, but the points I make are equally applicable to other public cloud providers.

On-Premises Apps vs Cloud Apps
If you're like many people who have not yet put a production application into the cloud, then you may have a lot of questions about what that means. Developing apps for the cloud often uses tools and languages that you're already familiar with. However, the approach that you take to designing your app should be significantly different.

Probably the most significant way in which these two environments differ is the volatility of the services and infrastructure that you can depend on. Keep in mind that when you are running in a public cloud environment, you are running in a distributed, virtualized data center, and are sharing your resources with many other tenants. The effect of this can be quite severe in some cases.

Transient Errors
The cloud is a highly distributed environment. The resources that you depend on will likely not reside on the same host, or even in the same rack or data center, as your app. Much of the infrastructure that supports a cloud provider's resources is highly redundant and configured for high availability. But that doesn't mean that blips in connectivity and availability don't occur. Sometimes those blips can extend long enough to become bleeps - usually from you at 2 a.m. when you get that support call.

An important part of your application design when building for the cloud is transient error handling. What exactly does this mean? At its simplest, it means that you need to add retry logic to your code that interacts with services that are external to your app. Which ones? ALL OF THEM. Candidates for this are caching, database, storage, and other services that are probably critical to your app. This can even mean your own services. Consider your web site's service agents when they call out to that SOAP or REST service to get or update data.

Take database connectivity as an example. Typical on-premises applications assume that the database connection will work when you ask for it. It's often considered a hard failure when you are unable to connect to the database. However, in the cloud, this is a situation that can happen regularly when the environment is under stress. Those connections are valuable, and in order to maintain as much availability of resources as possible, the environment will periodically go through and close connections in the pool. Transient error handling in this case means adding retry logic around opening your connection. A good practice to follow in this case is to employ sliding retry logic. Rather than just retrying over and over immediately after the previous attempt, it is advisable to put logic in to increase the amount of time you wait between retries. A Fibonacci or exponential scale can help. Set a reasonable maximum number of times to retry before your code gives up.
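The sliding retry described above can be sketched in a few lines. This is an illustration under stated assumptions, not a drop-in library: the function names, the default delays, and the choice of `ConnectionError` and `TimeoutError` as the "transient" exceptions are all mine. A real data access library raises its own error types, and you would list those instead.

```python
import time
from typing import Callable, Iterator, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def exponential_delays(base: float, cap: float) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... seconds, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def fibonacci_delays(base: float, cap: float) -> Iterator[float]:
    """Yield base times 1, 1, 2, 3, 5, 8, ... seconds, never exceeding cap."""
    a, b = 1, 1
    while True:
        yield min(a * base, cap)
        a, b = b, a + b


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    delays: Optional[Iterator[float]] = None,
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, waiting longer after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if delays is None:
        delays = exponential_delays(base=0.5, cap=30.0)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(next(delays))
    raise AssertionError("unreachable")
```

You would wrap the call that opens the connection, for example `retry(open_connection, delays=fibonacci_delays(1.0, 30.0))`, where `open_connection` is whatever function your code already uses. Errors that are not in the `transient` tuple are raised immediately, since retrying a bad password or a syntax error only wastes time.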

Service Outages
In spite of the best efforts of cloud providers to provide five 9's of service to you, the reality is something much less. Windows Azure currently runs at a rate of around three 9's of availability. What does this mean to you as the consumer of those services? It means you need to think redundancy. For instance, if your app has a requirement for high availability, you should consider building out your deployments to be in more than one data center. Windows Azure provides tools to help with this. Whether you are using Platform as a Service or Infrastructure as a Service VMs, it is very straightforward to deploy and configure multiple instances of your application functionality in more than one data center. Using a service such as Windows Azure Traffic Manager will allow you to stand up a highly available load balancer in front of your web application or web services quickly and easily. Depending on your needs, you can configure WATM to run in a fail over or round robin mode. The latter allows you to get some benefit out of your backup deployments. Don't forget to factor into your budget the increased cost for additional hosted services that you stand up for redundancy.
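To put those 9's in perspective, here is a small back-of-the-envelope calculation. The function names are made up for this example, and the redundancy formula assumes that deployments fail independently of one another, which is a generous assumption: a sweeping, provider-wide outage breaks it, and the formula also ignores the availability of the load balancer itself.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(availability: float, deployments: int) -> float:
    """Availability of redundant deployments, ASSUMING independent failures."""
    return 1.0 - (1.0 - availability) ** deployments


if __name__ == "__main__":
    for label, value in [("three 9's", 0.999), ("five 9's", 0.99999)]:
        print(label, round(downtime_minutes_per_year(value), 1), "minutes/year")
    print("two data centers:", combined_availability(0.999, 2))
```

Three 9's works out to roughly 526 minutes, or close to nine hours, of downtime per year, while five 9's is a little over five minutes. On paper, two independent deployments at three 9's each do better than five 9's, which is why spreading across data centers is worth the extra hosting cost, even though real-world failures are rarely fully independent.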

Oftentimes, failures in services in the cloud will only affect a portion of your application. This makes it extremely important to be able to fail over that portion of the app. Consider a traditional 3-tier application that has a web UI, services for business logic, and a SQL database. If your services tier experiences issues, it is critical for your web application users that your UI be able to switch to another set of services. Just like using WATM to handle fail over for your web UI, you can use it to fail over for your services as well. Optionally, your application can handle the fail over to a redundant set of resources depending on your implementation. One layer in an application is particularly difficult to fail over due to its statefulness. The resource layer in your app, which is often a SQL database, but can also involve file or other persistence mechanisms, is much more difficult to fail over properly. It often involves multi-write/read logic to properly handle near real time redundancy with logic to identify and manage a master and slave relationship between data providers. This is not a trivial exercise. Microsoft provides Data Sync for SQL Azure. Its window for replication is 5 minutes. This may or may not serve your needs. If you need something more granular than that, be prepared to design for it.
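For the case where the application handles the fail over itself, one possible shape is sketched below. It is a simplified illustration, not how WATM works internally: `FailoverCaller`, `AllEndpointsFailed`, the cooldown value, and the use of `ConnectionError` to signal a failed endpoint are all assumptions made for this example. It covers the stateless services tier only; the stateful resource layer needs the heavier replication logic described above.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint failed (hypothetical)."""


class FailoverCaller:
    """Calls the first healthy endpoint, skipping recently failed ones."""

    def __init__(
        self,
        endpoints: List[str],
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)  # Order expresses preference.
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def _is_healthy(self, endpoint: str) -> bool:
        failed_at = self._failed_at.get(endpoint)
        return failed_at is None or self._clock() - failed_at >= self._cooldown

    def call(self, request: Callable[[str], T]) -> T:
        # Healthy endpoints first (stable sort keeps the preferred order);
        # recently failed ones are still tried as a last resort.
        ordered = sorted(self._endpoints, key=lambda e: not self._is_healthy(e))
        errors = []
        for endpoint in ordered:
            try:
                result = request(endpoint)
            except ConnectionError as exc:
                self._failed_at[endpoint] = self._clock()
                errors.append(exc)
            else:
                self._failed_at.pop(endpoint, None)
                return result
        raise AllEndpointsFailed(f"{len(errors)} endpoint(s) failed")
```

The `request` argument is any function that takes an endpoint and performs the call, so the same caller can sit in front of SOAP or REST service agents. The cooldown keeps the UI from hammering a services tier that just failed, and once it expires the preferred endpoint is tried first again.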

Throttling
Windows Azure and the services you have access to are multi-tenant. Throttling is a way of life in the cloud and is used to ensure that the environment is not overrun with load. IT administrators are very aware of the effects that a VM instance can have on its host when it is allowed to consume too many resources. It usually means resource starvation for other VMs on that host. Throttling limits are set by the cloud provider in their infrastructure in order to limit the effects of resource starvation. They are out of your control. The single most important thing to remember about throttling is that it is meant to protect the cloud provider's resources - not your app. The effects of throttling can be insignificant or they can be debilitating depending on the way your application is designed. In severe cases, resources that you depend on may be temporarily unavailable due to throttling. Just like service outages, throttling can make it appear to your app that a critical service is unavailable or experiencing transient errors. One way to mitigate this is to make the most efficient use of resources possible in this environment and to employ caching wherever possible to minimize the number of trips necessary to expensive resources such as the database.
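As a small illustration of the caching idea, here is a sketch of a read-through cache with a time-to-live. It is an in-process toy under stated assumptions: `TtlCache` and the loader function are hypothetical names, and a multi-instance deployment would more likely use a shared cache service than a dictionary inside each instance.

```python
import time
from typing import Callable, Dict, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TtlCache:
    """Read-through cache: calls the loader only when an entry is missing or stale."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # The expensive call, e.g. a database query.
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # Fresh hit: no trip to the backing resource.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

With a 60-second TTL, a page that reads the same configuration row on every request goes to the database about once a minute per instance instead of once per request. The cost is that readers may see data up to one TTL old, so this suits slow-changing data such as configuration far better than that last transaction your user wants to see.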

Summary
This post provides a high-level view of some of the pitfalls that application developers can run into when deploying their application into the cloud. Follow-on posts will tackle these in detail and look at specific ways to mitigate them.