Monday, August 11, 2014

In Search of a Highly Available Persistence Solution

Our cloud-based SaaS offerings are hosted in Azure. Like many applications, some of our apps began life in the era before cloud computing became mainstream. As a result many design decisions had to be rethought. Some involved some rather extreme makeovers just to be able to run there – like removal of CLR stored procedures (gah!) since SQL Azure didn’t support it. Others were more fundamental to multi-tenant apps, but required change nonetheless.

Along the way, we have made numerous changes that brought higher performance, stability, scalability and reliability to the products. Frankly, the Azure compute infrastructure for PaaS is excellent. Combined with Windows Azure Traffic Manager our compute has performed admirably. Scaling is a snap. Deployment is beyond easy. Then we hit the snag. Persistence. State sucks. Compute is easy because it can be treated statelessly; but the persistence layer is another story. If your database goes belly up, you’re dead. Having an offline copy of your database could help, right? But then you think, “How recently did you make that copy?” Or, “What about data loss between backups? My user really wants to see that last transaction!” And once your original database is back online, you ask yourself, “How do you sync up the changes?” Wouldn’t it be great if that copy was a transactionally-consistent copy?

SQL Azure touts high availability in the datacenter through its replicas. Every database is actually implemented as a master and two replicas. Their stated goal is 99.9% availability. On paper, you might think that having two replicas of your data would be sufficient to give you high availability. Our experience is that it is not. They do not offer a cross-datacenter high availability option. In recent conversations with Microsoft, surprisingly they are pointing people towards SQL Server in IaaS for high availability. Their Tutorial: Database Mirroring for High Availability in Azure further supports this view. C’mon guys – you can do better than this! This is a cop out. I want a scalable, highly available, cloud-ready repository.

Edit: The new database SKUs for Azure SQL Database (basic, standard, and premium) maintain “at least three replicas”. Please refer to this overview of the new SKUs for more information. I want to point out that our current product deployments are using the GA SKUs (Web and Business) and not the new SKUs outlined in this overview.

Time for another design check. In reality, what your application must do to get closer to the magical 5 9’s of availability is to move replication logic into the application. If you’re contemplating more than one form of persistence (we are) and if you want high availability (we do), then be prepared to roll your own. A number of architectural design sessions and a review of the options available to the Azure ecosystem makes it clear that they don’t have a complete answer here. It’s not entirely surprising. They give you the primitive infrastructure and framework bits and you do the rest. In this case the platform affords you nothing leaving you to do the rest a.k.a. everything else…

Moreover, if you intend to be a persistence polyglot, your options are complicated. How many of you put everything in the database? Does that image really belong in there? How about those PDFs? How about your application configuration and set up information? Couldn’t blob storage do a much better job of saving that binary data? Couldn’t you use a document database to store that configuration data? Modern enterprise applications have a very diverse set of data and using a relational database for all of it means you’re fitting square pegs in round holes.

A number of design patterns can help you to abstract your application business logic from how you persist your data. You *are* using design patterns, aren’t you? A simple pattern to use is the Repository pattern. Done properly, you can completely hide the complexities of how to persist your data completely from the rest of your application leaving you the option of doing synchronous replication to one or more replica copies of your data, which can help to ensure you have a transactionally-consistent copy of your data. Or you may choose to use asynchronous replication for higher performance at the risk of some small amount of data loss. Both approaches require your application to detect failures and to switch from the master to the replica copy.

Other more complex design patterns can help you with this problem at the expense of complexity.Simplicity is king in software design. Embarking on an implementation of CQRS may leave you wondering why you started in a career of software development. But it can also leave you with a superior solution to a complex problem. Moreover, by implementing this in your application, you can decouple yourself from a platform and its requirements for high availability. Like everything in software development – it’s a balance of choices. Flexibility, simplicity, maintainability, speed to market, performance, etc.

Like some on my team are fond of saying, “If it were easy, everyone would play the game.”

No comments:

Post a Comment