Friday, April 20, 2012

Preparing for Predictive Analytics - Data is the Key

The Basics

As I mentioned in my last post, I’ll be making a series of posts on some of the challenges you will face when embarking on a predictive analytics project. In this post, I’m going to focus on something that may be obvious to most but has frequently proven to be a challenge for the customers we have worked with: having ready access to the required data.

Your organization’s data is the key to a successful predictive analytics project. Quality historical data is required to build models that let you make predictions about the likelihood of some event or behavior. Not all data is relevant in every modeling scenario, but generally the more information you have, the better. The modeling exercise will weed the noise out from the signal, and some modeling techniques are better than others at dealing with that noise. Familiarity with the capabilities of your tool is very important in this context.
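
To make that concrete, here’s a minimal sketch in Python (assuming scikit-learn and NumPy rather than any particular commercial tool) of how an L1-regularized model drives the coefficients of pure-noise columns toward zero while keeping the informative ones.

```python
# Minimal sketch: an L1-regularized model "weeding out" noise columns.
# Data and column layout are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Two informative columns plus eight columns of pure noise.
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 8))
X = np.hstack([informative, noise])
y = (informative[:, 0] + 0.5 * informative[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

# The first two coefficients come out clearly non-zero; most of the
# remaining eight land at or near zero, i.e. the noise gets weeded out.
print(np.round(model.coef_, 2))
```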

Cataloging Your Data

How many of you know all the different kinds of data used in your company? My experience has been that most organizations have a lot of data in silos that are not well documented and certainly not well integrated with other corporate data. This data can range from duplicate customer data, to sales data, to communication data such as email logs, marketing data, or other data needed to GSD (Get Stuff Done). Master data management projects can help your organization centralize, de-duplicate, and cleanse that data, but the reality is that these projects are very complex and can take years to complete. At a minimum, it would be helpful to begin with a data cataloging project to get your arms around the data your organization has. Start with the basics: identify the types of data you have and who is responsible for maintaining it, and make note of where the data lives, i.e., what tool or platform the system was built on. Besides laying the groundwork for a master data management project down the road, this will be extremely valuable in your predictive analytics projects because it tells you where to go to get the information you want to model.
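
To make the idea tangible, here’s a minimal sketch in Python of what a catalog entry might capture. The fields and example systems are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of a data catalog entry; field names and example
# systems are hypothetical, not a standard.
from dataclasses import dataclass, asdict

@dataclass
class CatalogEntry:
    name: str        # what the data set is, e.g. "Customer master"
    data_type: str   # customer, sales, communication, marketing, ...
    owner: str       # who is responsible for maintaining it
    system: str      # where the data lives
    platform: str    # what tool or platform the system was built on

catalog = [
    CatalogEntry("Customer master", "customer", "Sales Ops", "CRM", "SQL Server"),
    CatalogEntry("Order history", "sales", "Finance", "ERP", "Oracle"),
    CatalogEntry("Email campaign logs", "communication", "Marketing", "ESP export", "CSV files"),
]

for entry in catalog:
    print(asdict(entry))
```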

Does It Matter How the Data is Persisted?

The answer is highly dependent on the tool or platform you are using to create your models and then score the data. If you are using a commercial tool, your choices are limited by what that tool needs. Generally, the best answer is to bring the data together into a consistent storage medium. Whether that’s a relational database, a data warehouse, flat files, or XML files, the important thing is that the data can be accessed and interrogated as a set. Efficient set-based operations are critical to the performance of your analytics solution. Many of the modeling activities will involve slicing, dicing, counting, aggregating, and transforming your data set in numerous ways, and this can be a very slow process if your data is not stored in a way that supports those kinds of activities.
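
Here’s a minimal sketch in Python (assuming pandas as a stand-in for whatever your tool provides) of the kind of set-based slicing, counting, aggregating, and transforming that a consolidated data set makes cheap.

```python
# Minimal sketch of set-based operations on a consolidated data set.
# The sales data here is made up for illustration.
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Slice: just the West region.
west = sales[sales["region"] == "West"]
print(west)

# Aggregate: totals and counts per region and product in one set-based pass.
summary = sales.groupby(["region", "product"])["amount"].agg(["sum", "count"])
print(summary)

# Transform: derive a new column for later modeling.
sales["log_amount"] = np.log(sales["amount"])
```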

Analytics Repositories

My recommendation is that, whenever possible, you should collect your data into a centralized analytics repository. With smaller data sets this is much more approachable and is the most common approach. With very large-scale enterprise data, however, it can be expensive, time-consuming, and impractical: the time required to update the repository can make timely model analysis impossible, especially if you are trying to model transactional data that is constantly being added or changed.
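
For the smaller-scale case, here’s a minimal sketch in Python (using the built-in sqlite3 module; the table and column names are illustrative assumptions) of pulling two departmental extracts into one small analytics repository and querying them as a set.

```python
# Minimal sketch of a small centralized analytics repository.
# Tables, columns, and rows are hypothetical examples.
import sqlite3

conn = sqlite3.connect("analytics_repo.db")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER PRIMARY KEY, segment TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

# Load extracts from two source systems into the repository.
cur.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)",
                [(1, "retail"), (2, "wholesale")])
cur.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                [(10, 1, 120.0), (11, 1, 80.0), (12, 2, 200.0)])
conn.commit()

# Once the data is in one place, modeling questions become simple set-based queries.
for row in cur.execute("""
        SELECT c.segment, COUNT(o.order_id), SUM(o.amount)
        FROM customers c JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.segment"""):
    print(row)

conn.close()
```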

Wrapping Up

In my next post, I’ll discuss two different approaches to designing analytics repositories that address these two scenarios. The first is to use a simpler relational repository to store the data set. The second is to use a virtualized, metadata-driven repository, which can be extremely useful in larger-scale enterprise settings.