Is Predictive Analytics for Marketers Really That Accurate?
Perhaps the biggest myth underlying predictive analytics is its ability to accurately predict outcomes. Intuitively, one would think that the level of accuracy would be high, but let's look at what this really means. Suppose we build a response model where the average response rate prior to the use of any predictive model is 2%. If this model were 100% accurate and were deployed against a universe of 100,000 names, we would expect the top 2,000 model-scored names to all be responders (100,000 × 0.02). In reality, the experienced practitioner never observes this; if it did happen, it would raise grave concerns that the model was massively overstated. A more likely scenario is that the model achieves a 6% response rate in the top 10% of model-scored names, meaning that 30% of all observed responders are classified correctly. Certainly 30% is better than the 10% we would expect if names were selected completely at random. Yet the mathematical purist would still be correct in saying, "Capturing 30% of all responders is good, but what about the other 70%?" Indeed, what about that other 70%, and why do so many practitioners live with these results and constraints? The answer, in one word, is "NOISE."
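To make the arithmetic concrete, here is a minimal sketch in plain Python that reproduces the numbers used in the example above (100,000 names, a 2% base rate, and a 6% rate in the top decile); the figures are illustrative, not a real campaign.

```python
# Illustrative numbers from the example above -- assumptions, not real campaign data.
universe = 100_000           # names available for mailing
base_rate = 0.02             # average response rate with no model
total_responders = universe * base_rate              # 2,000 expected responders

top_decile = int(universe * 0.10)                    # top 10% of model-scored names
top_decile_rate = 0.06                               # observed response rate in that decile
responders_captured = top_decile * top_decile_rate   # 600 responders

capture_rate = responders_captured / total_responders  # 0.30 -> 30% of all responders
lift = top_decile_rate / base_rate                     # 3.0x the average response rate

print(f"Responders captured in top decile: {responders_captured:.0f}")
print(f"Capture rate: {capture_rate:.0%}, lift: {lift:.1f}x")
```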
As any mathematician or statistician will tell you, our tools are about explaining trends and patterns and, in effect, reducing the noise. Consider the basic concept of multiple regression, which measures the performance or power of a model by its ability to explain away noise relative to the total noise in the data. In the world of business, and marketing in particular, the ability to truly explain away this noise is severely limited. It is not unusual for a model that explains only 5% of all this noise to be performing well above average when compared to other models.
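For readers who want to see the mechanics, R² is simply the share of the total variation (the "noise") in the outcome that the model accounts for. Below is a minimal sketch using scikit-learn on simulated data; the predictor name and coefficients are assumptions, chosen so that only a few percent of the variation is explainable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Hypothetical marketing data: one weak predictor plus a lot of irreducible noise.
n = 10_000
recency = rng.normal(size=n)              # e.g., standardized months since last purchase
noise = rng.normal(size=n)
outcome = 0.2 * recency + noise           # the signal is small relative to the noise

model = LinearRegression().fit(recency.reshape(-1, 1), outcome)
pred = model.predict(recency.reshape(-1, 1))

# R^2 = 1 - SS_residual / SS_total: the fraction of variation "explained away".
print(f"R^2: {r2_score(outcome, pred):.3f}")   # roughly 0.04 on these assumptions, i.e. ~4%
```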
Why do we accept such seemingly "dismal" statistics: 70% of responders not being accounted for and 95% of the noise not being explained?
The real issue when considering "NOISE" is identifying what is true noise versus noise that can be effectively explained away through a predictive model. Unfortunately, this is not something that a mathematics textbook can explain. It is only by applying models in practice that one observes the huge disparity between the noise that can be explained and the noise that cannot.
In the real world, predictive analytics is often used to predict outcomes where the observed "YES" cases are far outnumbered by the "NO" cases. For instance, response and defection models typically operate under scenarios where the "yes" outcomes occur in fewer than 5 cases out of a hundred. This extreme proportion of "no" outcomes by its very nature creates more noise. A good illustration of this is that a model's ability to accurately predict outcomes improves as the "yes" rate increases toward 50%. In fact, a useful exercise is to build models where the rate is close to 50%: you will observe that the diagnostics describing the power of the model improve when compared to models where the rate is small (5% or less). We attempted this exercise ourselves, building predictive models using the same data fields and the same target variable, which in this case was response. The table below depicts our results.
Response Rate Model

Response Rate of Sample     R²
0.60%                       0.34%
2.76%                       1.13%
5%                          1.55%
50%                         7.24%
As you can see, the above results support our hypothesis that an increasing response rate translates into increased model power or performance, as demonstrated by R².
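The pattern in the table is easy to reproduce on simulated data. The sketch below (scikit-learn; the data fields are made up and are not the data behind the table above) keeps all the responders and retains progressively fewer non-responders, so the same model is refit at higher and higher response rates. The reported R² rises as the rate approaches 50%, although the exact values will differ from ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Simulated analytical file: a few predictors with a weak link to a rare "yes" outcome.
n = 200_000
X = rng.normal(size=(n, 3))
logit = -4.5 + 0.5 * X[:, 0] + 0.3 * X[:, 1]        # base response rate of roughly 1-2%
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def r2_at_rate(target_rate):
    """Keep all responders; sample non-responders to hit the desired response rate."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    n_neg = int(len(pos) * (1 - target_rate) / target_rate)
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    model = LinearRegression().fit(X[keep], y[keep])
    return r2_score(y[keep], model.predict(X[keep]))

for rate in (0.02, 0.05, 0.25, 0.50):
    print(f"response rate {rate:.0%}: R^2 = {r2_at_rate(rate):.3f}")
```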
Fraud models present extreme challenges when building predictive models, as there are often far too many no's relative to yeses. One way of improving model performance is to stratify the sample: the practitioner keeps all the yeses but extracts a random sample of the no's, thereby increasing the overall rate of yeses. Stratified sampling is widely accepted as sound data mining practice when dealing with extremely large proportions of no's.
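As one concrete illustration, here is a common way to do this with pandas: keep every "yes" record and draw a random fraction of the "no" records. The column name and sampling fraction are hypothetical.

```python
import pandas as pd

def stratified_downsample(df: pd.DataFrame, target: str = "is_fraud",
                          neg_fraction: float = 0.05, seed: int = 42) -> pd.DataFrame:
    """Keep all positive cases; keep only a random fraction of the negatives."""
    positives = df[df[target] == 1]
    negatives = df[df[target] == 0].sample(frac=neg_fraction, random_state=seed)
    # Shuffle so the positives and negatives are interleaved in the returned sample.
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)

# Example (hypothetical file): a fraud file with a 0.2% yes rate becomes roughly a
# 4% yes rate after keeping 5% of the no's.
# sample = stratified_downsample(transactions, target="is_fraud", neg_fraction=0.05)
```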
Having discussed the ability to increase overall model performance through increased rates (i.e., response, defection, etc.), we observe that even in our experiment, the model performance or R² peaked at 7.24% with a response rate of 50%. We can easily ask the question: "What about the other 92.7% that the model is not able to explain? Is this true random noise, or can we explain some of it further?" The key to explaining away more noise is to identify variables that exhibit non-linear relationships with the target variable, yet still make sense given the behaviour we are trying to predict and the particular business where these solutions are to be deployed. The use of pure mathematics without an understanding of the business can result in models that rely on non-linear variables which are simply attempting to explain away true random noise.
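A quick way to see this danger is to fit a very flexible non-linear term to a variable that, by construction, has no relationship to the target at all: the development-sample fit improves by a few points of R², but the gain disappears (and often turns negative) on a holdout. A minimal sketch on simulated data follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# A predictor with no real relationship to the target: both are pure noise.
x = rng.normal(size=(600, 1))
y = rng.normal(size=600)

x_dev, x_val, y_dev, y_val = train_test_split(x, y, test_size=0.5, random_state=1)

# A high-degree polynomial "explains" some development-sample noise...
poly = PolynomialFeatures(degree=10)
model = LinearRegression().fit(poly.fit_transform(x_dev), y_dev)

print(f"development R^2: {r2_score(y_dev, model.predict(poly.transform(x_dev))):.3f}")
print(f"validation  R^2: {r2_score(y_val, model.predict(poly.transform(x_val))):.3f}")
# ...but the validation R^2 is roughly zero, and frequently negative.
```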
The notion of having several validation samples, one derived from the same analytical file as the development sample and another derived from a different time period, can certainly help mitigate model overstatement caused by excessive non-linear variables. Yet even with this kind of rigour applied to enhancing model performance, it is unlikely that we can reasonably expect models to explain more than 10% of the target behaviour.
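In practice this usually means scoring the model on two holdouts: one carved out of the same analytical file as the development sample, and one built from a later time period. A minimal sketch of that structure is shown below; the column names and campaign files are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def develop_and_validate(history: pd.DataFrame, out_of_time: pd.DataFrame,
                         features: list, target: str = "responded") -> dict:
    """Fit on the development sample; check an in-time and an out-of-time holdout."""
    dev, holdout = train_test_split(history, test_size=0.3, random_state=7)

    model = LogisticRegression(max_iter=1000).fit(dev[features], dev[target])

    def auc(df):
        return roc_auc_score(df[target], model.predict_proba(df[features])[:, 1])

    # A large drop from development to the out-of-time sample suggests the model is
    # leaning on variables that explained noise specific to one campaign period.
    return {"development": auc(dev),
            "in_time_holdout": auc(holdout),
            "out_of_time": auc(out_of_time)}

# Example (hypothetical campaign files and fields):
# metrics = develop_and_validate(spring_campaign, fall_campaign,
#                                features=["recency", "frequency", "monetary"])
```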
If predictive analytics has such limited power in the marketing arena, why should there be any enthusiasm for its solutions? Setting aside the statistical limitations described above, the huge benefit of predictive analytics is its incremental ROI. It is not unusual to achieve incremental dollar benefits in excess of $100M per campaign.
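The economics behind that claim are back-of-the-envelope arithmetic: a modest lift in response rate, applied at scale, compounds into a large incremental dollar figure. The sketch below uses entirely made-up campaign numbers for illustration.

```python
# Hypothetical campaign economics -- every number here is an assumption for illustration.
mail_quantity = 1_000_000        # names contacted
cost_per_piece = 0.75            # cost to contact one name
value_per_responder = 300.00     # margin from one response

base_rate = 0.02                 # response rate with random selection
model_rate = 0.03                # response rate when the model picks the audience

def profit(rate):
    return mail_quantity * rate * value_per_responder - mail_quantity * cost_per_piece

incremental = profit(model_rate) - profit(base_rate)
print(f"Incremental benefit: ${incremental:,.0f}")   # $3,000,000 on these assumptions
```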
So, as an old business mentor of mine used to say:
“Let’s ground the statistics to the business rather than the business to the statistics.”