Data Mining: 11/01/2011

Organizations are now aware of the significant contribution that predictive analytics makes to the bottom line. Solutions once viewed with skepticism are now embraced by many organizations. Yet, from a practitioner’s perspective, certain solutions are easier to understand than others. The challenge for the practitioner is to communicate with the business end user in such a way that the end user has the requisite degree of knowledge when using the solution. What does this mean? From the business end user’s standpoint, this means two things:

1) The ability to use the solution in order to attain its intended business benefits

2) Understanding the business inputs

The ability to use the solution in order to attain its intended business benefits

The use of decile reports and gains charts has been discussed in many data mining articles and papers, and these reports do provide the information needed to make business decisions when using these solutions. The ability to rank order records with a given solution yields tremendous flexibility in understanding the profit implications under different business scenarios. As businesses increasingly adopt predictive analytics, decile reports and gains charts are being recognized as the critical solution outputs for key business decisions.
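As a rough illustration of what a decile report contains, the sketch below rank orders records by model score, splits them into ten equal groups, and reports the response rate and cumulative share of responders for each group. The data are synthetic, and the function name `decile_report` and its column names are made up for this example rather than taken from any particular tool.

```python
import random

def decile_report(scores, responses, n_bins=10):
    """Rank records by score (highest first), split them into n_bins
    groups, and report response rate and cumulative gains per group."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_responders = sum(r for _, r in ranked)
    rows = []
    cumulative = 0
    for i in range(n_bins):
        # Integer arithmetic keeps the groups as equal in size as possible.
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        responders = sum(r for _, r in chunk)
        cumulative += responders
        rows.append({
            "decile": i + 1,
            "records": len(chunk),
            "responders": responders,
            "response_rate": responders / len(chunk) if chunk else 0.0,
            "cum_pct_responders": cumulative / total_responders if total_responders else 0.0,
        })
    return rows

# Synthetic example (an assumption for illustration): the chance of
# responding rises with the model score.
random.seed(0)
scores = [random.random() for _ in range(1000)]
responses = [1 if random.random() < 0.02 + 0.2 * s else 0 for s in scores]

report = decile_report(scores, responses)
for row in report:
    print(f"{row['decile']:>2}  {row['records']:>4}  "
          f"{row['response_rate']:.3f}  {row['cum_pct_responders']:.3f}")
```

The cumulative column is what a gains chart plots: it shows what share of all responders would be captured by contacting only the top one, two, or three deciles.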

Understanding the business inputs

This is the most challenging area, as it is here that the analyst can really ‘get under the hood of a solution’ and demonstrate what the key business inputs to the model are. Yet, as the world of predictive analytics has evolved, so have the techniques. Highly advanced mathematical techniques using approaches related to artificial intelligence, which can detect non-linear patterns, present real challenges in trying to explain the key components and triggers of a model. What do we mean by this? Suppose a variable like tenure has a tri-modal relationship with response (three mode points within the distribution). As the practitioner, how do I meaningfully explain the trend to the business user, other than by telling him or her to simply rely on the output as presented in the overall equation? In many of these advanced techniques, a number of variables exhibit this high degree of non-linear complexity. With some of these techniques and software, equations are not even the solution output. Instead, the practitioner is presented with output in the form of business rules. Further complicating this scenario is the fact that many of these non-linear variables may interact with each other, so that the interaction between multiple non-linear variables is itself an input to the solution. These types of complex inputs pose real difficulties for the practitioner who is trying to educate end users on what the model or solution comprises. The practitioner’s typical response is that the solution is a ‘BLACK BOX’.

Do I believe the business inputs?

Besides the communication challenge of ‘black box’ solutions, the second challenge is believability. Will the business community really believe that the equations and variables in a black box solution are simply superior at explaining variation to the more traditional linear and log-linear techniques? Mathematically, we may be able to explain how a given input best explains the variation. But can we explain it in a business sense? For example, if there is a linear relationship between tenure and response, it is easy to explain that higher tenured people are more likely to respond, which is why tenure has a positive coefficient within a given model equation. Conversely, assuming a linear relationship between income and response, we may conclude that higher income people are less likely to respond, which explains the negative coefficient of income within the overall model equation. But the impact of tenure on response becomes more difficult to explain if the trend is a curvilinear one with multiple modes within the distribution. Tenure in this case would be captured in some kind of complex polynomial function that is very difficult to explain in business terms. The situation can be even more complicated if the solution is derived from machine learning, as the resulting business rules do not try to explain an overall trend of tenure with response. Instead, the tenure relationship is captured by optimizing its relationship with response at particular points of the distribution.
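The linear case described above can be sketched in a few lines. The example below fits a simple least-squares slope of response (0 or 1) on a single input, which is a linear probability model used here only as a simplified stand-in for a full model equation. The tenure and income bands and their responder counts are invented for illustration, and `slope` and `build_records` are hypothetical helper names.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs (the coefficient b in y = a + b*x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def build_records(responders_per_band, band_size=100):
    """Expand {band: number of responders} into one 0/1 record per person."""
    xs, ys = [], []
    for band, responders in responders_per_band.items():
        xs.extend([band] * band_size)
        ys.extend([1] * responders + [0] * (band_size - responders))
    return xs, ys

# Invented data: response rises steadily with tenure band and falls
# steadily with income band (100 people in each of 10 bands).
tenure_x, tenure_y = build_records({t: 5 + 2 * t for t in range(1, 11)})
income_x, income_y = build_records({b: 30 - 2 * b for b in range(1, 11)})

tenure_coef = slope(tenure_x, tenure_y)
income_coef = slope(income_x, income_y)
print(f"tenure coefficient: {tenure_coef:+.3f}")
print(f"income coefficient: {income_coef:+.3f}")
```

A single signed number per input is what makes this kind of solution easy to explain: each additional tenure band adds about two percentage points of response, and each additional income band removes about two. A multi-modal trend has no such one-line summary.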

With these types of black box solutions, user-driven parameters give the practitioner the flexibility to deliver a variety of different solutions. The practitioner can alter the parameters in such a way as to deliver an almost ‘perfect’ solution, since these high-end mathematical solutions will attempt to explain all the variation, even the variation which is truly random. Of course, this is the fundamental problem with ‘black box’ solutions, and hence the dire need for validation.
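A small sketch can make this concrete. The example below uses a nearest-neighbour classifier, chosen here only as a convenient stand-in for a flexible technique with a user-driven parameter (`k`), and applies it to data where the response is pure noise. The data and helper names are assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training records closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, records, k):
    """Share of records whose response is predicted correctly."""
    hits = sum(1 for x, y in records if knn_predict(train, x, k) == y)
    return hits / len(records)

random.seed(1)
# The response is generated independently of the input, so there is
# no real relationship for any technique to find.
data = [(random.random(), random.randint(0, 1)) for _ in range(400)]
train, valid = data[:200], data[200:]

# With k=1 every training record is its own nearest neighbour, so the
# model simply memorizes the development sample.
train_acc = accuracy(train, train, k=1)
valid_acc = accuracy(train, valid, k=1)
print(f"development sample accuracy: {train_acc:.2f}")
print(f"validation sample accuracy:  {valid_acc:.2f}")
```

On the development sample the solution looks perfect, yet there is nothing real to predict. Only the validation sample reveals that the apparent fit was an explanation of random variation.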

Does it matter that black box solutions are difficult to understand?

Many pundits will state that it does not matter, provided a proper validation environment has been created. There is merit to this argument in that all predictive analytics practitioners are ‘scientists’ at heart, so the use of more advanced techniques to potentially deliver superior solutions makes scientific sense. Yet, putting on my practical business hat, I would argue that these highly advanced techniques are being deployed in environments where the random or error component of variation is very large. The consequence is that the level of variation that can truly be explained is quite small, which is why linear and logistic techniques will in most cases work just as well as the more advanced techniques. Of course, the key to really assessing the superiority of one solution over another is validation and the resulting gains charts, where one can observe how well one solution predicts behaviour compared with another. The notion of a robust validation environment is one area where practitioners and academics are in complete agreement.

Yet this inability to explain the key business inputs can present a barrier to broader acceptance of these tools throughout the organization. Businesses need this so-called ‘deeper’ understanding because of the measurement process used to evaluate the performance of a given solution. This process cannot occur effectively if there is a gap in understanding what went into the solution. Of course, this is not a problem if the solution works perfectly. But what happens when solutions do not work? Under these scenarios, deeper forensics are required to understand what worked and what did not, which certainly implies a deeper understanding of the key business inputs. This so-called comprehension gap in being able to effectively measure black box solutions is the reason why most organizations opt for the simpler linear or logistic type solutions. In this type of environment, the notion of ‘less is more’ is very applicable, since these non black box solutions are more easily understood and, most importantly, can be analyzed in a very detailed manner, particularly when they fail to produce the expected results.

Wednesday, November 16, 2011

Black Box Analytics