Data Mining: 2011

Tuesday, December 20, 2011

Social Media Analytics: Three Perspectives

Analytics within the social media channel can be approached from a number of different perspectives. Three of them are:

network analysis
marketing attribution
text mining

In the area of network analysis, the analytics approach is somewhat similar to what is done within the telecommunication sector. There are often many deliverables related to this kind of analysis within this sector, but one such deliverable is the ability to identify influencers vs. followers. With this kind of knowledge, the telco can perform other analytics such as customer profiling in order to determine the key customer characteristics that differentiate influencers from followers.

This approach can be mimicked in the social media world, as each individual has a network of contacts just as each telco customer has a network of calls that have been made within a given period of time. The key to determining whether an individual is an influencer vs. a follower is examination of both the breadth and depth of that individual’s network. Breadth refers to the number of contacts within the person’s network while depth refers to the level of engagement that the person has with those contacts. Various metrics and indices can be calculated that allow the analyst to create an ‘influencer score’. A component of this type of analytics exercise would be to determine the threshold or benchmark from these metrics/indices that classifies someone as an influencer vs. a follower. This, of course, will vary from exercise to exercise.
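As a rough illustration of the breadth and depth idea, the sketch below scores each individual from a small interaction log. The log itself, the equal weighting of breadth and depth, the scaling by the maximum observed value, and the 0.5 threshold are all assumptions made for illustration; the right metrics and cut-off will vary from exercise to exercise.

```python
from collections import defaultdict

# Hypothetical interaction log: (person, contact, number_of_interactions).
interactions = [
    ("ann", "bob", 12), ("ann", "cy", 9), ("ann", "dee", 15),
    ("bob", "ann", 2),
    ("cy", "ann", 1), ("cy", "dee", 1),
]

def influencer_scores(interactions):
    """Return {person: score}, where the score blends breadth and depth.

    breadth = number of distinct contacts
    depth   = average number of interactions per contact
    Each is scaled by the maximum seen, then averaged (an assumed weighting).
    """
    contacts = defaultdict(set)
    volume = defaultdict(int)
    for person, contact, count in interactions:
        contacts[person].add(contact)
        volume[person] += count

    breadth = {p: len(c) for p, c in contacts.items()}
    depth = {p: volume[p] / breadth[p] for p in contacts}
    max_b, max_d = max(breadth.values()), max(depth.values())
    return {p: 0.5 * breadth[p] / max_b + 0.5 * depth[p] / max_d for p in contacts}

THRESHOLD = 0.5  # assumed cut-off; in practice it is set per exercise

scores = influencer_scores(interactions)
labels = {p: ("influencer" if s >= THRESHOLD else "follower") for p, s in scores.items()}
print(labels)
```

Only people who initiate interactions are scored here; a fuller treatment would also have to decide how to handle inbound contacts.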

The second area, marketing attribution, attempts to determine how much ROI from a given marketing campaign can be attributed to social media. There may be no right answer or panacea to this challenge. Unless there is some direct method whereby a given person logs onto a certain web page and registers for a given product or service within the social media campaign, it is very difficult to determine the direct dollars that can be attributed to social media. Yet, even if this functionality exists, one could argue that the non-registrants participating in some fanbook contest might be contributing to ROI by purchasing at a store, through TV or direct mail. These same kinds of challenges exist for the mass advertising industry, except that the key advantage with social media is that we do have the denominator: the number of people that clicked onto some fan page or contest. Given these limitations, all analysts can do is provide general direction on whether the campaign created additional engagement and whether or not this additional engagement translated into additional dollars. On the engagement side, metrics such as the number of people logging onto a fan page or contest, plus how long they are on the site, can serve as a proxy, and these metrics can be compared with those of other social media campaigns to determine the level of success. By the same token, anecdotal analysis of social media campaigns over time can indicate the overall impact on incremental sales. Classification of campaign periods into high, medium, and low social media marketing allows executives to see the general impact of social media marketing on sales. However, one cannot directly attribute the ROI back to a specific social media campaign. All we can say is that the campaign generated an improvement in engagement relative to other campaigns. The leap of faith for executives is that this improvement in engagement translates to incremental dollars, but we don’t know the precise number.
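The classification of periods into high, medium, and low social media marketing can be sketched as follows. The monthly activity index and sales figures are made up, and splitting the periods into equal thirds is just one assumed way of defining the three levels. The output shows association only; it does not attribute sales to any one campaign.

```python
from statistics import mean

# Hypothetical periods: (label, social_media_activity_index, sales).
periods = [
    ("Jan", 10, 100), ("Feb", 12, 104), ("Mar", 30, 110),
    ("Apr", 35, 118), ("May", 60, 131), ("Jun", 65, 127),
]

def sales_by_activity_level(periods):
    """Split periods into low/medium/high thirds by activity and average the sales."""
    ranked = sorted(periods, key=lambda p: p[1])
    n = len(ranked)
    groups = {"low": [], "medium": [], "high": []}
    for i, (_, _, sales) in enumerate(ranked):
        level = ("low", "medium", "high")[min(3 * i // n, 2)]
        groups[level].append(sales)
    return {level: mean(values) for level, values in groups.items() if values}

summary = sales_by_activity_level(periods)
print(summary)
```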

The third area, text mining, presents the most exciting opportunities for marketers in being able to actually analyze what is being said within this medium. The ability to identify key themes or topics as well as sentiment allows marketers to craft communication strategies that better address what is being discussed within this space. At an even higher level, brand strategies can be created that better reflect the needs and wants of a brand’s core customers. As analytics practitioners, most of our deliverables end up producing marketing solutions that provide better targeting of customers. The traditional deliverables of analytics have always yielded solutions that are of a more tactical nature. Text mining in the social media space offers the analyst opportunities to build solutions of a more strategic nature, which ultimately will have a higher profile within any organization. The strategic nature of these types of solutions just serves to reinforce the ever-growing importance of analytics as a key business discipline and, more importantly, a key competitive advantage within a very dynamic business environment.

Wednesday, November 16, 2011

Black Box Analytics

Organizations are now aware of the significant contribution that predictive analytics brings to the bottom line. Solutions, once viewed with skepticism and doubt, are now embraced by many organizations. Yet, from a practitioner’s perspective, certain solutions are easier to understand than others. The challenge for the practitioner is to provide a level of communication to the business end user such that the end user has the requisite degree of knowledge when using the solution. What does this mean? From the business end user’s standpoint, this means two things:

1) The ability to use the solution in order to attain its intended business benefits

2) Understanding the Business Inputs

The ability to use the solution in order to attain its intended business benefits

The use of decile reports and gains charts has been discussed in many data mining articles and papers, and these reports do provide the necessary information for making business decisions when using these solutions. The ability to rank order records with a given solution yields tremendous flexibility in understanding the profit implications under certain business scenarios. As businesses increasingly adopt predictive analytics, decile reports and gains charts are being recognized as the critical solution outputs in providing the necessary information for key business decisions.
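For readers who have not built one, here is a minimal sketch of a decile report with the cumulative figures that feed a gains chart. The scores and responses are made up, and the function name and column names are my own choices for illustration rather than any standard.

```python
def decile_report(scored, n_groups=10):
    """scored: list of (model_score, responded) pairs, responded in {0, 1}.

    Returns one row per group (best scores first) with the group's response
    rate and the cumulative share of all responders captured so far.
    """
    ranked = sorted(scored, key=lambda r: r[0], reverse=True)
    total_responders = sum(r for _, r in ranked)
    n = len(ranked)
    rows, captured = [], 0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        responders = sum(r for _, r in group)
        captured += responders
        rows.append({
            "decile": g + 1,
            "records": len(group),
            "response_rate": responders / len(group) if group else 0.0,
            "cum_pct_responders": captured / total_responders if total_responders else 0.0,
        })
    return rows

# Hypothetical validation sample: 20 records with scores 1.00, 0.95, ..., 0.05.
responses = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scored = [((20 - i) / 20, r) for i, r in enumerate(responses)]

report = decile_report(scored)
for row in report:
    print(row)
```

In this made-up sample the top two deciles (20% of records) capture half of all responders, which is exactly the kind of reading that supports a decision on how deep to go into a file.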

Understanding the business inputs

This area represents the most challenging area as it is here that the analyst can really ‘get under the hood of a solution’ and actually demonstrate the key business inputs to the model. Yet, as the world of predictive analytics has evolved, so have the techniques. Highly advanced mathematical techniques using approaches related to artificial intelligence, which imply the ability to detect non-linear patterns, present real challenges in trying to explain the key components and triggers of a model. What do we mean by this? Suppose a variable like tenure has a tri-modal relationship with response (three mode points within the distribution). As the practitioner, how do I meaningfully explain the trend to the business user other than telling him or her to simply rely on the output as presented in the overall equation? Yet in many of these advanced techniques, there are a number of variables which exhibit this high degree of non-linear complexity. With some of these advanced techniques and software, equations are not even the solution output. Instead, the practitioner is presented with output in the form of business rules. Further complicating this scenario is the fact that many of these non-linear variables may interact with each other, whereby the interaction between multiple non-linear variables represents an actual input to the solution. These types of complex inputs pose real difficulties to the practitioner who is trying to educate the end users on what the model or solution comprises. The practitioner’s typical response is that the solution is a ‘BLACK BOX’.

Do I believe the business inputs?

Yet, besides the communication challenge of ‘black box’ solutions, the second challenge is believability. Will the business community really believe that the equations and variables seen in a black box solution are simply superior in explaining variation to those of the more traditional linear and log-linear techniques? Mathematically, we may be able to explain how a given input best explains the variation. But can we explain it in a business sense? For example, if there is a linear relationship between tenure and response, it is far easier to explain that higher-tenured people are more likely to respond, thereby explaining why tenure has a positive coefficient within a given model equation. Conversely, assuming a linear relationship between income and response, we may conclude that higher-income people are less likely to respond, thereby explaining the negative coefficient of income within the overall model equation. But the impact of tenure on response becomes more difficult to explain if the trend is a curvilinear one with multiple modes within the distribution. Tenure in this case would be captured in some kind of complex polynomial function that is very difficult to explain in business terms. This situation can be even more complicated if the solution is derived from machine learning, as the resulting business rules do not try to explain an overall trend of tenure with response. Instead, the tenure relationship is captured by attempting to optimize its relationship with response along particular points of a distribution.

With these types of black box solutions, user-driven parameters provide the necessary flexibility to the practitioner in delivering a variety of different solutions. The practitioner can alter the parameters in such a way as to deliver an almost ‘perfect’ solution, since these high-end mathematical solutions will attempt to explain all the variation, even the variation which is truly random. Of course, this is the fundamental problem with ‘black box’ solutions and hence the dire need for validation.
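To make the point about ‘perfect’ solutions concrete, here is a small sketch using purely random data. A 1-nearest-neighbour rule stands in for a highly flexible technique (it is my stand-in for illustration, not any particular black box product), and the record counts and seed are arbitrary. The model reproduces the build sample exactly, yet on a validation sample it does no better than a coin toss.

```python
import random

random.seed(0)

def make_records(n):
    """Purely random data: the 'tenure' value carries no information about response."""
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

train, holdout = make_records(500), make_records(1000)

def predict_nearest(x, training):
    """1-nearest-neighbour: copy the response of the closest training record."""
    return min(training, key=lambda rec: abs(rec[0] - x))[1]

def accuracy(records, training):
    """Share of records whose response matches the nearest-neighbour prediction."""
    hits = sum(predict_nearest(x, training) == y for x, y in records)
    return hits / len(records)

train_accuracy = accuracy(train, train)      # the model has memorised the noise
holdout_accuracy = accuracy(holdout, train)  # should hover near chance (about 0.5)

print(f"build sample: {train_accuracy:.2f}, validation sample: {holdout_accuracy:.2f}")
```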

Does it matter that black box solutions are difficult to understand?

Many pundits will state that it does not matter if a proper validation environment has been created. There is merit to this argument in that all predictive analytics practitioners are ‘scientists’ at heart. So the use of more advanced techniques to potentially deliver superior solutions just makes scientific sense. Yet, putting on my practical business hat, I would argue that these highly advanced techniques are being deployed in environments where the random or error component of variation is very large. The consequence of this is that the level of variation that can be truly explained is quite small, which is the reason why linear and logistic techniques in most cases will work just as well as the more advanced techniques. Of course, the key in really assessing the superiority of one solution over another is validation and the resulting gains charts, where one can observe how well one solution predicts behaviour compared with another. The notion of a robust validation environment is one area where practitioners and academics are in complete agreement.

Yet, this inability to explain the key business inputs can present a barrier to broader acceptance of these tools throughout the organization. Businesses need to have this so-called ‘deeper’ understanding because of the measurement process used to evaluate the performance of a given solution. This process cannot occur effectively if there is a gap in understanding what went into the solution. Of course, this is not a problem if the solution works perfectly. But what happens when solutions do not work? Under these scenarios, deeper forensics are required to understand what worked and what did not, which certainly implies a deeper understanding of the key business inputs. This so-called comprehension gap in being able to effectively measure black box solutions is the reason why most organizations opt for the simpler linear or logistic type solutions. In this type of environment, the notion of ‘less is more’ is very applicable, since these non-black-box solutions are more easily understood and, most importantly, can be analyzed in a very detailed manner, particularly when they fail to produce the expected results.

Monday, October 3, 2011

The ABC's of Analytics

In our education system, we often look at the two core disciplines of reading and mathematics as being essential requirements in any successful career. Yet in both these disciplines, there are basic technical fundamentals that need to be mastered before one can be successful.

In the area of mathematics, one needs to master the fundamentals of addition, subtraction, multiplication, and division. In reading, the student needs to master the alphabet or ABC’s before he or she can even attempt to read a story.

The reason that I bring up this notion is that this type of process for mastering a particular discipline is no different from what is required to master the world of analytics. In our world of analytics, our ABC’s are DATA, DATA, and DATA. Analytics practitioners must not only have a conceptual understanding of data but also a ‘roll up your sleeves’ approach towards data. The ‘roll up your sleeves’ approach affords the analyst a much deeper understanding of the data, as he/she is then better able to handle all the detailed nuances which are a fact of life in the data world. We have all heard the phrase ‘The Devil is in the Details’. Since data is detail, this phrase is easily transformed into the phrase that I often use with clients: ‘The Devil is in the Data’.

In any environment, be it the digital or offline arena, analysts need to have a process for better understanding data. How is this done? The first step is the so-called data audit, which will be the focus of this discussion. The discussion of data and its importance in data mining and analytics does not end here, since a full-blown, exhaustive discussion could easily fill a book.

Regardless of whether the data represents new information that is unfamiliar to the analyst or a familiar existing data source, a data audit needs to take place that allows us to better understand the data at a detailed level. Let’s talk about the first situation where the data represents new information.

The first task in the data audit process is to obtain an initial glimpse of the data by observing a random set of 10 records and all their accompanying fields from a given file.

This simple task accomplishes two objectives. First, it confirms that we have loaded the data correctly into the analytics software that we are using. Secondly, we begin to develop our initial ‘feel’ for the data environment. In other words, are we dealing with a very simple data environment or one that is quite extensive? Looking at the number of files as well as the number of fields on each file, along with examples of the values of these fields, begins the analyst’s quest towards a very deep, detailed understanding of the data environment.
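The initial glimpse can be as simple as the sketch below, which lists the field names and prints 10 randomly chosen records. The file contents, field names, and the `peek` helper are hypothetical; an in-memory string stands in for a real file so that the example is self-contained.

```python
import csv
import io
import random

def peek(csv_text, k=10, seed=None):
    """Return the field names and up to k randomly chosen records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = reader.fieldnames or []
    rng = random.Random(seed)
    return fields, rng.sample(rows, min(k, len(rows)))

# Hypothetical customer file with 25 records.
csv_text = "customer_id,tenure_months,province\n" + "".join(
    f"{i},{(i * 7) % 60},{'ON' if i % 2 else 'QC'}\n" for i in range(1, 26)
)

fields, sample = peek(csv_text, k=10, seed=1)
print(fields)
for record in sample:
    print(record)
```

If the fields come back misaligned or the values look wrong for their columns, the load itself is the first thing to revisit.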

After this initial glimpse, further data diagnostics are conducted which allow the analyst to determine the number of unique values and number of missing values for each field on each file that is received by the analyst. Further diagnostics such as frequency distributions allow the analyst to better understand the distribution of values within a given field.
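These diagnostics can be sketched in a few lines. The records below are made up, and treating both None and the empty string as missing is an assumption; each data source will have its own conventions for missing values.

```python
from collections import Counter

# Hypothetical records; None and "" are both treated as missing (an assumption).
records = [
    {"customer_id": 1, "province": "ON", "income": 52000},
    {"customer_id": 2, "province": "QC", "income": None},
    {"customer_id": 3, "province": "ON", "income": 61000},
    {"customer_id": 4, "province": "",   "income": 52000},
]

def audit(records):
    """Per field: number of unique non-missing values, number missing, frequencies."""
    fields = sorted({name for rec in records for name in rec})
    report = {}
    for name in fields:
        values = [rec.get(name) for rec in records]
        present = [v for v in values if v not in (None, "")]
        report[name] = {
            "unique": len(set(present)),
            "missing": len(values) - len(present),
            "frequencies": Counter(present).most_common(),
        }
    return report

report = audit(records)
for name, stats in report.items():
    print(name, stats)
```

For a field with many distinct values, such as income, the frequency distribution would normally be reported in ranges rather than value by value.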

With the information from these data audit reports, the analyst can then begin to determine how to create the analytical file, which essentially represents key information in the form of analytical variables. The objective in this exercise is to have all meaningful information at the appropriate record level where any proposed solution will be actioned. In most cases, this represents the customer or individual but is not necessarily confined to that level. For example, pricing models for auto insurance are actioned at the vehicle level while retail analytics might focus on decisions that are actioned at the store level. More about how data and information are used to create this all-important analytical file will be discussed in future blogs.
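As a small preview of what ‘the appropriate record level’ means in practice, the sketch below rolls transaction records up to one row per customer. The transactions and the four analytical variables are invented for illustration; which variables are actually meaningful depends on the business problem.

```python
from collections import defaultdict

# Hypothetical transaction-level records: (customer_id, month, amount).
transactions = [
    (101, "2011-09", 40.0), (101, "2011-10", 60.0),
    (102, "2011-10", 25.0),
    (101, "2011-11", 20.0),
]

def build_analytical_file(transactions):
    """Roll transactions up to one row of analytical variables per customer."""
    by_customer = defaultdict(list)
    for customer_id, month, amount in transactions:
        by_customer[customer_id].append((month, amount))

    rows = {}
    for customer_id, items in by_customer.items():
        amounts = [amount for _, amount in items]
        rows[customer_id] = {
            "num_purchases": len(items),
            "total_spend": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            "last_purchase_month": max(month for month, _ in items),
        }
    return rows

analytical_file = build_analytical_file(transactions)
for customer_id, variables in analytical_file.items():
    print(customer_id, variables)
```

The same roll-up would be keyed on vehicle or store instead of customer when the solution is actioned at that level.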

The data audit process should be conducted not only on new information or data sources but also on known data sources that serve as updated or refreshed information for a given analytical solution. Assuming that the refreshed or updated data is fine because you went through the data audit process when you first received the data source reminds me of a high school teacher’s comments about the word ASSUME: using the word ASSUME means that U make an ASS out of ME. This phrase has extreme relevance in the data world in that any data (new or updated) that is being used to generate business solutions needs to have some checks and controls. Although the data audit process used in processing a new data source can be quite comprehensive, a less rigorous process may be used when assessing refreshed or updated data. This may be as simple as checking the number of records and fields that are received as well as some stock audit reports that look at means or averages of key variables which are unique for a given client. In any event, the purpose of even doing a shortened version of this data audit report is to establish a means of identifying problems or issues with the data that is currently being used by the analyst.
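A shortened audit of this kind might look like the sketch below, which compares a refreshed extract against the original on record count, field list, and the means of key variables. The 10% tolerance, the field names, and the data are assumptions for illustration; sensible tolerances depend on the client and the variable.

```python
from statistics import mean

def refresh_check(baseline, refresh, key_fields, tolerance=0.10):
    """Compare a refreshed extract with the baseline and list anything that looks off.

    baseline, refresh: lists of dicts (one per record).
    key_fields: numeric fields whose means are compared.
    tolerance: assumed allowable relative shift in record count or mean.
    """
    issues = []
    if abs(len(refresh) - len(baseline)) > tolerance * len(baseline):
        issues.append(f"record count changed: {len(baseline)} -> {len(refresh)}")

    base_fields = set().union(*(rec.keys() for rec in baseline))
    new_fields = set().union(*(rec.keys() for rec in refresh))
    if base_fields != new_fields:
        issues.append(f"fields changed: {sorted(base_fields ^ new_fields)}")

    for field in key_fields:
        old = [rec[field] for rec in baseline if rec.get(field) is not None]
        new = [rec[field] for rec in refresh if rec.get(field) is not None]
        if not old or not new:
            issues.append(f"{field}: no values to compare")
        elif abs(mean(new) - mean(old)) > tolerance * abs(mean(old)):
            issues.append(f"{field}: mean shifted {mean(old):.1f} -> {mean(new):.1f}")
    return issues

# Hypothetical extracts: the refresh has slightly fewer records but much higher spend.
baseline = [{"id": i, "spend": 100.0} for i in range(100)]
refresh = [{"id": i, "spend": 150.0} for i in range(98)]

issues = refresh_check(baseline, refresh, key_fields=["spend"])
print(issues)
```

An empty list does not prove the refresh is clean; it only means none of these stock checks was tripped.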

Data audits, despite producing the least glamorous type of reporting information, are necessary prerequisites in any analytics process. If one believes that all analysis starts with data, then one must exercise extreme diligence and respect for the data. This kind of attitude towards data helps to foster an appreciation of the many kinds of data nuances that can appear in a project. If analytics is going to be successful, data audits represent the first critical task within any given project. Future discussions will outline how this analytical file is created once the data audits are completed.