In our education system, we often regard the two core disciplines of reading and mathematics as essential requirements for any successful career. Yet in both disciplines, there are basic technical fundamentals that must be mastered before one can succeed.
In mathematics, one needs to master the fundamentals of addition, subtraction, multiplication, and division. In reading, the student needs to master the alphabet, or ABCs, before he or she can even attempt to read a story.
The reason I bring up this notion is that this process of mastering a particular discipline is no different when attempting to master the world of analytics. In our world of analytics, our ABCs are data, data, and data. Analytics practitioners must have not only a conceptual understanding of data but also a ‘roll up your sleeves’ approach towards it. The ‘roll up your sleeves’ approach affords the analyst a much deeper understanding of the data, as he or she is then better able to handle all the detailed nuances that are a fact of life in the data world. We have all heard the phrase ‘The Devil is in the Details’. Since data is detail, this phrase is easily transformed into the one I often use with clients: ‘The Devil is in the Data’.
In any environment, be it the digital or offline arena, analysts need a process for better understanding data. How is this done? The first step is the so-called data audit, which will be the focus of this discussion. The discussion of data and its importance in data mining and analytics does not end here, since a truly exhaustive treatment could easily fill a book.
Regardless of whether the data represents new information that is unfamiliar to the analyst or a familiar existing data source, a data audit process needs to take place that allows us to understand the data at a detailed level. Let’s talk about the first situation, where the data represents new information.
The first task in the data audit process is to obtain an initial glimpse of the data by observing a random set of 10 records and all their accompanying fields from a given file.
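As a simple illustration, a sketch of this first glimpse in Python using pandas might look like the following. The file name and the choice of pandas are assumptions for the example, not requirements of the process.

```python
import pandas as pd

# Load the file into the analytics environment.
# "customer_file.csv" is a hypothetical file name.
df = pd.read_csv("customer_file.csv")

# Pull a random set of 10 records with all of their accompanying fields.
sample = df.sample(n=10, random_state=42)
print(sample.to_string())
```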
This simple task accomplishes two objectives. First, it confirms that we have loaded the data correctly into the analytics software application we are using. Second, we begin to develop our initial ‘feel’ for the data environment. In other words, are we dealing with a very simple data environment or one that is quite extensive? Looking at the number of files, the number of fields on each file, and examples of the values of those fields begins the analyst’s quest towards a very deep and detailed understanding of the data environment.
After this initial glimpse, further data diagnostics are conducted that allow the analyst to determine the number of unique values and the number of missing values for each field on each file received. Further diagnostics, such as frequency distributions, allow the analyst to better understand the distribution of values within a given field.
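Continuing the pandas sketch above, these diagnostics might look something like this; the ‘region’ column is purely hypothetical.

```python
import pandas as pd

df = pd.read_csv("customer_file.csv")  # hypothetical file name, as above

# For each field: the number of unique values and the number of missing values.
diagnostics = pd.DataFrame({
    "unique_values": df.nunique(),
    "missing_values": df.isna().sum(),
})
print(diagnostics)

# A frequency distribution for a single field shows how its values
# are distributed, including any missing entries.
print(df["region"].value_counts(dropna=False))
```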
With the information from these data audit reports, the analyst can then begin to determine how to create the analytical file, which essentially represents key information in the form of analytical variables. The objective in this exercise is to have all meaningful information at the appropriate record level, that is, the level at which any proposed solution will be actioned. In most cases this is the customer or individual, but it is not necessarily confined to that level. For example, pricing models for auto insurance are actioned at the vehicle level, while retail analytics might focus on decisions that are actioned at the store level. More about how data and information are used to create this all-important analytical file will be discussed in future blogs.
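As a hypothetical sketch of what rolling information up to the actionable record level might look like, the example below summarizes assumed transaction-level data to the customer level; every file and column name here is an assumption made for illustration.

```python
import pandas as pd

# Assumed transaction-level data: one row per transaction.
transactions = pd.read_csv("transactions.csv")

# Roll the detail up to the customer level, the record level at which
# a customer-focused solution would be actioned, creating analytical
# variables along the way.
analytical_file = (
    transactions.groupby("customer_id")
    .agg(
        num_purchases=("transaction_id", "count"),
        total_spend=("amount", "sum"),
        last_purchase=("purchase_date", "max"),
    )
    .reset_index()
)
```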
The data audit process should be conducted not only on new data sources but also on known data sources that serve as updated or refreshed information for a given analytical solution. Assuming that refreshed or updated data is fine because you went through the data audit process when you first received the source reminds me of a high school teacher’s comment about the word ASSUME: to ASSUME makes an ASS out of U and ME. This phrase has extreme relevance in the data world, in that any data (new or updated) that is being used to generate business solutions needs some checks and controls. Although the data audit process used for a new data source can be quite comprehensive, a less rigorous process may be used when assessing refreshed or updated data. This may be as simple as checking the number of records and fields received, along with some stock audit reports that look at means or averages of key variables that are unique to a given client. In any event, the purpose of even a shortened version of this data audit report is to establish a means of identifying problems or issues with the data currently being used by the analyst.
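A lightweight refresh check of this kind might be sketched as follows; the tolerance threshold, file names, and field names are illustrative assumptions, not prescribed values.

```python
import pandas as pd

def audit_refresh(new_df, expected_columns, baseline_means, tolerance=0.10):
    """Shortened audit for refreshed data: verify the record count and
    field list, and flag key variables whose means have drifted beyond
    an illustrative 10% tolerance."""
    issues = []
    if len(new_df) == 0:
        issues.append("refresh file contains no records")
    if list(new_df.columns) != list(expected_columns):
        issues.append("field list differs from the original data source")
    for field, baseline in baseline_means.items():
        current = new_df[field].mean()
        if abs(current - baseline) > tolerance * abs(baseline):
            issues.append(
                f"mean of {field} shifted from {baseline:.2f} to {current:.2f}"
            )
    return issues

# Hypothetical usage: compare this month's refresh against stored baselines.
refresh = pd.read_csv("customer_file_refresh.csv")
problems = audit_refresh(
    refresh,
    expected_columns=["customer_id", "region", "total_spend"],
    baseline_means={"total_spend": 512.40},
)
print(problems or "refresh passed the audit checks")
```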
Data audits, despite producing the least glamorous type of reporting information, are necessary prerequisites in any analytics process. If one believes that all analysis starts with data, then one must exercise extreme diligence and respect for the data. This kind of attitude towards data helps foster an appreciation of the many kinds of data nuances that can appear in a project. If analytics is going to be successful, data audits represent the first critical task within any given project. Future discussions will outline how this analytical file is created once the data audits are completed.