Today’s author, Gabhan Berry, a Program Manager on the Excel team. A copy of the spreadsheet discussed in this post can be found as an attachment at the bottom of this post.
When faced with a table of data, what kind of techniques do we employ to extract information from it? Maybe the first thing we do is create summaries. We might total up the columns or we might chart the data. Perhaps, we might even create a PivotTable and drill down into subtotals.
All in all, summarizing data is useful. It’s easy to understand and easy to do.
Let’s say that we’re running a store and we have a small dataset of the sales we made in the first quarter. Using Excel we can easily create summary reports telling us how many products we sold to different types of people; how many to men and how many to women; how many to people over 40 and how many to people under 40.
All these totals are useful. We can compare them and search for interesting differences. We might notice that we sold more products to men than we did to women or that managers bought more of our products than other professions.
But can we get more information than this?
Did managers buy an unusually high number of products compared to other professions? Is this number so high that it might indicate a bias of some kind?
The answers to these types of questions enable us to make better business decisions. Perhaps if we know that our products are significantly more attractive to managers we might change our advertising policies to target managers more than other professions.
You may think that we need data warehouses and data mining tools to answer these types of questions. But that’s not necessarily true. Data warehousing and data mining are excellent tools, but so is Excel and Excel contains some sophisticated statistical functions which we can use to help us answer these types of questions.
Understanding the Basics
First of all, let’s take a look at our data set.
Our data contains customer information which we have gathered over the last quarter of trading. Each time we made a sale we recorded certain information about the customer. For instance, we recorded their Marital Status, their Gender, their Education level and their Occupation (amongst other fields). We have 1,000 records in our data set.
Now let’s say that we are interested in analyzing the Occupation field. It would be useful to know whether or not certain professions tend to buy more of our products than other professions. This would be a key piece of business intelligence for us.
But how do we decide this?
First of all, let’s build some summaries. Let’s count how many sales we made to each profession and each profession’s percentage of the overall sales and display these numbers in a table. We’ll also insert a pie chart to give some simple, visual indication of the proportions sold to each profession.
With this done, let’s now ask ourselves the question again: Do certain professions buy more of our products than other professions?
We can see that Professional and Skilled Manual occurred more frequently in the data than the other professions and that Manual occurred the least. But is this enough information to enable us to generalize and state that Professional and Skilled Manual buy more of our products in general than the other professions?
Well, not really.
Why? Because before we can state anything about these totals we need to have some benchmark to compare them to. We need to have some way of calculating expected numbers for each profession. If a profession had more sales than the expected number then that takes us one step closer to being able to state that there is an unexpected difference between the amounts of products sold to each profession.
To determine the expected values we have to think a little about probability.
For any sale, what is the probability of the occupation being Professional? There are 5 occupations and each sale can only be assigned one occupation so the probability of it being Professional is 1 in 5 i.e., 20%. In fact the probability for each occupation is 20%, assuming there is no bias in the data.
So, for any set of unbiased data, we would expect 20% to be Professional, 20% Clerical, 20% Management, 20% Manual and 20% Skilled Manual.
Thus, for 1000 records, we would expect each occupation to have 200 sales.
But common sense tells us that we’re very unlikely to get exactly 200 sales for each profession. So what we really mean is that we expect to get approximately 200 sales for each profession.
Looking at the actual data, we see that Clerical appears in 177 of the 1000 records. That’s a difference of -23 from the expected value. Is this close enough to 200 to be considered approximately 200?
Skilled Manual has 255 records; a difference of +55. Is this difference big enough to be significantly different?
In our data there are 255 Skilled Manual records. What is the probability of this? That is, what is the probability of there being 255 Skilled Manual records out of 1000 when each record could have any one of five occupations?
Also, since we expect 200 Skilled Manual records it seems logical to also expect that the probability of having 200 records is higher than the probability of having any other number. We’d also expect that the probability of getting 205 records would be higher than the probability of getting 220 (because 205 is closer to 200 than 220 is).
Thus, as we move away from 200 we would expect the probabilities to decrease and decrease towards 0.
Finally, we would expect that there would come a point when the probabilities are so small it would be extremely unlikely for these values to occur in unbiased data. Therefore, if these values did occur then our data has a high likelihood of being biased.
But, how do we calculate these probabilities?
We calculate them using something called the binomial distribution.
Using the Binomial Distribution
Data is expected to have a binomial distribution if:
- There are a finite number of records;
- The value of one record is not influenced by the value of any other record;
- Each record has a value of either true or false;
- The probability of a record having a value of true is the same for each record;
At first, it doesn’t seem like our data has these properties. But it does – we just need to look at it slightly differently.
We’re interested in the Occupation field and we know that there are five occupations. But instead of considering all five occupations at once, let’s consider just one occupation at a time. Let’s start with Skilled Manual.
We have a finite number of records (1000); the occupation of one record is not influenced by any other record; the probability that the record has a value of Skilled Manual is always the same (20%); and each record has an occupation of either Skilled Manual or not Skilled Manual.
Thus, the Skilled Manual data exhibits the required properties and therefore we expect that the Skilled Manual data is binomially distributed.
Using the same principles we can see that the same is true for the remaining four occupations. Instead of considering all five occupations at once, we consider them one at a time. Each occupation should be binomially distributed. If we find an occupation that isn’t, then there is a likelihood of bias in our data.
Excel has a function called BINOMDIST which calculates the probability of getting X true values out of a fixed number of records where the probability of each record being true has a certain, fixed value. In other words, BINOMDIST calculates the binomial probability.
For example, we can calculate the probability of getting 150 true records out of 1000 where each record has a 20% probability of being true by doing:
=BINOMDIST(150, 1000, 0.2, FALSE)
To illustrate how the probabilities are distributed (i.e., where the high values are and where the low values are) let’s calculate the BINOMDIST for each value between 1 and 1000. In other words, we will calculate the probability of getting 1 true record out of 1000, 2 true records, 3 true records etc. We’ll then plot these values on a chart.
Consider the following chart.
Notice that the highest value (the peak of the graph) is at 200? Also notice that as we move away from 200 the probabilities decrease towards 0?
Looking at the chart we can state that, if our data is binomially distributed, we are extremely unlikely to have 150 or fewer records for any occupation. Therefore, if we indeed have fewer than 150 records for any occupation it implies that our data is unlikely to be binomially distributed.
Notice the careful use of unlikely in the previous sentence. When dealing with probabilities we can rarely be 100% certain of anything. Instead we state facts as having a certain likelihood of being true. This is called a confidence level.
The confidence level is the final tool we need in order to conclude whether our data is biased towards certain occupations or whether the differences between occupations occur purely by chance.
Before we draw conclusions about our data we need to decide what confidence level we are happy with. A 95% confidence level means that there is a 5% chance that we will draw the wrong conclusion; a 99% confidence level means that there is a 1% chance we will get it wrong.
How confident do we need to be?
The confidence level is something that we need to decide for ourselves. Typically it is 95% or 99%. For the purposes of this article, let’s stick with 95%.
Now that we have our confidence level we can calculate what’s called the critical values of the binomial distribution.
We will calculate two critical values. These are:
- Lower bound: the smallest number of true records for a binomially distributed variable (at a specific confidence level);
- Upper Bound: the largest number of true records for a binomially distributed variable (at a specific confidence level);
We calculate the critical values using the CRITBINOM function.
To calculate the lower bound we do:
=CRITBINOM(1000, 0.2, 0.025)
This returns a value of 176.
To calculate the upper bound we do:
=CRITBINOM(1000, 0.2, 1-0.025)
This returns a value of 225.
Notice that the critical values depend on the total number of records in our data (1000), the probability each record has of being true (20% or 0.2) and the confidence level we want to work with (0.5/2 = 0.025). If these values change, the critical values change. So, for example, if we had 2000 records instead of 1000, the critical values would be different.
Also notice that we specify the confidence level as an error level. That is, we don’t supply a 0.95 confidence level we supply a 0.05 error level.
What’s more, we are performing what’s called a two-tailed test which means that we are looking for differences that are significantly lower or significantly higher than the expected value. Therefore we have to halve our error level from 0.05 to 0.025.
So now we have the bounding values. These enable us to determine whether or not our data is binomially distributed.
We expect each occupation to occur 200 times out of 1000. These critical values tell us how close to 200 the actual numbers of occurrences have to be in order for them to be considered close enough.
Thus, if the number of records for a given occupation is between 176 and 225, that occupation has a 95% likelihood of being binomially distributed. Otherwise, there is 95% likelihood that the occupation is not binomially distributed.
If we discover that our data is not binomially distributed then something other than chance is determining the data.
Producing the Report
Armed with this knowledge, we are ready to build our report for the Occupation field. Let’s restate the question we’re asking ourselves: Do certain professions buy more of our products than other professions?
Here’s the report:
The Count column simply uses COUNTIF to count the number of occurrences of each occupation. We have added some conditionally formatted data bars to offer some visual indication of the distributions.
The Score column also uses conditional formatting. An up arrow means that that the occupation is more frequent than expected; a horizontal arrow means that the occupation occurs the expected number of times (i.e., is binomially distributed) and a down arrow means the occupation occurs less frequent than expected.
The formula in the Score column is as follows (use the color highlights to interpret the cell references):
That is, the formula compares the count to the lower and higher bounds and returns -1 (less than lower bound), 0 (between lower bound and higher bound) or 1 (greater than higher bound).
The functions in the Lower Bound and Higher Bound cells use CRITBINOM as before.
Here’s the function in the Lower Bound cell:
Interpreting the Results: What Does this Really Mean?
Let’s look again at the table of results:
According to this analysis, only Clerical is binomially distributed. All the other occupations are not. So, what conclusions can we draw from this? What does this analysis really tell us?
Let’s take Skilled Manual as an example.
Our analysis tells us we can be 95% certain that the Skilled Manual occupation occurs significantly more frequently in the data than we would expect it to.
Does this really mean that Skilled Manual workers prefer our products?
Well, the numbers alone do not tell us that. All the numbers tell us is that we can be 95% sure that an unexpectedly high number of Skilled Manual workers bought our products. But there may be various reasons for this. Maybe our shop is next to a construction site or maybe it is next to a factory.
These important real-world factors are not conveyed in the numbers and instead must come from ourselves, as information workers. We must take the numbers and apply our business knowledge to them.
The numbers don’t contain magical answers, but they do provide us with some interesting properties of our data that we should seek to explain and understand.
We can use the numbers as a tool to help us make intelligent business decisions, but the numbers alone don’t make the decisions for us.