SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2
Data Mining has been built into SQL Server since version 2005 and it’s quite comfortable and wizard-driven to design your mining models. However, Data Mining is not much about the toolkit but more about data preparation and interpreting the results. Without a proper data preparation, the algorithms will fail in really predicting or clustering the data. And the same is true for the interpretation of the results. But before we can start interpreting the data, we have to trust the results. At the design time of each mining model we can use test case holdouts, lift charts and cross validation to see if the model is robust and meaningful. But most of our prediction models try to predict a future behavior based on the knowledge of today and the past. What if there are significant changes in the market that are not already trained into our models? Is our model still correct or are we missing an important variable?
Today’s post is about implementing a back testing process to validate the mining results. For our example, we use a churn score prediction model. Think of a telecommunication company: Each customer has a 12 months contract that can be cancelled by the customer at the end of the period. If the contract is not cancelled, it’s continued for another 12 months. We want to predict how many customers are going to cancel their contracts during the next 3 months (the 3 months latency has to be build into our training set, but this is a different topic). To make things more simple in the first step let’s assume the company is not going to use the mining results (no churn prevention) but just waits 3 months to compare the reality with the prediction. That’s what we’re doing during a back testing.
So let’s assume that we did a churn prediction 3 months ago. Our results are returned from our data mining model as a table like this:
CaseKey | Churn | ChurnProbability |
1 | true | 87.4 |
2 | false | 7.1 |
3 | false | 1.7 |
4 | true | 50.2 |
5 | false | 11.3 |
6 | false | 16.0 |
7 | false | 6.9 |
8 | false | 1.8 |
9 | false | 2.6 |
10 | false | 18.7 |
… | … | … |
For my example I have 30,000 contracts (cases). Churn prediciton = true means that this contract (case) is likely to be cancelled by our customer (predicted by the mining model). In my dataset this is true for 2010 cases. I left some more columns out here, but usually you are also looking at the support and other measures.
Now, 3 months have passed and we want to check how good our initial data mining model was. As said below we didn’t do anything to prevent our 2010 cases from above to cancel their contract. Now, looking at our CRM system reveals that actually 2730 customers of those 30,000 cancelled their contract. What does this mean? We expected 2010 to cancel but in reality it was 2730. Does this mean our model is wrong? Or can we still rely on the model?
The clue to answering this question is to compute how likely it is, that
a) Our model is correct and
b) We see 2730 customers to cancel their contract
Just to avoid confusion, we’re not looking at the error cases within our prediction (as we would do with a Receiver Operating Characteristics analysis) but we try to validate the model itself.
If the Churn probability is the same for each case we could use a binomial test validate the model(see one of my very first blog posts about this topic). Another way to do this computation is to run a Monte Carlo scenario generator against our data from above. Basically, the test works like this
- Do N loops (scenarios)
- Set the number of cancelled contracts x for this scenario to zero initially
- Look at each case
- Compute a random number r and compare this number with the ChurnProbability p
- If the r<p count this case as cancelled (increment x)
- increment the number of occurrences of x cancellations
I’m showing an implementation of this approach in my next post but for today let’s just concentrate on interpreting the results. For my example I used a SSIS script component to actually perform the Monte Carlo test. I used 30,000 scenarios and ended up with the following result:
As you can see, most of my scenarios ended with approx. 2800 cancellations (peak in the chart). This might be the first surprise. Assuming the mining algorithm was right, there are still much more cancellations happening than being predicted in the predicted churn column. How can this be? Well, actually the predicted value follows a very simple rule: It switches at 50%. This is a strong simplification of the true distribution. So instead of looking at the predicted values you should better look at the expectancy value:
Predicted value | Expectancy value |
select count(*) from Mining_Result where churn=1 | select SUM(ChurnProbability) from Mining_Result |
Result: 2010 | Result: 2784 |
As you can see, the expectancy value matches our distribution histogram from above. In most situations, the expectancy value differs from the value count. This is highly dependent on the distribution of the probability values. For example, doing the same test with the bike buyer decision tree model in Adventure Works I got 9939 cases with a predicted value of 1 for the BikeBuyer variable. Here the expectancy value is about 9135, so in this case it is lower than the number of predicted cases.
Back to our histogram from above. We can easily replace the number of cases by the percentage value of the total cases. This results in the probability density function. In order to proceed we have to use the aggregated density function. For our example, this function looks like this
This function tells us the probability for seeing less than a certain number of cancellation. As expected, the probability to see less then 30,000 cancellations is 100% (as we only have 30,000 customer who could cancel). On the other hand, the probability to see less than 0 cancellations is 0%. Again it may be a surprise to see that actually the probability for seeing less than 2600 cancellations is close to 0 (from the graph above). How does this look around our real number of 2730 cancelled contracts? Here is the extract from the table:
NumCases | TotalProbability % | 1-TotalProbability % |
… | … | … |
2726 | 7.9 | 92.1 |
2727 | 8.3 | 91.7 |
2728 | 8.7 | 91.3 |
2729 | 9.1 | 90.9 |
2730 | 9.5 | 90.5 |
2731 | 9.9 | 90.1 |
2732 | 10.3 | 89.7 |
2733 | 10.8 | 89.2 |
2734 | 11.2 | 88.8 |
… | … | … |
In this table, the total probability is the aggregated probability from our chart above and means the probability for seeing less than NumCases cancellations while 1 minus total probability means the probability to see more than NumCases cancellations.
From this table you can see that the probability for seeing more than 2730 cancellations is still about 90.5%. Now let’s look at the area between 2740 and 2920 cancellations (to reduce the number of lines I’m only showing every 20th row):
NumCases | TotalProbability % | 1-TotalProbability % |
… | … | … |
2740 | 14.4 | 85.6 |
2760 | 28.4 | 71.6 |
2780 | 46.5 | 53.5 |
2800 | 65.1 | 34.9 |
2820 | 80.7 | 19.3 |
2840 | 91.0 | 9.0 |
2860 | 96.6 | 3.4 |
2880 | 98.9 | 1.1 |
2900 | 99.7 | 0.3 |
2920 | 99.9 | 0.1 |
… | … | … |
While it is still likely (86%) to see more than 2740 cancellations, it becomes more and more unlikely with higher the value gets for the cancellation. And seeing more than 2900 cancellations is very unlikely (less than 1%). Of course, this only refers to the case, that the model is operating correctly.
In order to make our back testing an easy procedure we want to define a simple threshold T. Our model passes the test as long as there are no more cancellations than T. If the model does not pass the test, we have to re-validate the variables and check the overall state of the model. We do not want to do this too often. Therefore, the probability for our model being correct and still not passing the test should be less than 10% (remember that we will run the mining prediction over and over again). Now we need to find a proper value for T.
From our table above we can see that T is close to 2840 cancellations. We can query the correct value from our histogram table:
select MIN(numcases) from mining_histogram where 1-TotalProbability<0.1
The result is T=2838 for my example. So our 2730 cancellations from above are definitely below our test threshold T and therefore our model clearly passes the test.
Now that we’ve set a threshold to the condition “model correct and cancellations>2838”, what about the situation in which our model is incorrect. In this case we would assume that the real probability for a customer cancelling the contract is higher than predicted by our model. This is of course only one assumption we could make. Depending on the conditions and the environment, the definition of the “wrong model” can be different. In any case we have to define a “wrong” or alternative model.
For our example, this is how our model looks like with a 3% higher probability for cancellation (blue line).
For this chart, we have to create a separate histogram table for our alternative (wrong) model and also calculate the results in our Monte Carlo process. Again, we can read the probability for the condition “model is wrong but passes the test” from our histogram table. In our case it’s about 24.4%.
If you’re not interested in a more statistical view of our test you can now skip to the conclusion below.
Otherwise you probably already know that there are two possible mistakes we could make here:
- model is correct but fails the test, usually referred to as type 1 or alpha error (false positive)
- model is incorrect but passes the test, usually referred to as type 2 or beta error (false negative)
Model is correct | Model is incorrect | |
Test is negative, meaning the model passes the test | correct result(probability=1-alpha, so called specificity of the test) | type 2 error / beta error |
Test is positive, meaning the model does not pass the test | type 1 error / alpha error | correct result (probability=1-beta, so called power or sensitivity of the test) |
In order to get a better understanding of our test situation we can plot alpha and beta together into one chart:
The green line shows the probability for more than N cancellations in our correct model. The red line shows the probability for less than N cancellations in our wrong model. The bigger our threshold gets (the more we get to the right side in the diagram) the
- lower our alpha gets (lower risk for the type 1 error “model is correct but does not pass the test”)
- higher our beta gets (higher risk for type 2 error “model is incorrect but does pass the test”)
Here are some sample values for different values of T
T | Alpha | Beta |
2780 | 53.5 | 1.8 |
2790 | 44.3 | 3.1 |
2800 | 34.9 | 5.2 |
2810 | 26.5 | 8.4 |
2820 | 19.3 | 12.7 |
2830 | 13.5 | 18.6 |
2840 | 9.0 | 25.9 |
2850 | 5.7 | 34.1 |
2860 | 3.4 | 43.5 |
2870 | 1.9 | 53.0 |
2880 | 1.1 | 62.2 |
If we want to keep alpha below 5% you can see that beta will over 35%. On the other hand, if we try to keep beta below 5% alpha we will over 35%. The “best” values for alpha/beta or not simply the value for T where alpha is almost equal to beta (here T=2826, alpha and beta approx. 16%) but depends on the error that you are wanting to minimize. In our example above we demanded alpha<10% which resulted in T=2830 and beta being about 25%.
In our case the reason for keeping alpha low would be to avoid readjusting the model too often (causing costs). On the other hand, keeping beta low reduces the risk of working with a wrong model and potentially loosing more customers. Basically this decision has to be made before actually defining the threshold value T.
Conclusion
For this example we defined a very simple test (number cancellations < 2838) which satisfies these two criteria:
- It is unlikely (<10%) that our model does not pass the test although it is correct
- It is unlikely (<25%) that our model passes the test although it is wrong (goes off by 3%)
While the 1st criteria means we’re not loosing more customers as expected, the 2nd criteria means we’re not spending too much budget on fine tuning the model.
It should be stated that the above calculations are done on modified (randomized) data. In practical life it can be more difficult to find a proper test for the model and also the tradeoff between error 1 and 2 can be much higher.