
Tuesday, July 13, 2010

How much support do you need for your Data Mining results?

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

When querying your SSAS mining model, you may have noticed the prediction function PredictSupport, which is supplied by most of the mining algorithms, for example the decision tree. The mining algorithm does not only predict a certain variable and give a prediction score, but also tells us how many cases this decision was based on (the case support). You might have the feeling that predictions with a lower support are not as reliable as those with a higher support. This post is about how to determine the support needed for a given model.

To make this theoretical topic a little more practical, let's look at actual data. For this post I'm using the targeted mailing decision tree model (TM Decision Tree) of the Adventure Works SSAS sample database. While you can follow the same process described here with a data mining query result, I'm looking at the model itself instead. To do so, I run the following query on my SSAS database:


SELECT FLATTENED NODE_CAPTION, NODE_SUPPORT,
    (SELECT ATTRIBUTE_VALUE, [SUPPORT], [PROBABILITY] FROM NODE_DISTRIBUTION) AS t
FROM [TM Decision Tree].content

WHERE IsDescendant('', '000000001') AND [CHILDREN_CARDINALITY]=0

I left out a lot of interesting columns here to keep the query result compact. There are many more fields you can query (some depending on the mining algorithm).

By selecting "children cardinality equals zero", we only get the leaf nodes of our decision tree. And since we used the FLATTENED keyword, the nested distribution table is returned as multiple lines here (one line per possible value; try removing FLATTENED from the query above to see the difference). This is what the result looks like:


Since our Bike Buyer attribute can have three states (Missing, 0 = no bike buyer, 1 = bike buyer), we get three lines per node in our flattened decision tree model query. For example, the first three lines belong to the node labeled "Yearly Income < 58000", which can be found at the top of the lower half in the decision tree model viewer:


If you look at the details of this node in the mining model viewer, you will see the following values:


This is almost exactly what we have in our table above. For this node, the mining algorithm found 1362+465=1827 cases that matched the conditions of the node. You can see the full list of conditions in the NODE_DESCRIPTION column (which I left out here). In my case this is

Number Cars Owned = 2 and Region not = 'Pacific' and Yearly Income < 58000

So of all the rows that matched these criteria during training of the model, 1362 had a Bike Buyer value of 0 and 465 had a value of 1. The probability can then be computed from the number of cases, for example 100*465/1827 gives approximately 25.45%. Actually, the decision tree calculates the probability in a slightly different way. For many cases the result is very close to the quotient above, but there are differences. If you want full control over the method of calculation, you may want to compute the probability score on your own (for the example above this means calculating 100*465/1827 instead of taking the probability provided by the mining algorithm).
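The quotient above can be sketched in a few lines of Python, using the support counts of the node shown earlier:

```python
# Support counts for the node "Yearly Income < 58000",
# taken from its NODE_DISTRIBUTION rows
support = {0: 1362, 1: 465}

total = sum(support.values())           # 1827 cases fell into this node
probability = 100 * support[1] / total  # share of bike buyers in percent
print(round(probability, 2))            # 25.45
```
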

If we use this model for prediction, probabilities above 50% will be mapped to the positive result. For our example, we want to predict the “Bike Buyer” variable. For cases that fall into the node displayed above, the result will be 0 (as the probability for zero is greater than 50%) meaning that these contacts are unlikely to buy a bike.

To proceed, I focus on the prediction value Bike Buyer = 1. I copied the results from the query above to Excel, removed all lines for result 0 or Missing, and did some formatting of the table. These are the first rows of the resulting table.


In order to decide whether a given support is good or bad, we first have to decide on what basis we want to make the decision:

  1. Decision based on the predicted value
  2. Decision based on the predicted value's probability

In the first case, we're only using the predicted value from our mining model. In our example this would be a prediction of the Bike Buyer variable as true or false. Our concern would be that, despite the observed support, there is a high probability that the true distribution favors the alternative decision.

In the second case, we're taking the real prediction probability for some further calculation, for example for back testing (which I will discuss in my next posts) or for calculating expected values from our prediction (for example expected costs).

Let’s start with the first case here.

1. Decision based on the predicted value

This case is based on some rule that derives the prediction variable from the prediction score and, optionally, further attributes. SSAS does this on a very simple basis by switching the decision at 50%. But the rule can look very different: you may consider all rows with a probability above 30% as potential bike buyers, or you may include your estimated revenue. You could also use the profit analysis that is built into SQL Server Data Mining to decide on the rule. In any case, for your mining results you are only interested in the output of the rule (not in the "real" probability), in our case "true" or "false".

For our example, I'm using the "default" rule of 50%. In order to understand the effect of the support, let's look at our concerns. First we start with a case having a probability higher than 50% (or whatever our threshold is), for example line 3 of the Excel table above. This case has a probability of 61.28%, meaning that 65 out of 106 rows had the Bike Buyer variable set to 1. In this case, our concern is:

The real probability for "Bike Buyer = 1" is 50% or lower but still the random sample of data we used during training resulted in 65 positive rows.

If this happened, our model would make the wrong decision for all rows that match the node's criteria.

To deal with this concern statistically, I'm using a simple and common trick here (maybe you should allow yourself a few minutes to think about it). Let's change the concern to:

The real probability for "Bike Buyer = 1" is exactly 50% but still the random sample of data we used during training resulted in 65 or more positive rows.

For this concern we can simply use Excel's binomial function to compute the probability:


I used the Excel 2010 function here. If you're using Excel 2007 or earlier, the function name is BINOMDIST. The above function returns a value of 0.7%, so it is very unlikely that our model training was based on a wrong decision.
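If you want to double-check this tail probability outside of Excel, it can be sketched as below. (Exactly which value you get depends slightly on whether the observed count itself is included in the tail, so small deviations from the percentage quoted above are expected.)

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    positive rows in n cases if the true probability were p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Concern: the true probability is exactly 50%, yet the training data
# showed 65 positives out of 106 cases in this node.
risk = binom_upper_tail(65, 106, 0.5)
print(f"{risk:.2%}")  # around 1% -- very unlikely, so the node looks safe
```

In Excel, BINOM.DIST(k, n, p, TRUE) returns the cumulative probability P(X ≤ k), so an upper tail of "k or more" corresponds to 1 - BINOM.DIST(k-1, n, p, TRUE).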

For the lines with a probability of less than 50%, our concerns are just the other way round (of course you could also apply the same rule as above and look at "Bike Buyer = 0"). For our example I refer to the first line in the table above:

The real probability for “Bike Buyer = 1” in this specific node is 50% or higher but we still see 465 positive rows in our 1827 cases.

Again, I’m transforming this to:

The real probability for "Bike Buyer = 1" in this specific node is exactly 50% but we still see 465 or fewer positive rows in our 1827 cases.

In this case, we have to use this function to compute the probability:


Again the probability is very low (almost exactly 0%).
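The same check can be sketched for the lower tail. Because the binomial coefficients for n = 1827 overflow ordinary floating point, the sketch below works in log space:

```python
from math import exp, lgamma, log

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space so that
    the huge binomial coefficients for large n do not overflow."""
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return sum(exp(log_pmf(i)) for i in range(0, k + 1))

# Concern: the true probability is exactly 50%, yet training showed
# only 465 positives out of 1827 cases.
risk = binom_lower_tail(465, 1827, 0.5)
print(risk)  # practically zero
```
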

Let’s add this calculation to the Excel-table from above:


The formula for E7 is


This formula is then copied for all lines below.
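As a sketch of what this column computes: for each node, take the binomial tail on whichever side of the 50% threshold the node falls (the function names here are my own, not from the post; the Excel formula does the equivalent with BINOM.DIST):

```python
from math import exp, lgamma, log

def log_pmf(i, n, p):
    """Log of the binomial point probability P(X = i)."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))

def support_risk(positives, cases):
    """Chance of an outcome at least this extreme if the true
    probability for Bike Buyer = 1 in the node were exactly 50%."""
    p = 0.5
    if positives / cases > 0.5:        # node predicts 1: upper tail
        tail = range(positives, cases + 1)
    else:                              # node predicts 0: lower tail
        tail = range(0, positives + 1)
    return sum(exp(log_pmf(i, cases, p)) for i in tail)

print(f"{support_risk(65, 106):.2%}")    # line 3 of the table
print(f"{support_risk(465, 1827):.2%}")  # line 1 of the table
```
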

How do we read this table? To keep things simple: low values in the last column are good (as in our two example values from above). The closer the predicted probability gets to 50% (in other words, the weaker the prediction), the more support is needed to make it reliable.

How do you work with this table? Let's say you want to be 95% sure that the support is strong enough to prevent our mining model from being trained for the wrong response. In this case you would filter out the nodes from the table above that have a value of 5% or above in the last column. This is the result:


In this case we would have filtered out 608 of the 12939 cases used for the support (approximately 4.7%). What are you going to do with these cases? Well, this depends on your mining question. Typically, you would consider these nodes to be bike buyers if you want to minimize the risk of losing potential customers, or you would consider them as not being bike buyers if your goal is to reduce costs (by excluding these cases from the mailing). In any case, you will have to ignore the response from the data mining process if you do not want to risk being misled.

In our case it is OK to drop 608 cases, but what do you do if all cases have very poor support? Then you should review your model carefully. Seeing a lot of cases with a low support usually means that your model is overtrained, for example because your decision tree has too many nodes. Mining works best when the real world can be simplified. For the decision tree you might try to increase the MINIMUM_SUPPORT parameter or set a higher value for the COMPLEXITY_PENALTY parameter.

The surer you want to be that your case support is good enough, the more cases you will have to drop. For our example, here are a few values:


2. Decision based on the predicted value's probability

In this case we’re not so much interested in a simple decision (bike buyer true or false) but more about the probability for each case. This also means we’re having to consider different concerns (although similar compared to the ones above). The whole process is very similar to the one above and if you are not too much interested in the outcomes you can just skip to the conclusion.

Basically, we only need to replace the 50% in the example above with an alternative probability derived from the predicted probability. However, as described above, our concerns differ between the cases that favor a positive and a negative result. So for our example, we're considering a cut-off at 50% again.

Let’s start with the lines having a probability higher than 50%. For example, let’s look at the 3rd line from the table above. The probability found during training was 61.28% that is, 65 of the 106 rows hat the Bike Buyer variable set to 1.

In this situation our concern is that the real probability is less than 61.28%, but the random sample we used during training still resulted in 65 positive rows. Of course, if the real probability were 61.27%, this wouldn't make much of a difference. Let's try a difference of 3%, so our concern (already transformed as described above) is:

The real probability for “Bike Buyer = 1” in this specific node is exactly 58.28% but we still see 65 or more positive rows in our 106 cases.

Again we’re using Excel’s binomial function to compute the probability:


The above function returns a value of 23.26%. Although it is below 50%, it is still considerable (compare this to the 0.7% from above).
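This tail can be checked the same way as before, just with the shifted probability. (As above, whether the boundary count of 65 is itself included in the tail moves the result by a few percentage points, so small deviations from the quoted value are expected.)

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Concern: the true probability is 58.28% (3 points below the trained
# 61.28%), yet we still saw 65 or more positives in 106 cases.
risk = binom_upper_tail(65, 106, 0.5828)
print(f"{risk:.2%}")
```
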

For the lines with a probability of less than 50% our concerns are just the other way round. For our example I use the first line from the table above as a reference again:

The real probability for "Bike Buyer = 1" in this specific node is exactly 28.46% but we still see 465 or fewer positive rows in our 1827 cases.

To compute this in Excel, you use the following formula:


In this case we get a result of 0.22%, meaning it is extremely unlikely.
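And the corresponding lower-tail check, in log space again because of the large n:

```python
from math import exp, lgamma, log

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed in log space."""
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return sum(exp(log_pmf(i)) for i in range(0, k + 1))

# Concern: the true probability is 28.46% (3 points above the trained
# 25.46%), yet we still saw 465 or fewer positives in 1827 cases.
risk = binom_lower_tail(465, 1827, 0.2846)
print(f"{risk:.2%}")  # a fraction of a percent
```
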

I added this calculation as an additional column to my Excel table from above. This is the result:


The formula for E7 is


This formula is then copied for all lines below.

Again, we can now filter the table for whatever confidence we would like to have. If we want to be 80% sure that our case support does not favor a model that is 3% off, we filter out all lines from the table above that have a value higher than 20% in the last column.


A lot of nodes remain in this case (many more than in our discrete example from above): we would have to ignore about 19.4% of our predictions.

Here are some sample results for the cases that need to be left out:


As expected, the surer you want to be, the more cases you have to drop. Likewise, the less tolerance you allow for your model, the more cases you have to drop. So it's up to you to find the best mix of confidence and usability for your model.

Sunday, July 11, 2010

How to use an Excel 2003 file as a datasource for an SSAS cube

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

OK, usually you wouldn't want to use an Excel file directly as a datasource for an SSAS cube in a real-life scenario. But in case you'd like to set up a quick demo without bothering to create a new database in Microsoft SQL Server or Microsoft Access, the ability to source your cube from an Excel file can be more than welcome. Just imagine you want to try some design ideas. Working with SQL Server databases, you would end up with a large number of test databases, or you would need to back up and restore your databases all the time to test different scenarios. With Excel as the source for your cubes, you can put your test data right into your SSAS solution. To modify the datasource, you simply make a copy of your Excel file (to back up the older version) or of your solution, instead of caring about databases. And even on a machine without the databases installed, you can use any of your testing solutions by just opening the solution, as the data becomes part of the solution.

Sounds good, doesn't it? But how can you do so? First, when trying to set up Excel as an OLE DB source, you will notice that it just isn't there.


What we can do, however, is use the Microsoft Jet 4.0 OLE DB Provider, so that's where we start. The next dialog asks us to provide the database file name.


If you click the 'Browse...' button, you will notice that the selection is limited to .mdb or .accdb files, as the Jet OLE DB provider is usually used with Microsoft Access databases. So we just change the file type selection to 'All files' and pick our Excel file.


Now, if you click on 'Test connection' you will get an error message like the one below:


Our Jet OLE DB provider still believes that we are connecting to a Microsoft Access database file, and therefore it cannot connect. So here comes the really important step: we open our connection again, click the 'Edit' button to edit the connection string, and then switch to the 'All' tab of the connection properties.


As shown in the screenshot we have to set the extended properties to 'Excel 8.0;HDR=Yes;IMEX=1;'.

'Excel 8.0' stands for Excel 2003 (I couldn't get Excel 2007 to connect properly using 'Excel 9.0', so I stayed with the Excel 2003 format here). 'HDR=Yes' means that our Excel tables contain header rows.
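Put together, the resulting connection string looks roughly like the one below (the file path is just a placeholder for wherever your workbook lives):

```
Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=C:\Demo\CubeSource.xls;
Extended Properties="Excel 8.0;HDR=Yes;IMEX=1;"
```

'IMEX=1' tells the driver to treat columns with mixed data types as text, which avoids surprises when a numeric-looking column also contains strings.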

After that, a click on 'Test Connection' gives the desired result:


Of course, we now need to build up our Excel file. Each "source table" sits on its own sheet. You can easily build a time dimension, use Excel functions like RAND() to create random fact data, or use VLOOKUP(...) to link your tables of testing data to each other.


Although this is not at all useful in real-life situations (where we would extract the data from the Excel sheet using ETL tools, or simply not store the source data in Excel at all), it can still be handy for setting up a quick-and-dirty example solution and playing around with the source data (adding columns, using different formats, etc.) without having to work on a 'real' database.