
Monday, December 6, 2010

Do you trust your Data Mining results? – Part 3/3

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

This is the third and last post of my Data Mining back testing mini series. In my first two posts I made the assumption that no action is taken based on the prediction. Although the mining process marked some of our contracts with a high probability of being cancelled, we did nothing to prevent our customers from doing so.

Of course, this is not the real-life scenario. Usually our goal is churn prevention (or some other action based on the mining results). However, any action we take will (hopefully) change the behavior of our customers. To be more precise, we expect that fewer customers actually cancel their contract. This is a challenge for the mining process as a whole, as the prevention may “dry out” our mining source data: after a while we might not have any non-influenced customers left with a high probability of cancelling their contract. A test group would help, but in many scenarios this is not wanted (because the customers are just too important to risk losing them for the sake of some statistical validation). Another method would be to leave the influenced customers out of the training set. This keeps our prediction clean, but how do we know whether our prevention is still successful enough?

For our back testing process, we can also add the expected prevention rate to our Monte Carlo procedure. In this case we’re not only trying to validate the model but also the prevention rate. On the other hand, if our test fails we’re not sure if it’s the model or our assumed prevention rate that is actually wrong.
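One way to include the prevention in the simulation is to reduce the churn probability of every case that receives prevention before doing the random draw. Here is a minimal sketch, reusing the table and counting query from the second post of this series; the threshold and the success rate are just the assumptions described below:

declare @PreventionThreshold float = 0.35  -- churn probability above which a customer receives prevention
declare @PreventionSuccess float = 0.9     -- assumed share of successful preventions
declare @cnt int

-- inside the Monte Carlo loop: a prevented case only counts as cancelled if its
-- reduced churn probability still beats the uniform random number
select @cnt = COUNT(*)
from Mining_Result
where case when PredictScore > @PreventionThreshold
           then PredictScore * (1.0 - @PreventionSuccess)
           else PredictScore
      end > convert(float, CAST(CAST(newid() AS binary(4)) AS int)) / 2147483648.0 / 2 + .5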

For the following I assume that every customer with a churn probability greater than 35% receives prevention. From the past we know that 90% of our preventions are successful. Here are the results:

Old model (no prevention included):
image_thumb31

New model (including prevention):
image_thumb1

As expected the curves have shifted to the left. In the new model (including the prevention) the number of customers cancelling the contract is significantly lower than in the old model.

We can still use the same methods as shown in the first post to derive the threshold value for our test. These are the curves for the alpha and beta error (again, for beta I chose an alternative model that is off by 3%).

image_thumb3

Again, here are some values for the threshold T, alpha and beta:

T      Alpha   Beta
1580   88.7    0.0
1600   75.6    0.1
1620   56.7    0.4
1640   36.5    1.6
1660   19.5    5.1
1680    8.4   13.3
1700    3.0   27.0
1720    0.9   46.0
1740    0.2   65.7

For example, with T=1680 we get alpha<10% and beta<14%. But as said above, if our model fails the test we’re not sure anymore whether it is the model or the prevention rate that is wrong. Again, the best way to avoid this uncertainty would be to use test groups. For example, we could exclude a random sample of 10% of all contracts with a churn score above 35% from prevention. Then we’re able to compare the behavior of the random sample (test group) with the customers that received prevention benefits. This test group enables us to measure the effectiveness of our prevention activity. And of course you could set up a statistical test in the same way we described here to prove that the test group really supports a certain success rate.
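Picking such a test group could be as simple as the following sketch (again using the random number trick from the second post; the 10% sample size and the 35% threshold are just the values assumed above):

-- mark a random ~10% of all high-risk contracts as the test (control) group
select CaseKey
from Mining_Result
where PredictScore > 0.35
  and convert(float, CAST(CAST(newid() AS binary(4)) AS int)) / 2147483648.0 / 2 + .5 < 0.1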

To wrap things up: depending on the products, the processes, the way new customers are acquired etc., real-life data mining problems may be much more difficult than the ones presented in this small blog series. Anyway, the methods described here, like the Monte Carlo algorithm or the statistical hypothesis test, are the main building blocks for achieving a reliable understanding of how well the model performs. So, depending on the actual scenario, the above mentioned methods may be combined, adjusted and repeated to get the desired results.

This also ends this mini series about back testing. The basic methods are more or less ubiquitous and may also be used for several similar problems, for example to do the back testing for a financial risk model. By adjusting and combining the methods, a specific business requirement may be modeled into a reliable test. And as long as our model conforms to the test, we can have enough trust in our data mining models to use them as the basis for business decisions.

Sunday, October 10, 2010

Do you trust your Data Mining results? – Part 2/3

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

While the first part of this post was more about the idea and interpreting the results of the test, this part shows how to implement the Monte Carlo test.

First, we need a table with the predicted data mining probabilities. This is the output of the PredictProbability function from your mining result query. I’m using the same source data as in my previous post here. If you like you can easily create your own table and populate it with random probability values in order to test the code for the simulation below:

CREATE TABLE [dbo].[Mining_Result](
    [CaseKey] [int] NOT NULL,
    [PredictScore] [float] NULL
) ON [PRIMARY]

-- populate the table with 10,000 cases and random probability values
declare @i int=0

while (@i<10000) begin

    insert into Mining_Result(CaseKey, PredictScore)
    values(@i, convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5)

    set @i=@i+1

end

Don’t be confused by the convert(…cast…cast newid()…) expression. This is just my approach to calculate a random number within an SQL select statement.
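If you prefer a more common idiom, the same can be done with CHECKSUM; a minimal, equivalent sketch:

-- uniform random number in [0, 1), re-evaluated for every row
-- (unlike RAND(), which is evaluated only once per query)
select ABS(CHECKSUM(NEWID()) % 1000000) / 1000000.0

Both expressions are recomputed for every row of a query, which is exactly what we need for the simulation below.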

Next we need a table for storing our Mining results:

CREATE TABLE [dbo].[Mining_Histogram](
[NumCases] [int] NOT NULL,
[Count] [int] NULL,
[Perc] [float] NULL,
[RunningPerc] [float] NULL,
CONSTRAINT [PK_DM_Histogram] PRIMARY KEY CLUSTERED
(
[NumCases] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Then this is how we’re doing our Monte Carlo test:

truncate table Mining_Histogram;

declare @numtrials int = 10000;
declare @cnt int;
declare @lp int;

set @lp=0;

-- perform a monte carlo test:

while (@lp<@numtrials) begin

    -- count the cases whose predicted score beats a per-row uniform random number
    select @cnt=COUNT(*) from Mining_Result where PredictScore >
        convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5

    -- increment the histogram bucket for this number of cancellations
    if exists(select NumCases from Mining_Histogram Where NumCases=@cnt)
        update Mining_Histogram set [Count]=[Count]+1 Where NumCases=@cnt
    else
        insert into Mining_Histogram(NumCases,[Count]) values (@cnt, 1)

    set @lp=@lp+1;

end

I’m using the same trick for the random numbers as shown above. In this example, we’re doing 10,000 iterations. In each iteration we count the number of cases for which the predicted score is higher than a per-case random number. For example, if the predicted score of a case is 0.8, it is much more likely that a random number between 0.0 and 1.0 falls below the score than for a case with a predicted score of 0.1.

Next, we’re filling the gaps in our histogram table with zeros to make the histogram look nicer:

declare @min int;
declare @max int;
select @min=MIN(NumCases), @max=MAX(NumCases) from Mining_Histogram

set @lp=@min;
while (@lp<@max) begin
if not exists(select NumCases From Mining_Histogram Where NumCases=@lp)
insert into Mining_Histogram(NumCases,[Count]) values (@lp, 0);
set @lp=@lp+1
end

Finally we’re computing the row probability and the running total using this T-SQL:

declare @maxcount float;
select @maxcount=SUM([Count]) from Mining_Histogram;
update Mining_Histogram Set Perc=[Count]/@maxcount;

declare @CaseIdx int
declare @perc float
declare @RunningTotal float =0

DECLARE rt_cursor CURSOR FOR select NumCases, Perc From Mining_Histogram
OPEN rt_cursor

FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc

WHILE @@FETCH_STATUS = 0
BEGIN
SET @RunningTotal = @RunningTotal + @perc
update Mining_Histogram set RunningPerc=@RunningTotal Where NumCases=@CaseIdx
FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc
END

CLOSE rt_cursor
DEALLOCATE rt_cursor
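On later SQL Server versions (2012 and up, so not the versions this post targets) the cursor could be replaced by a windowed SUM; a minimal sketch:

-- running total of the per-bucket probability, ordered by the number of cancellations
select NumCases,
       Perc,
       SUM(Perc) OVER (ORDER BY NumCases ROWS UNBOUNDED PRECEDING) as RunningPerc
from Mining_Histogram
order by NumCases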

After running the simulation this is how the plots of the result look (using my own values). The first plot shows the value of the field NumCases on the x-axis and the value of the field Perc on the y-axis. The second plot has the same x-axis but shows the RunningPerc field on the y-axis:

image_thumb1

image_thumb2

These two plots look very much the same as the plots from my last post (although I used C# code there to generate the histogram data).

If you used the randomly generated scores from above for testing, you will notice the peak being around 5000 cases (instead of 2800 cases in my example).

And if you like a smoother version of the density function (as all the teeth and bumps mainly result from the Monte Carlo approach), you can use this SQL query to compute a moving average:

declare @minrange int=0
declare @windowsize int = 50

select @minrange=Min(NumCases) from Mining_Histogram

SELECT H.NumCases, AVG(H1.[Count]) [Count], AVG(H1.Perc) Perc
FROM Mining_Histogram H
left join Mining_Histogram H1 on H1.NumCases between H.NumCases-@windowsize and H.NumCases
where H.NumCases>@minrange+@windowsize
group by H.NumCases

image_thumb4

In order to do the histogram computation automatically with the prediction query, I recommend putting the code in an SSIS script component. I would also use a different type of random number generator there, which also allows you to set the seed. For my implementation I used an asynchronous script component that first loads all cases into memory (an ArrayList collection), then performs the Monte Carlo test on the in-memory data and finally writes the results to the output buffer. This allows you to work with more scenarios and to log the progress during the loading and testing phases of the component.

I’m planning to write a Books Online Community Technical article on this topic. This article will be more detailed regarding the implementation. I will post a link to this article in my blog then.

Sunday, September 19, 2010

Do you trust your Data Mining results? – Part 1/3

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

Data Mining has been built into SQL Server since version 2005, and designing your mining models is quite comfortable and wizard-driven. However, Data Mining is not so much about the toolkit but more about data preparation and interpreting the results. Without proper data preparation, the algorithms will fail at really predicting or clustering the data. And the same is true for the interpretation of the results. But before we can start interpreting the data, we have to trust the results. At design time of each mining model we can use test case holdouts, lift charts and cross validation to see if the model is robust and meaningful. But most of our prediction models try to predict future behavior based on the knowledge of today and the past. What if there are significant changes in the market that are not already trained into our models? Is our model still correct or are we missing an important variable?

Today’s post is about implementing a back testing process to validate the mining results. For our example, we use a churn score prediction model. Think of a telecommunication company: each customer has a 12 months contract that can be cancelled by the customer at the end of the period. If the contract is not cancelled, it’s continued for another 12 months. We want to predict how many customers are going to cancel their contracts during the next 3 months (the 3 months latency has to be built into our training set, but this is a different topic). To keep things simple in the first step, let’s assume the company is not going to use the mining results (no churn prevention) but just waits 3 months to compare reality with the prediction. That’s exactly what we do during back testing.

So let’s assume that we did a churn prediction 3 months ago. Our results are returned from our data mining model as a table like this:

CaseKey   Churn (Churn Prediction)   ChurnProbability (Churn Probability %)
1         true                       87.4
2         false                       7.1
3         false                       1.7
4         true                       50.2
5         false                      11.3
6         false                      16.0
7         false                       6.9
8         false                       1.8
9         false                       2.6
10        false                      18.7

For my example I have 30,000 contracts (cases). Churn prediction = true means that this contract (case) is likely to be cancelled by our customer (as predicted by the mining model). In my dataset this is true for 2010 cases. I left some more columns out here, but usually you would also look at the support and other measures.

Now, 3 months have passed and we want to check how good our initial data mining model was. As said above, we didn’t do anything to prevent the 2010 cases from above from cancelling their contracts. Now, looking at our CRM system reveals that actually 2730 of those 30,000 customers cancelled their contract. What does this mean? We expected 2010 to cancel but in reality it was 2730. Does this mean our model is wrong? Or can we still rely on the model?

The key to answering this question is to compute how likely it is that

a) our model is correct and
b) we still see 2730 customers cancelling their contract

Just to avoid confusion, we’re not looking at the error cases within our prediction (as we would do with a Receiver Operating Characteristics analysis) but we try to validate the model itself.

If the churn probability were the same for each case we could use a binomial test to validate the model (see one of my very first blog posts about this topic). Another way to do this computation is to run a Monte Carlo scenario generator against our data from above. Basically, the test works like this:

  • Do N loops (scenarios)
    • Set the number of cancelled contracts x for this scenario to zero initially
    • Look at each case
      • Compute a random number r and compare it with the churn probability p
      • If r < p, count this case as cancelled (increment x)
    • Increment the number of occurrences of x cancellations

I’m showing an implementation of this approach in my next post, but for today let’s just concentrate on interpreting the results. For my example I used an SSIS script component to actually perform the Monte Carlo test. I used 30,000 scenarios and ended up with the following result:

image_thumb3

As you can see, most of my scenarios ended with approx. 2800 cancellations (the peak in the chart). This might be the first surprise. Assuming the mining algorithm was right, there are still many more cancellations happening than predicted by the predicted churn column. How can this be? Well, the predicted value actually follows a very simple rule: it switches at 50%. This is a strong simplification of the true distribution. So instead of looking at the predicted values you should rather look at the expectancy value:

Predicted value:   select count(*) from Mining_Result where churn=1
                   Result: 2010

Expectancy value:  select SUM(ChurnProbability) from Mining_Result
                   Result: 2784

As you can see, the expectancy value matches our distribution histogram from above. In most situations, the expectancy value differs from the value count. This is highly dependent on the distribution of the probability values. For example, doing the same test with the bike buyer decision tree model in Adventure Works I got 9939 cases with a predicted value of 1 for the BikeBuyer variable. Here the expectancy value is about 9135, so in this case it is lower than the number of predicted cases.

Back to our histogram from above. We can easily replace the number of cases by the percentage of the total cases, which gives us the probability density function. In order to proceed we have to use the aggregated (cumulative) density function. For our example, this function looks like this:

image_thumb5

This function tells us the probability of seeing less than a certain number of cancellations. As expected, the probability of seeing less than 30,000 cancellations is 100% (as we only have 30,000 customers who could cancel). On the other hand, the probability of seeing less than 0 cancellations is 0%. Again it may be a surprise that the probability of seeing less than 2600 cancellations is close to 0 (from the graph above). How does this look around our real number of 2730 cancelled contracts? Here is the extract from the table:

NumCases   TotalProbability %   1-TotalProbability %
2726       7.9                  92.1
2727       8.3                  91.7
2728       8.7                  91.3
2729       9.1                  90.9
2730       9.5                  90.5
2731       9.9                  90.1
2732       10.3                 89.7
2733       10.8                 89.2
2734       11.2                 88.8

In this table, the total probability is the aggregated probability from our chart above and means the probability of seeing less than NumCases cancellations, while 1 minus the total probability means the probability of seeing more than NumCases cancellations.
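If the simulation results are stored in a histogram table like the one used in the next post (here assuming the running total is exposed as a column TotalProbability with values between 0 and 1), this extract is a simple range query; a minimal sketch:

select NumCases,
       TotalProbability * 100       as [TotalProbability %],
       (1 - TotalProbability) * 100 as [1-TotalProbability %]
from Mining_Histogram
where NumCases between 2726 and 2734
order by NumCases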

From this table you can see that the probability for seeing more than 2730 cancellations is still about 90.5%. Now let’s look at the area between 2740 and 2920 cancellations (to reduce the number of lines I’m only showing every 20th row):

NumCases   TotalProbability %   1-TotalProbability %
2740       14.4                 85.6
2760       28.4                 71.6
2780       46.5                 53.5
2800       65.1                 34.9
2820       80.7                 19.3
2840       91.0                  9.0
2860       96.6                  3.4
2880       98.9                  1.1
2900       99.7                  0.3
2920       99.9                  0.1

While it is still likely (86%) to see more than 2740 cancellations, it becomes more and more unlikely the higher the number of cancellations gets. And seeing more than 2900 cancellations is very unlikely (less than 1%). Of course, all of this only holds if the model is operating correctly.

In order to make our back testing an easy procedure we want to define a simple threshold T. Our model passes the test as long as there are no more cancellations than T. If the model does not pass the test, we have to re-validate the variables and check the overall state of the model. We do not want to do this too often. Therefore, the probability for our model being correct and still not passing the test should be less than 10% (remember that we will run the mining prediction over and over again). Now we need to find a proper value for T.

From our table above we can see that T is close to 2840 cancellations. We can query the correct value from our histogram table:

select MIN(numcases) from mining_histogram where 1-TotalProbability<0.1

The result is T=2838 for my example. So our 2730 cancellations from above are definitely below our test threshold T and therefore our model clearly passes the test.

Now that we’ve set a threshold for the condition “model correct and cancellations>2838”, what about the situation in which our model is incorrect? In this case we would assume that the real probability of a customer cancelling the contract is higher than predicted by our model. This is of course only one assumption we could make. Depending on the conditions and the environment, the definition of the “wrong model” can be different. In any case we have to define a “wrong” or alternative model.

For our example, this is how our model looks with a 3% higher probability for cancellation (blue line).

image_thumb9

For this chart, we have to create a separate histogram table for our alternative (wrong) model and run the Monte Carlo process for it as well. Again, we can read the probability for the condition “model is wrong but passes the test” from this histogram table. In our case it’s about 24.4%.
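Simulating the alternative model only requires changing the probability each case is tested against. Here is a minimal sketch based on the counting query shown in the next post of this series, reading “3% higher” as three additional percentage points; the table name Mining_Histogram_Alt is just a placeholder:

declare @cnt int

-- inside the Monte Carlo loop for the alternative model: raise every churn probability
-- by 0.03 (capped at 1.0) before comparing it to the uniform random number
select @cnt = COUNT(*)
from Mining_Result
where case when PredictScore + 0.03 > 1.0 then 1.0 else PredictScore + 0.03 end
      > convert(float, CAST(CAST(newid() AS binary(4)) AS int)) / 2147483648.0 / 2 + .5

-- the counts are then collected in a second histogram table (Mining_Histogram_Alt)
-- in exactly the same way as for the original model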

If you’re not interested in a more statistical view of our test you can now skip to the conclusion below.

Otherwise you probably already know that there are two possible mistakes we could make here:

  • model is correct but fails the test, usually referred to as type 1 or alpha error (false positive)
  • model is incorrect but passes the test, usually referred to as type 2 or beta error (false negative)
 

  • Test is negative, meaning the model passes the test:
    • Model is correct: correct result (probability = 1-alpha, the so-called specificity of the test)
    • Model is incorrect (alternative model is correct): type 2 error / beta error
  • Test is positive, meaning the model does not pass the test:
    • Model is correct: type 1 error / alpha error
    • Model is incorrect (alternative model is correct): correct result (probability = 1-beta, the so-called power or sensitivity of the test)

In order to get a better understanding of our test situation we can plot alpha and beta together into one chart:

image_thumb13

The green line shows the probability for more than N cancellations in our correct model. The red line shows the probability for less than N cancellations in our wrong model. The bigger our threshold gets (the more we get to the right side in the diagram) the

  • lower our alpha gets (lower risk for the type 1 error “model is correct but does not pass the test”)
  • higher our beta gets (higher risk for type 2 error “model is incorrect but does pass the test”)

Here are some sample values for different values of T:

T      Alpha   Beta
2780   53.5    1.8
2790   44.3    3.1
2800   34.9    5.2
2810   26.5    8.4
2820   19.3   12.7
2830   13.5   18.6
2840    9.0   25.9
2850    5.7   34.1
2860    3.4   43.5
2870    1.9   53.0
2880    1.1   62.2

If we want to keep alpha below 5%, you can see that beta will be over 35%. On the other hand, if we try to keep beta below 5%, alpha will be over 35%. The “best” values for alpha/beta are not simply the value of T where alpha is almost equal to beta (here T=2826, alpha and beta approx. 16%) but depend on the error you want to minimize. In our example above we demanded alpha<10%, which resulted in T=2838 and beta being about 25%.
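If both simulations are stored in histogram tables of the layout shown above (the alternative one in a hypothetical second table Mining_Histogram_Alt), a table like the one above can be produced with a simple join; a minimal sketch:

-- Alpha: probability of more than T cancellations although the model is correct
-- Beta:  probability of at most T cancellations although the alternative model is correct
select H.NumCases                     as T,
       (1 - H.TotalProbability) * 100 as Alpha,
       A.TotalProbability * 100       as Beta
from Mining_Histogram H
join Mining_Histogram_Alt A on A.NumCases = H.NumCases
order by H.NumCases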

In our case the reason for keeping alpha low would be to avoid readjusting the model too often (which causes costs). On the other hand, keeping beta low reduces the risk of working with a wrong model and potentially losing more customers. Basically, this decision has to be made before actually defining the threshold value T.

Conclusion

For this example we defined a very simple test (number of cancellations < 2838) which satisfies these two criteria:

  1. It is unlikely (<10%) that our model does not pass the test although it is correct
  2. It is unlikely (<25%) that our model passes the test although it is wrong (goes off by 3%)

While the 1st criterion means we’re not spending too much budget on fine tuning the model, the 2nd criterion means we’re not losing more customers than expected.

It should be noted that the above calculations were done on modified (randomized) data. In practice it can be more difficult to find a proper test for the model, and the tradeoff between the type 1 and the type 2 error can be much harder.

Friday, September 17, 2010

BI.Quality 2.0 released on Codeplex

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

Today we released the new version of BI.Quality on Codeplex. The new release contains many features from the wishlist (check the release notes for details):

  • Consistent table interface within one test case e.g. to create tables via local files or query
  • Unlimited number of queries within one test case
  • New XmlTableWriter and XmlTableReader to export/import a table
  • New CsvTableWriter and CsvTableReader to export/import a table
  • Assert for query execution time
  • Consistent delta within Asserts, absolute or relative
  • Extended error handling, e.g. AssertTable no longer stops after the first failure
  • Extended example test suite with all best practice test cases based on Adventure Works DWH 2008
  • All changes are fully backward compatible with Version 1.0.0
  • Complete refactoring of the codebase for better extensibility
  • Improved error messages and self tests
  • There are only minor changes between 1.9.7 and 2.0.0. Some fixes improve the backward compatibility for release 1.0 test cases, although it is recommended to use the 2.0 syntax for developing new test cases.

    Saturday, August 28, 2010

    Data Mining on a small amount of data

    SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

    Data Mining is usually associated with finding previously unknown patterns in a large amount of data. But a small amount of data may also contain patterns that are difficult to spot. To illustrate this, let’s look at the following situation. A customer with 80 stores wants to understand why some stores perform better than others. This is especially important before setting up new stores: it would be great to estimate the performance of a new store before building it, which would make it much easier to decide on future store locations. It might also be useful for optimizing the performance of the existing stores. In order to find out, the customer collects some data per store as shown in the following table:

    Criteria / Description
    Sales area: Area for sales (in square meters)
    Total working hours per week: Total working hours for all staff members per week (avg over the last 3 months)
    Total opening hours per week: Total hours that the store is open per week
    Location: E.g. City center, City or Outskirts
    Location type: E.g. Mall, Plaza or separated
    Store interior status: Condition of the store, e.g. modern, average or old
    Store age in months: How old the store is
    Parking facilities: Is it easy to park near the store? E.g. values include good, average, bad
    Average parking costs per hour: What are the average parking costs per hour? Zero means free parking
    Number of competitors within 10 minutes walking distance: Number of competitors within 10 minutes walking distance
    Number of competitors within 15 minutes driving distance: Number of competitors within 15 minutes driving distance
    Buying Power: Buying power of the people who live near the store, ranging from very low to very high
    Sales amount per week: Average over the last three months

    The data can be retrieved by querying the IT systems (for example HR and ERP), by using a survey (usually the staff in the store knows about the competitors and parking facilities around) or by using external market research data (for the Buying power). In my case, the data is just generated sample data:

    image_thumb2

    After gathering all the data, things got more complicated than expected. How do you “score” each store? Which of the parameters are most relevant? Even though we only have 80 rows of data, it is not at all easy to see the dependencies.

    For this example I’m using the Microsoft Office Data Mining Add-In so we can do all the data analysis using Excel. Since it’s Excel everything should be very easy. We want to use the Microsoft Decision Tree algorithm (Icon “Classify” from the ribbon bar):

    image_thumb5

    The process is pretty easy. We have to decide which attribute we want to predict (Sales amount per week) and the wizard does all the rest. Now, here is the complete resulting decision tree:

    image_thumb3

    No branches? What did we do wrong? Well, here are a few steps we should have taken before simply invoking the decision tree. The most important thing is the proper preparation of the data:

    1. Create relative measures, not absolute ones

    The decision tree is capable of detecting rules like “if A then B” or “if A then not B” or even complicated combinations. However, dealing with continuous values is more difficult. The decision tree does not work well with rules like “if A is multiplied by 2, B is multiplied by 1.5” but tries to branch this as “if A > 20 then B > 15”, “if A > 10 then B > 7.5”. This is especially true for input variables like our sales area or opening hours.

    To quickly analyze the relationship we can use Excel’s Scatter chart type. Let’s start with the store size:

    image_thumb7

    From this chart it seems reasonable to calculate sales by square meter instead of taking the absolute sales amount as there seems to be a more or less linear relationship between the sales and the store size.

    Now let’s have a look at the influence of the opening hours. Again we’re using Excel’s Scatter chart:

    image_thumb9

    As the trend line shows, the relationship seems to be a little bit logarithmic. However, let’s assume it’s also linear. Therefore we’re going to create an additional column in our spreadsheet computing sales by square meter and opening hours. This is the formula for our new column “Relative Sales”:

    =[@[Sales Amount per week]]/[@[Sales Area]]/[@[Total opening hours per week]]

    Of course, you would like to also check the influence of other variables, for example the age of the store:

    image_thumb11

    This one looks pretty scattered, so we’re just taking the age of the store as an input variable.

    But there is another relative measure we should create: The average number of sales persons in the store. We’re simply using this formula:

    =[@[Total working hours per week]]/[@[Total opening hours per week]]

    2. Make the input parameters discrete if possible

    While it’s always a good idea to use discrete values for our input parameters, it almost becomes a must if you don’t have many rows of data. An example of a good discrete value is our location, as it can only take these values: City center, City or Outskirts. The fewer buckets, the better. If you’re not getting any results, try making the data simpler by choosing fewer buckets for your discrete values.

    But look at our newly created columns for the relative sales or the average staff:

    image_thumb13

    For our Mining purpose these are too many distinct values and although our decision tree will try to cluster them, we should do this in advance. Therefore we can either use Excel formulas or we could use the functionality of the data mining add-in: The “Explore Data” wizard:

    image_thumb14

    After selecting the table and the column, the wizard analyzes the data and proposes some buckets as shown below:

    image_thumb16

    For my purpose, I reduce the number of buckets to three (be brave!). By clicking the “Add new Column” button the resulting values are added as an additional column to our table:

    image_thumb17

    I’m doing the same for the average staff members (4 buckets), the total opening hours (4 buckets), the store age in months (5 buckets) and the parking costs (4 buckets):

    image_thumb20

    The number of competitors only ranges from 0 to 3 and 0 to 5 respectively, so we leave this data unchanged (not too many distinct values).

    So let’s try again with our prepared data set. Now, our decision tree looks like this:

    image_thumb21

    As you can see there are only three influences identified by the mining algorithm here:

    image_thumb22

    So if parking is free, the parking facilities are important, while if parking is not free, the opening hours are important. This is a good starting point when looking for new store locations.

    Another nice tool from Excel’s data mining add-in is the prediction calculator which can be found on the “Analyze” ribbon:

    image_thumb23

    This one creates a ready-to-use input sheet in which you can enter the values of a potential new store, and Excel immediately computes the likelihood of high sales:

    image_thumb24

    This calculator is based on a different mining model (Logistic Regression). You can also see the impact of the input values on your sales:

    image_thumb25

    In our case, smaller stores seem to perform better. The high value for opening times below 41.5 hours may indicate that our computed column from above might not be well designed. And while there are quite a lot of stores with a sales area of less than 57 square meters, there is only one single store that has already been open for 52 months: our first store, which is definitely special (always equipped in the most modern style, and only selected staff members are chosen to work there). Maybe you want to take this store out of the data before doing the analysis.

    So, after the mining you have to review your results carefully. And of course you should also use the other methods of making sure your model is working fine (lift chart, case support as in my last post, etc.).

    Tuesday, July 13, 2010

    How much support do you need for your Data Mining results?

    SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

    When querying your SSAS mining model you may have noticed the prediction function PredictSupport, which is supplied by most of the mining algorithms, for example the decision tree. The mining algorithm does not only predict a certain variable and give the prediction score but also tells us how many cases this decision was based on (case support). You might have the feeling that cases with a lower support are not as reliable as cases with a higher support. This post is about how to determine the needed support for a given model.

    To make this theoretical topic a little bit more practical, let’s look at actual data. For this post I’m using the targeted mailing decision tree model (TM Decision Tree) of the Adventure Works SSAS sample database. While you can do the same process described here with a data mining query result, I’m looking at the model itself instead. In order to do so I run the following query on my SSAS database:

    SELECT FLATTENED [NODE_CAPTION]
    ,[NODE_DISTRIBUTION]
    ,[NODE_SUPPORT]
    ,[NODE_DESCRIPTION]

    FROM [TM Decision Tree].content

    WHERE IsDescendant('', '000000001') AND [CHILDREN_CARDINALITY]=0

    I left a lot of interesting columns out here to save the space for the query result. There are many more fields you can query (some depending on the mining algorithm).

    By selecting “children cardinality equals zero”, we only get the leaf nodes of our decision tree. And since we used the FLATTENED keyword, the nested NODE_DISTRIBUTION table is returned as multiple lines here (one line per possible value; try removing FLATTENED from the query above to see the difference). So this is how the result looks:

    image_thumb2

    Since our Bike Buyer attribute can have three states (Missing, 0=no bike buyer, 1=bike buyer) we get three lines per node in our flattened decision tree model query. For example the first three lines belong to the node labeled “Yearly Income < 58000” which can be found at the top of the lower half in the decision tree model viewer:

    image_thumb3

    If you look at the details of this node in the mining model viewer, you will see the following values:

    image_thumb4

    This is almost exactly what we have in our table above. For this node the mining algorithm found 1362+465=1827 cases that matched the conditions of the node. You can see the full list of conditions in the NODE_DESCRIPTION column (which I left out here). In my case this is

    Number Cars Owned = 2 and Region not = 'Pacific' and Yearly Income < 58000

    So of all the rows that matched these criteria, 1362 had a Bike Buyer value of 0 and 465 had a value of 1 during the training of the model. The probability is then computed from the number of cases, for example 100*465/1827 gives approx. 25.45%. Actually, the decision tree calculates the probability in a slightly different way. For many cases this is very close to the quotient above, but there are differences. If you want to be in full control of the method of calculation, you may want to calculate the probability score on your own (for the example above this means calculating 100*465/1827 instead of taking the probability provided by the mining algorithm).

    If we use this model for prediction, probabilities above 50% will be mapped to the positive result. For our example, we want to predict the “Bike Buyer” variable. For cases that fall into the node displayed above, the result will be 0 (as the probability for zero is greater than 50%) meaning that these contacts are unlikely to buy a bike.

    In order to proceed, I focus on the prediction value Bike Buyer = 1. I copied the results from the query above to Excel and removed all lines for result 0 or Missing. I also did some formatting of the table. These are the first rows of the resulting table:

    image_thumb6

    In order to decide whether this support is good or bad, we have to decide on which basis we want to do the decision:

    1. Decision based on the predicted value
    2. Decision based on the predicted values probability

    In the first case, we’re only using the predicted value from our mining model. In our example this is the prediction of the Bike Buyer variable as a value of true or false. Our concern would be that, despite the observed support, there is still a high probability that the alternative decision would be the right one.

    In the second case, we’re taking the real prediction probability for some further calculation, for example for back testing (I will discuss this in my next posts) or for calculating expectancy values from our prediction (for example expected costs).

    Let’s start with the first case here.

    1. Decision based on the predicted value

    This case is based on some rule to derive the prediction variable from the prediction score and optionally further attributes. SSAS does this on a very simple basis by switching the decision at 50%. But the rule can be very different: you may consider all rows with a probability >30% as potential bike buyers, or you may include your estimated revenue. You could also use the profit analysis that is built into SQL Server Data Mining to decide on the rule. But in any case, for your mining results you are only interested in the output of the rule (not in the “real” probability), in our case “true” or “false”.

    For our example, I’m using the “default” rule of 50%. In order to understand the effect of the support, let’s look at our concerns. First we start with a case having a probability higher than 50% (or whatever our threshold is), for example line 3 of the Excel table from above. This case has a probability of 61.28%, meaning that 65 out of 106 rows had the Bike Buyer variable set to 1. In this case, our concern is:

    The real probability for “Bike Buyer =1” is 50% or lower but still the random sample of data we used during training resulted in 65 positive rows.

    If this happened, our model would be making the wrong decision for all rows that match the node’s criteria.

    In order to deal with this concern statistically, I’m doing a simple and common trick here (maybe you should allow yourself some minutes to think about that). Let’s change the concern to

    The real probability for “Bike Buyer =1” is exactly 50% but still the random sample of data we used during training resulted in 65 or more positive rows.

    For this concern we can simply use Excel’s binomial function to compute the probability:

    1-BINOM.DIST(65,106,0.5,TRUE)

    I used the Excel 2010 function here. If you’re using Excel 2007 or before, the function name is BINOMDIST. The above function returns a value of 0.7%. So it is very unlikely that our model training was based on a wrong decision.

    For the lines with a probability of less than 50% our concerns are just the other way round (of course you could also take the same rules as above and look for “Bike Buyer=0”). For our example I reference the first line in the table above:

    The real probability for “Bike Buyer = 1” in this specific node is 50% or higher but we still see 465 positive rows in our 1827 cases.

    Again, I’m transforming this to:

    The real probability for “Bike Buyer = 1” in this specific node is exactly 50% but we still see 465 or less positive rows in our 1827 cases.

    In this case, we have to use this function to compute the probability:

    BINOM.DIST(465,1827,0.5,TRUE)

    Again the probability is very low (almost exactly 0%).

    Let’s add this calculation to the Excel-table from above:

    image_thumb1

    The formula for E7 is

    =IF(C7<0.5,BINOM.DIST(B7,D7,0.5,TRUE),1-BINOM.DIST(B7,D7,0.5,TRUE))

    This formula is then copied for all lines below.

    How do we read this table? To keep things simple, low values in the last column are good (as in our two example values from above). The closer the predicted probability gets to 50% (or in other words, the weaker the prediction gets) the more support is needed to make it relevant.

    How do you work with this table? Let’s say you want to be 95% sure that the support is strong enough to prevent our mining system from being trained for the wrong response. In this case you would filter out nodes from the table above that have a value of 5% or above in the last column. This is the result

    image_thumb5

    In this case we would have filtered out 608 of the 12,939 cases used for the support (approx. 4.7%). What are you going to do with these cases? Well, this depends on your mining question. Typically you would consider these nodes to be bike buyers if you want to minimize the risk of losing potential customers, or you would consider them as not being bike buyers if your goal is to reduce costs (by excluding these cases from the mailing). In any case, you will have to ignore the response from the data mining process if you do not want to risk being misled.

    In our case it is ok to drop 608 cases, but what do you do if all cases have a very poor support? In this case, you should review your model carefully. Seeing a lot of cases with a low support usually means that your model is “over trained”. For example, your decision tree model has too many nodes. Mining works best when the real world can be simplified. For the decision tree you might try to increase the required support or set a higher value for the complexity penalty parameter.

    The more you want to be sure that your case support is good enough the more cases you will have to drop. For our example here are a few values:

    image_thumb7

    2. Decision based on the predicted values probability

    In this case we’re not so much interested in a simple decision (bike buyer true or false) but more in the probability for each case. This also means we have to consider different concerns (although similar to the ones above). The whole process is very similar to the one above, and if you are not too interested in the details you can just skip to the conclusion.

    Basically, we only need to replace the 50% in the example above with an alternative model’s probability derived from the predicted probability. However, as described above, our concerns differ for the cases that favor a positive or a negative result. So for our example, we’re considering a cut-off at 50% again.

    Let’s start with the lines having a probability higher than 50%. For example, let’s look at the 3rd line from the table above. The probability found during training was 61.28%, that is, 65 of the 106 rows had the Bike Buyer variable set to 1.

    In this situation our concern is that the real probability is less than 61.28% but still the random sample we used during training resulted in 65 positive rows. Ok, if the probability is still 61.279% this wouldn’t make much difference. Let’s try with a difference of 3%, so our concern (already transformed as described above) is:

    The real probability for “Bike Buyer = 1” in this specific node is exactly 58.28% but we still see 65 or more positive rows in our 106 cases.

    Again we’re using Excel’s binomial function to compute the probability:

    1-BINOM.DIST(65,106,0.5828,TRUE)

    The above function returns a value of 23.26%. Although it is below 50%, it is still significant (compare this to the 0.7% from above).

    For the lines with a probability of less than 50% our concerns are just the other way round. For our example I use the first line from the table above as a reference again:

    The real probability for “Bike Buyer = 1” in this specific node is exactly 28.46% but we still see 465 or less positive rows in our 1827 cases.

    To compute this in Excel you have to take the following formula:

    BINOM.DIST(465,1827,0.2846,TRUE)

    In this case we get a result of 0.22%, meaning it is extremely unlikely.

    I added this calculation as an additional column to my Excel table from above. This is the result:

    image_thumb9

    The formula for E7 is

    =IF(C7<0.5,BINOM.DIST(B7,D7,C7+$B$4,TRUE),1-BINOM.DIST(B7,D7,C7-$B$4,TRUE))

    This formula is then copied for all lines below.

    Again we can now filter the table for whatever certainty we would like to have. If we want to be 80% sure that our case support does not favor a model that is 3% off, we filter out all lines from the table above that have a value higher than 20% in the last column.

    image_thumb11

    A lot of nodes remain in this case (a lot more than in our discrete example from above): we would have to ignore about 19.4% of our predictions.

    Here are some sample results for the cases that need to be left out:

    image_thumb13

    As expected, the more certain you want to be, the more cases you have to drop. Also, the less tolerance you allow for your model, the more cases you have to drop. So it’s up to you to find the best mix of certainty and usability for the model.

    Sunday, July 11, 2010

    How to use an Excel 2003 file as a datasource for an SSAS cube

    SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

    Ok, usually you wouldn't want to use an Excel file directly as a datasource for an SSAS cube in real-life scenarios. But just in case you'd like to set up a quick demo without bothering to create a new database using Microsoft SQL Server or maybe Microsoft Access, the ability to source your cube from an Excel file can be more than welcome. Just imagine you want to try some design ideas. Working with SQL Server databases, you would end up with a large number of test databases or you would need to backup/restore your databases all the time to test different scenarios. With Excel as a source for your cubes you can put your test data right into your SSAS solution. In order to modify the datasource you can simply make a copy of your Excel file (for backing up the older version) or of your solution instead of caring about databases. And even if you don't have the databases installed, you can use any of your testing solutions by just opening the solution, as the data becomes part of the solution.

    Sounds good, doesn't it? But how can you do so? First, when trying to set Excel as an OLE-DB source you will notice that it just isn't there.

    image_thumb1

    But what we can do is use the Microsoft Jet 4.0 OLE DB Provider. So that's where we start. The next dialog asks us to provide the database file name.

    image3_thumb1

    If you click on the 'Browse...' button you will notice that the selection is limited to .mdb or .accdb files, as the Jet OLE-DB provider is usually used with Microsoft Access databases. So we just change our file type selection to 'All files' and pick our Excel file.

    image6_thumb2

    Now, if you click on 'Test connection' you will get an error message like the one below:

    image9_thumb3

    Our Jet OLE-DB provider still believes that we are connecting to a Microsoft Access database file and therefore it cannot connect. So here comes the really important step. We open our connection again, click on the 'Edit' button to edit the connection string and then switch to the 'All' tab of the connection properties.

    image12_thumb2

    As shown in the screenshot we have to set the extended properties to 'Excel 8.0;HDR=Yes;IMEX=1;'.

    Excel 8.0 stands for Excel 2003 (I couldn't get Excel 2007 to connect properly using 'Excel 9.0', so I stayed with the Excel 2003 format here). 'HDR=Yes' means that our Excel tables contain headers.
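    For reference, the complete connection string should then look roughly like this (the file path is just an example):

    Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Demo\CubeSource.xls;Extended Properties="Excel 8.0;HDR=Yes;IMEX=1;"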

    After that, a click on 'Test Connections' gives the desired result:

    image15_thumb2

    Of course, we now need to build up our Excel-file. Each "source table" sits on its own sheet. You can easily build up some time dimension or use Excel functions like RAND() to create random fact data or VLOOKUP(...) to link your tables with testing data to each other.

    image18_thumb2

    Although this is not at all useful for real-life situations (as we would extract the data from the Excel sheet using ETL tools or simply not store the source data in Excel at all), it might still be useful for setting up a quick and dirty example solution and playing around with the source data (add columns, use different formats etc.) without the need to work on a 'real' database.

    Saturday, June 12, 2010

    10 Tips for every SSAS developer

    SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

    Within BIDS, our development environment for SSAS applications, the Cube Wizard does a great job. With the wizard you get all the dimensions, the cube and the proper links between the cube and the dimensions. It even takes care of role-playing dimensions and names them accordingly. But although it does a great job, the created cube is not ready for the end users. Recently I saw some cubes that were left just in the state where the wizard left them. So today’s post is about the most essential steps to perform after the cube wizard has finished creating the cube. I won’t go into too much detail here. Some of the topics have been addressed in other posts, and maybe I’ll write about further topics in upcoming posts. So the following list is more like a checklist to verify your development.

     

    1. Make the cube user friendly / do not use technical names

    While the operational databases belong to a technical area, the OLAP solution has to be clear and easy to understand for the end users. Avoid technical names, use abbreviations only if they are commonly known and well documented and don’t prefix dimensions with ‘Dim’ or ‘D_’ or something like that (same with the fact-tables). The user wants to analyze sales by time not FactSales by DimTime.

     

    2. Check all dimensions for their attributes

    The cube wizard cannot know which attributes are important for a dimension. Check each attribute carefully. Delete attributes that are not needed at all. It does not make much sense to develop for future requirements. This only makes the cube harder to understand for the end users. Also, think about the usage of the attributes. For example, having an attribute “OptIn” for a CRM cube with members yes and no may not be ideal for usage in a pivot table as you would only see yes or no on the column/row. It is better to have “With OptIn” and “Without OptIn” as members because here, the meaning is immediately clear. Of course, numeric flags like gender being 0 or 1 also do not make much sense. Members should be clear and readable.

     

    3. Create appropriate attribute relationships and create meaningful hierarchies

    Attribute relationship-settings and hierarchies are the key to build a robust OLAP solution that enables query developers to write good MDX queries and also results in a better performance of the cube. However, attribute relationships can be tricky, so make sure you fully understand this topic. You should also use tools like BIDSHelper’s dimension health check to make sure that your attribute relationship really matches the source data.
    For further explanation see http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html or http://ms-olap.blogspot.com/2008/11/turning-non-natural-hierarchy-into.html

     

    4. Provide properties as needed (not every attribute of your dimension is needed as an attribute hierarchy!)

    Define only those fields of your dimension table as attributes that you need either for hierarchies or as a filter or for pivot axis. Fields that are just informational (like telephone number of the employee for example) don’t make much sense as attribute hierarchies. Set the property “AttributeHierarchyEnabled” to false for these attributes. They are then shown grayed-out in the dimension designer. However, you can still use them in your OLAP tool, for example in Excel. It is important to understand how your attribute relationships are defined here.

     

    5. Set the format for measures and calculations

    The standard format for numbers is just a simple number format. This usually looks ugly in the cube browser or other frontend tools. It’s not a big deal to set the format for every measure. For example, you could use “#,##0.00” for numbers to get a nicely formatted two digit number representation. You should also do this for calculated measures and cube calculations.
    For further explanation see http://ms-olap.blogspot.com/2009/11/how-to-define-excel-compliant-format.html

     

    6. Set the proper dimension type. Especially, define a time dimension

    Setting the dimension type of your time dimension to “time” makes it much easier for your OLAP client (like Excel) to provide special time dependent filtering options. Also, if you’re planning to use semi-additive measures, the time dimension must be marked accordingly.
    Furthermore you also need to qualify the type of the time dimension’s attributes (using Years, HalfYears, Quarters, Months and Date). This is necessary for MDX functions like YTD(…), MTD(…) and ParallelPeriod(…) and for the “Add Time Intelligence” wizard.

     

    7. Define the default measure for the cube

    The default measure is used when no measure is on an axis or in the slicer (WHERE expression) of the query. If not set, SSAS takes the first measure of the first measure group. This can easily change when new measures or measure groups are added to the cube. Queries that rely on the default measure (which is in turn not a good practice) will then return a different result. That’s why it is important to set a default measure.

     

    8. Define the default member for dimensions/attributes which are not aggregatable

    Similar to the default measure for the cube, each dimension has a default member that is used, if the dimension is not explicitly present in the query. For example, think of a dimension ‘Scenario’ with members like Actual, Forecast, Midterm plan etc. It does not make sense to aggregate the different members so you will usually set the IsAggregatable property to false. In this case you don’t have an All-element and you should provide a default member (in our example it would usually be ‘Actual’). If not specified, the first member is used which could again lead to errors if new members are created or the member names change. Keep in mind, that there are three ways to specify the default member: in the dimension itself, in the cube script (also useful for role-playing dimensions with different names) and in the security role definition.

     

    9. Create a drill-through action for every measure group and make this the default action

    Using a drill-through action shows the detail data behind a given cell. However, the default drill-through method only shows the name information of the dimensions’ key attributes. Often enough we would see meaningless surrogate keys here, which are of no use to our end users. By defining a drill-through action the developer can set the displayed attributes for each dimension, making the result of the drill-through query much more readable. In addition, making this action the default action also works well with clients like Microsoft Excel, where a double-click on a cell triggers the default action.

     

    10. Properly deal with unrelated dimensions

    By default, values on unrelated dimensions are shown as the All-member of the unrelated dimension. For example, in Adventure Works, measures from the Internet Sales measure group are not linked to a reseller (obviously). However, if you try to analyze internet sales by reseller (for example by reseller type) you will see the same value (the total of sales) for each reseller. This is confusing. The property to control this behavior is the IgnoreUnrelatedDimensions setting in the measure group properties. When setting this to ‘false’ values for unrelated dimensions are not shown anymore. This is much easier to understand for the end users.
    For further explanation see http://ms-olap.blogspot.com/2010/04/properly-showing-values-for-unrelated.html

     

    What’s next?

    So that’s all? Of course not. First, all the tips from above do not prevent a bad architecture or cube structure. A good architecture is important for each Business Intelligence solution and there are many good books (for example Kimball’s “The Data Warehouse Toolkit”) out there about setting up such an infrastructure. In general, building an SSAS solution on top of a well made dimensional model instead of some OLTP systems is a good idea.

    And of course, the tips from above are not complete in any sense. If you asked someone else, you might get different tips. For example, it’s also important to set up the cube security or to define some time intelligence for easier analysis (like YTD computations or year-over-year growth etc.). It is also a good practice to simplify complex cubes with many measure groups and dimensions by using perspectives to make them easier to understand for the end user. And as many measures/calculations are not self-explanatory just from their name, it is a good idea to link an html document with a specification or helpful text to the member. You can use a cube action for this.

    However, I guess you all have many more good ideas and best practices. If you like, I would be very happy if you could share these tips with me and - who knows – maybe there will be a post like “25 tips for every SSAS developer” in the near future.

     

    By the way, this is the second anniversary of my OLAP blog. It was quite an interesting time so far and I’m looking forward to more topics to write about. The main location of this blog is http://ms-olap.blogspot.com. The posts are mirrored to http://oraylis-olap.spaces.live.com/ and also reposted to the new blog system of ORAYLIS at http://blog.oraylis.de (here you can find more interesting posts about Sharepoint, Business Intelligence in general and much more). My posts are also readable by using some blog aggregators like http://www.ssas-info.com/ which I really recommend to learn more about SSAS, MDX and OLAP.