## Sunday, June 29, 2008

### Binomial distribution for a KPI status (part 2)

In the last post we discussed some background on statistics. Now let's turn this into a KPI target. If the parameters of the test are well known in advance the limits for our KPI can be computed for example using Excel formulas.

But if the parameters are based on user input, the computation of the thresholds has to be done in our OLAP engine (MDX). For example, in risk management we compute the Value-at-Risk (VaR) for each day. The VaR is a measure giving us the maximum loss with a certain confidence level. If our confidence level is e.g. 5% the VaR gives our maximum loss for 95 out of 100 days. In the remaining 5 days we expect our loss to be higher (so the VaR is actually a quantile of the probability function of our win/loss). The computation of the VaR requires some kind of model and during the backtesting process we make sure that the model is working properly. So our hypothesis is that the model is correct while the alternative hypthesis is that the model is wrong. Our model is correct if the VaR computed based on the model meets the really observed losses. We count the days in which the actual loss was higher than predicted by the VaR. If it's 5% we can be say, that our model is correct. If not, it may be that the model is incorrect or the sample isn't representative.

If the number of days (maybe the period) is a parameter (maybe a report parameter or a value taken from another dimension), we have to do the computation in the OLAP model.

For this, we have to start with the binomial distribution given by

where n is the total number of cases (days for which we're doing the backtesting), k is the number of observed losses higher than the predicted loss, p is the probability for such a higher loss (in our example 5%), q=1-p (the confidence level of our VaR calculation) and

For example for k=0 we get

As we've seen in the last post, we need the aggregated probability which looks like this:

Now we can define our KPI status for example as

The calculation of the above function is a little tricky and cannot easily be done in MDX. We could call the Excel-Function BINOMDIST, but I didn't get it to work (not all Excel functions are also available for MDX). So here is an approach using an SSAS stored procedure.

Instead of computing the sum over the binomials we actually compute the incomplete beta integral which works much better than dealing with the large numbers resulting from the factorials. There is an excellent set of statistic function available at http://www.alglib.net for free and I used the code from there (see http://www.alglib.net/specialfunctions/distributions/binomial.php to download the code).

An SSAS stored procedure doesn't have much requirements. In its minimal form, it consists of a single static class with one or more public static functions. The code has to be compiled to a class library (Assembly DLL-File) which can then be referenced from SSAS (by simply adding it to the assemblies using Management Studio). There are many code samples available at Codeplex: http://www.codeplex.com/ASStoredProcedures

In our case, the code for the class (C#) simply looks like this:

using System;

namespace ASStatistics
{
public class Statistic
{
public static double BinomialDistribution(int k, int n, double p) {
return binomialdistr.binomialdistribution(k, n, p);
}

}
}

The function call is taken from the AlgLib library above. The MDX code for calling the library function (once the Assembly is registered in SSAS) may look like this

WITH
MEMBER bindist AS
ASStatistics.BinomialDistribution(5, 500, 0.01)
SELECT
bindist ON 0

This gives the following result:

So, if the number of cases if fixed, say 500, the KPI status expression for our Backtesting KPI may look like this:

case
when ASStatistics.BinomialDistribution(KPIVALUE("Backtesting"),500,0.01) <= 0.95 then 1
when ASStatistics.BinomialDistribution(KPIVALUE("Backtesting"),500,0.01) <= 0.9999 then 0
else -1
end

The above confidence levels of 95% and 99.99% correspond to the Basel II recommendation for backtesting of the VaR.

## Saturday, June 28, 2008

### Great video on Data Mining available at Microsoft homepage for free

Rafal Lukawieki did a great presentation about data mining. The video is available at the Microsoft homepage and it's actually a level 400 session (although named "Introduction into Data Mining). But it also contains some background information and overview so it's really worth seeing it.

You can download or watch the movie online here:

http://www.microsoft.com/emea/spotlight/sessionh.aspx?videoid=865

## Saturday, June 21, 2008

### Binomial distribution for a KPI status (part 1)

This entry is not much about OLAP or MDX but gives some background on statistical tests that may be very useful in designing meaningful KPIs based on statistical hypothesis tests. In KPIs we usually compare actuals with goals and visualize the results as a traffic light. In most cases both the actual value and the target can be read from the cube. It is very common to have a dimension "Scenario" with members Actuals and Forecast or something like that.

But when dealing with statistical results, things are getting a little bit more difficult. For example, we're testing the quality of products in a manufacturing line. If the quality test is expensive we have to rely on a spot sample. Let's assume we want a quality of 99%. This means that only 1 of 100 produced items should fail the quality test. So for our spot test let's say we pick 500 out of 10000 produced items of a charge and perform the quality check on them. Depending on the result of the test we decide if we are going to keep the remaining 9500 items or not.

In test theory we say our hypothesis (aka null hypothesis, H0) for a statistical hypothesis test is that the quality of our charge is 99%. The alternative hypothesis (H1) would be that our hypothesis is wrong, meaning the quality is below 99%. So the question is, based on the result of our spot sample, do we keep the hypothesis or not. More precisely, our null hypothesis is, that the probability for a damaged items follows a binomial distribution with a single probability of p=1%. The binomial distribution function B(n,k,p) gives the probability that exactly k out of n events happen when the single probability for an event is p. In our example, B(500,5,1%) ~ 17.6% gives the probability that we observe exactly 5 damaged items in our 500 item spot sample, assuming that the probability for a damaged item is 1%.

Without going too deep into statistics, it's important to know that there are two kinds of error we have to face:

• type I error (aka alpha error or false positive)
• type II error (aka beta error or false negative)

A type I error means that we discard the null hypothesis (the whole charge would be thrown away) although it actually meets our quality criteria while a type II error means that we don't discard our hypotheses (we deliver the whole charge although it doesn't meet the criteria). It's impossible to minimize both kind of errors (see http://en.wikipedia.org/wiki/Type_I_and_type_II_errors for details) at the same time.

For our test let's take a look at the probabilities for the binomial distribution function (we'll cover the calculation of these probabilities in the next post but in the meanwhile you can use the Excel function BINOMDIST to compute these values):

 number of items failing the test single probability cumulated probability 0 0.7% 0.7% 1 3.3% 4.0% 2 8.4% 12.3% 3 14.0% 26.4% 4 17.6% 44.0% 5 17.6% 61.6% 6 14.7% 76.3% 7 10.5% 86.8% 8 6.5% 93.3% 9 3.6% 96.9% 10 1.8% 98.7% 11 0.8% 99.5% 12 0.3% 99.8% 13 0.1% 99.9% 14 0.0% 100.0%

When plotting the probability we clearly see the peak at about 5 items which corresponds to our hypothesis that the probability of a damaged item is 1% as 1% of 500 items gives 5 items.

If we find no damaged items in our spot-test we can be pretty sure that our charge meets the quality criteria. And if we find 14 damaged items? Well, obviously the charge shouldn't be delivered. But what do we do with maybe 8 damaged items? We have to define a threshold c. When X, the number of damaged items is less or equal than c we keep the charge, if X is greater than c we don't keep the charge. This definition makes our test-procedure: "take 500 out of 10000 items for the test. If more than c fail the test, discard the whole charge". But how to find c?

Usually we would set a confidence level alpha for the type I error. Then we could compute the proper value of c that matches this confidence level (i.e. the smallest value for c for which the type I error is below our confidence level alpha). In mathematical terms this would read

If c is zero we only keep charges that have no damaged items in the spot test. Let's compute the type I error in this case. We have to compute the probability that our hypothesis is abandoned although it is true. In mathematical terms this would read

The value can simply be taken from the table above as P(X>0|p=1%) = 1-P(X=0|p=1%) = 1-0.007=0.993

If we decide for an alpha of 5% we would choose c=9, meaning that we discard charges having more than 9 damaged items in the spot tests as this has a probability of about 3.1% (the sum of the single probability values from the table above for c=10, 11, 12, ...) which is below 5%.

On the opposite, let's compute the type II error for this case. Here we have to compute the probability that our hypothesis is kept although it isn't true. Our hypothesis is wrong if the probability for a damage is not 1%. For the calculation we assume a higher probability, for example 2%. So in mathematical terms this would read

The complete table for the type I and type II error probabilities looks like this:

 c alpha beta (2%) 0 99.3% 0.0% 1 96.0% 0.0% 2 87.7% 0.3% 3 73.6% 1.0% 4 56.0% 2.8% 5 38.4% 6.5% 6 23.7% 12.8% 7 13.2% 21.7% 8 6.7% 33.1% 9 3.1% 45.7% 10 1.3% 58.3% 11 0.5% 69.8% 12 0.2% 79.3% 13 0.1% 86.7% 14 0.0% 91.9%

So, if we set c=0 (very strict) it is very likely that we discard charges that are ok and very unlikely that we don't discard a charge although it is damaged. On the other hand, if we set c=14 it's very unlikely that we discard charges that are ok and very likely that we don't discard charges that are damaged.

Of course, test theory is much more complex. Besides alpha and beta you can also consider the so called power (1-beta) in order to decide if your test setup is significant. The power gives the probability for a statistical test, to decide for the alternative hypothesis in case the alternative hypothesis is correct. In general terms keeping the hypothesis while the power is low (i.e. the type II error probability is high) doesn't make much sense. Alpha may also be considered as the risk of the manufacturer while beta may be considered as the risk of the buyer (in case there is no warranty). The proper choose of alpha and beta determines much of the quality of the test. For example, Cohan suggests beta being less or equal to four times alpha (α≤4β, see Cohan, J. 1969, "Statistical Power Analysis for the Behavior Science") for medicial hypothesis tests. So if alpha is 5%, beta should be less or equal to 20% (power greater or equal to 80%). Of course you can increase the significance of your test (and reduce the error probability) by taking a bigger spot sample but on the other hand this is usually more expensive, so you have to find an optimal compromise between costs and significance of the test.

Now, finally back to our well known KPIs. Instead of using a test setup with only one threshold we could use a KPI with three different areas, for example like this:

 green probability of accepting a damaged charge is low (type II error) red probability of not accepting a correct charge is low (type I error) yellow hard to decide area in between

This might lead to the following KPI color definition:

 color condition values green if beta is below 20% 0 ... 6 red if alpha is below 5% 9 ... 500 yellow everything else 7 .. 8

Using this approach resembles the conflict of minimizing both errors in a good way.

Although the probabilities can be computed using the Excel function BINOMDIST, I didn't manage to get this called from MDX (you can call many, but not all Excel functions from within MDX using the Excel!functionname-syntax). So the next post will be more about how to compute this distribution function in MDX using an SSAS stored procedure.

## Tuesday, June 17, 2008

### MDX Studio by Mosha Pasumansky

Mosha wrote a nice tool for testing and debugging MDX queries. It can be downloaded via Skydrive:

http://cid-74f04d1ea28ece4e.skydrive.live.com/self.aspx/MDXStudio/v0.2.6/MDXStudio.exe

The tool is really helpful and you should check for new releases frequently.

### Calculating Quantiles in MDX

Microsoft MDX offers a huge set of mathematical function, which can be further enhanced by using Excel-functions. Excel-functions may be called as Excel!functionname but the performance for this approach is not really good. Besides that, not all of your every-day functions are available through MDX. Before using Excel-functions or writing your own SSAS stored procedures it's worth taking a look at the built-in capabilities of MDX.

In this post we're talking about the calulcation of quantiles. Quantiles are used e.g. for computing risk measures like the so called Value-At-Risk (VaR). The Value-At-Risk is usually a 1% or 5% quantile of the P&L distribution over a set of scenarios (typically created by a monte carlo simulation).

In the following MDX-sample we calculate a simple quantile over a set of days (listed in a dimension "Date"). In order to compute the p% quantile of a set of n values we first have to sort the values and take the element at position (n-1)*p%+1. Usually we will not end up with an integer here but we don't want to complicate things here too much.

So let's try this code in the Adventure Works cube script:

CREATE MEMBER CURRENTCUBE.[MEASURES].[Quantil5Prc] AS
max(bottomcount([Date].[Date].[Date],
int((Count([Date].[Date].[Date])-1)*0.05)+1,
[Measures].[Gross Profit]),[Measures].[Gross Profit]),
NON_EMPTY_BEHAVIOR = { [Measures].[Sales Amount], [Measures].[Total Product Cost] },
VISIBLE = 1 ;

We use a bottomcount here to get the smallest p% elements. By using the max-function we can easily get the biggest of those values (which is actually the last if we had sorted the complete set). The expression int(Count([Date].[Date].[Date]) gives the count of days in our date dimension. Be aware that if you have the unknown member enabled for the dimension this element is also counted so you should disable the unknown member or reduce the count by 2 instead of 1.

For Adventure Works I found 1158 days in the Date dimension. Calculating the quantile for single elements or sets of elements executes pretty fast. Of course, query performance depends on the value of p%. For calculating a 90% quantile with the above formula the same query takes longer than a 5% quantile. So it's better to provide a low p% and switch to topcount instead of sorting too many rows.

The simple test query I used looked like this:

select {[MEASURES].[Quantil5Prc]} on 0

The query results in the value of 986.6533\$ meaning that only in 5% of all days we have a gross profit below this value. Of course, you could put [Product].[Category].[Category] for example on the rows to get a more detailed result.

For further enhancements think of putting the p% value in a separate dimension with given numbers and a reasonable default member. Then you could easily switch the percentage of the quantile or compare different quantiles in one pivot table.

### Getting started with Microsoft OLAP and MDX

Fortunately, Microsoft offers the whole starter kit for SQL Server Analysis Services in its SQL Server 2005/2008 Developer Edition. The Developer Edition also includes samples and tutorials to get you started building your own cubes and reports.

There are still some interesting tools to make your life a lot easier:

Easily deploy MDX scripts without the need to process the cube (see below for another way to do this). This really saves time. BIDS helper can do a lot more things like design aggregations. It seamlessly integrates with Visual Studio and adds some useful context menu entries.
• MDX Script Performance Analyzer (Download at Codeplex)
The Performance Analyzer deconstructs your MDX cube script and activates each line separately while measuring the execution time for a given query. By doing this you can easily see the bottle necks of your MDX cube script.

Useful tip: If you need to deploy just the cube script and you don't have BIDS helper installed, you can also deploy the cube as usual but then click 'Cancel' in the process dialog. It's not necessary to process the cube if you've just changed the cube script. Since processing usually takes much longer than the deployment, this really saves time.

There are of course other useful tools in the SQL Server 2005/2008 package like the SQL Server Profiler that can also be used to profile MDX queries, cache and SSAS aggregations.

### Article on Datamining in Computerwoche

The Computerwoche, a German computer magazine has just published an article of me about Datamining. The article is in German language and can be found here:

## Monday, June 16, 2008

This blog is dedicated to Microsoft OLAP technology and solutions using this architecture.

Features

• Microsoft OLAP Server (Microsoft SQL Server Analysis Services, SSAS)
• MDX query language, calculations and KPIs
• Architecture and performance tips & tricks
• Datamining