Search This Blog

Sunday, October 10, 2010

Do you trust your Data Mining results? – Part 2/3

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

While the first part of this post was more about the idea and interpreting the results of the test, this part shows how to implement the Monte Carlo test.

First, we need a table with the predicted data mining probabilities. This is the output of the PredictProbability function from your mining result query. I’m using the same source data as in my previous post here. If you like you can easily create your own table and populate it with random probability values in order to test the code for the simulation below:

CREATE TABLE [dbo].[Mining_Result](
[CaseKey] [int] NOT NULL,
[PredictScore] [float] NULL

declare @i int=0

while (@i<10000) begin

insert into Mining_Result(CaseKey, PredictScore)
values(@i, convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5)

set @i=@i+1


Don’t be confused by the convert(…cast…cast newid()…) expression. This is just my approach to calculate a random number within an SQL select statement.

Next we need a table for storing our Mining results:

CREATE TABLE [dbo].[Mining_Histogram](
[NumCases] [int] NOT NULL,
[Count] [int] NULL,
[Perc] [float] NULL,
[RunningPerc] [float] NULL,
[NumCases] ASC

Then this is how we’re doing our Monte Carlo test:

truncate table Mining_Histogram;

declare @numtrials int = 10000;
declare @cnt int;
declare @lp int;

set @lp=0;

-- perform a monte carlo test:

while (@lp<@numtrials) begin

select @cnt=COUNT(*) from Mining_Result where PredictScore >
convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5
if exists(select NumCases from Mining_Histogram Where NumCases=@cnt)
update Mining_Histogram set [Count]=[Count]+1 Where NumCases=@cnt
insert into Mining_Histogram(NumCases,[Count]) values (@cnt, 1)
set @lp=@lp+1;


I’m using the same trick for the random numbers as shown above. In this example, we’re doing 10,000 iterations. For each iterations we compute the number of cases for which the Predicted Score is higher than a random number. For example, if for a certain case the predict score is 0.8 it is more likely that a random number between 0.0 and 1.0 is below the score than for a prediction score of 0.1.

Next, we’re filling the gaps in our histogram table with zeros to make the histogram look nicer:

declare @min int;
declare @max int;
select @min=MIN(NumCases), @max=MAX(NumCases) from Mining_Histogram

set @lp=@min;
while (@lp<@max) begin
if not exists(select NumCases From Mining_Histogram Where NumCases=@lp)
insert into Mining_Histogram(NumCases,[Count]) values (@lp, 0);
set @lp=@lp+1

Finally we’re computing the row probability and the running total using this T-SQL:

declare @maxcount float;
select @maxcount=SUM([Count]) from Mining_Histogram;
update Mining_Histogram Set Perc=[Count]/@maxcount;

declare @CaseIdx int
declare @perc float
declare @RunningTotal float =0

DECLARE rt_cursor CURSOR FOR select NumCases, Perc From Mining_Histogram
OPEN rt_cursor

FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc

SET @RunningTotal = @RunningTotal + @perc
update Mining_Histogram set RunningPerc=@RunningTotal Where NumCases=@CaseIdx
FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc

CLOSE rt_cursor
DEALLOCATE rt_cursor

After running the simulation this is how the plots of the result look like (using my own values). The first plot shows the value of the field NumCases on the x-axis and the value of the field perc on the y-axis. The second plot has the same x-axis but shows the RunningPerc field on the y-axis:



These two plots look very much the same as the plots from my last post (although I used C# code there to generate the histogram data).

If you used the randomly generated scores from above for testing, you will notice the peak being around 5000 cases (instead of 2800 cases in my example).

And if you like a smoother version of the density function (as all the teeth and bumps mainly result from Monte Carlo approach), you could use this SQL query to compute a moving average:

declare @minrange int=0
declare @windowsize int = 50

select @minrange=Min(NumCases) from Mining_Histogram

SELECT H.NumCases, AVG(H1.[Count]) [Count], AVG(H1.Perc) Perc
FROM Mining_Histogram H
left join Mining_Histogram H1 on H1.NumCases between H.NumCases-@windowsize and H.NumCases

where H.NumCases>@minrange+@windowsize

group by H.NumCases


In order to do the histogram computation automatically with prediction query I recommend putting the code in an SSIS script component. I would also use another type of random number generator. This also allows you to set the seed for the random number generator. For my implementation I used an asynchronous script component that first loads all cases into memory (ArrayList collection), then performs the Monte Carlo test on the in-memory data and then writes the results back to the output buffer. This allows you do work with more scenarios and to log the progress during the loading and testing phase of the component.

I’m planning to write a Books Online Community Technical article on this topic. This article will be more detailed regarding the implementation. I will post a link to this article in my blog then.