Search This Blog

Sunday, April 24, 2016

Why doesn’t my decision tree recognize a simple pattern?

Data Mining algorithms are sometimes perceived to be extremly intelligent and established with many super powers, humans don’t have. In today’s post I’d like to show a simple example in which a mining model fails to detect a very simple rule in the data. This illustrates the need for feature engineering not only to enhance the data with new associations but also to include simple calculations, for example deltas instead of absolute values.

I’m starting with a very simple pattern of two variables (a, b) and a single output (y):


As you can see, “Y” equals TRUE, if “A” and “B” are identical (boolean operation is “not xor”). So, this is a very simple pattern, which should be easily detected by a data mining algorithm. In order to use this data as source for an algorithm, I manifolded this pattern 50 times to result in 200 rows of data with this pattern. In R this reads as follows:

> newdata <- data.frame(
> data <-"rbind", replicate(50, newdata, simplify = FALSE))
> data$y <- as.factor(data$a==data$b)

Let’s start with a simple decision tree based on this data:

> fit <- rpart(y~.,data)

The result shows that no split has been detected in the tree or – in other words – the rpart model didn’t understand our simple rule:

n= 200

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 200 100 FALSE (0.5000000 0.5000000) *

Instead, all observations are treated as “FALSE” in our root node. Consequently, if we ask the model to do the prediction, it always responds with the answer “false”:

> cbind(newdata,predict(fit,newdata, type="class"))

  a b predict(fit, newdata, type = "class")
1 1 0                                 FALSE
2 0 0                                 FALSE
3 1 1                                 FALSE
4 0 1                                 FALSE

The reason for this behavior is easy to understand if you look at the decision the model has to make for the first split. Since our data cases are perfectly balanced, neither “A” nor “B” seems to have any effect on the output. For example, if “A” equals zero you find “Y” having 50% true and 50% false (see the table from above). The same situation is seens with “A” equals one. And the same situation is true for “B”. So the feature selection algorithm decides that none of our input variables has a univerariate influence on the output variable.

> table(data$a, data$y)

0    50   50
1    50   50

So the reason for our decision tree not detecting the pattern is the unnatural perfectly balanced structure of the data. If we use a random forest instead, the draws wouldn’t be equally balanced and the algorithm would detect the pattern:

> fit <- randomForest(y~.,data)
> plot(fit)
> cbind(newdata,predict(fit,newdata, type="class"), predict(fit,newdata, type="prob"))

  a b predict(fit, newdata, type = "class") FALSE  TRUE
1 1 0                                 FALSE 0.728 0.272
2 0 0                                  TRUE 0.264 0.736
3 1 1                                  TRUE 0.218 0.782
4 0 1                                 FALSE 0.752 0.248

Ok, this looks much better. And also for our decision tree, if we take a sample (75% of the data rows resulting in 150 rows for this example), the result looks much better. Here’s the resulting tree:


As you can see, the leaf nodes are “perfect” with no error. So our pattern has been detected in this case. Of course you wouldn’t expect such a perfect distribution as I constructed for this demonstration, but still this effect could be an issue for your model. As it is unlikely to have such a balanced distribution in real life, it is as likely that you’re dealing with more than just two input variables. So, let’s add a third variable which approximates “Y” with a success rate of 90%:

> data$c<-as.factor(ifelse(data$y==TRUE,ifelse(runif(n=nrow(data))<=.9,1,0),ifelse(runif(n=nrow(data))<=.9,0,1)))

Now “C” is a good but not perfect approximation of “Y”. Let’s retrain the tree using “A”, “B” and “C” as input variables. As you can see from the tree visualization below, the algorithm chose the “easy” way and used our approximation as the only split condition, instead of detecting the pattern (as in the case where “C” was missing):


As a consequence our true prediction rate goes down from 100% (pattern detected) to about 90% (approximation variable used). 14 out of 150 cases failed. Ok, this is still good but the point is, that we’re getting better results if we think about the way splits are calculated and therefore add additional attributes (in this case we could have added “A==B” as a helper attribute) to support the decisions. There are many more examples:

  • Instead of absolute values (temperature 26°, 28°, 32°) it might be better to add relative (delta) values (temperature delta 2°, 4°)
  • Instead of two absolute variables (energy consumption and workload) it might be a good idea to also add the ratio (energy consumption / workload)
  • Difference/Change of a value compared to the average (or a regression) of a given time range before (for example, relative change of a value compared to the linear regression of the last 3 hours).
  • As shown in the example above: for two variables it might be good to add a measure of similarity (equal or almost equal)


Data Mining algorithms like the decision tree sometimes struggle with simple patterns in data. Instead of feeding the data “as it is” into the algorithm, it’s often a good idea to add some derived information (delta, ratio, correlation, measure of equality etc.) to support the model.

Sunday, April 3, 2016

Using R Server in SQL Server 2016 to calculate the expected value for CRM leads (Monte Carlo method)

SQL Server 2016

In this post I’m demonstrating a use case for the new R Server (formerly known as Revolution R Enterprise) in SQL Server 2016 to calculate the expected value for CRM leads using a Monte Carlo approach.

To keep things simple, let’s assume we have the following leads in our CRM:


For example, the chance for winning lead number 7 with $1,000,000 is 10%. So what is the amount of incoming orders we can plan with (assuming the probabiltity for the individual lead is correct)? A common approach is to use a weighted sum (sum over probability times value), which is easy to calculate in T-SQL:

select sum([Probability]*[Value]) ExpectedValue from CRM


While this approach works well with a large number of leads of similar size, for the example above we have to realize that $100,000 of the $256,000 result from the relatively unlikely win of lead number 7. And in fact, we could win or loose this lead which means a value of 0 or 1 million but nothing in between. So this approach may be misleading with skewed data.

Another approach is to use a threshold and only count the leads with a probability above the threshold. The query would look somewhat like this:

select sum([Value]) ExpectedValue from CRM where [Probability]>=.7


Here we’re only counting leads with a probability of at least 70%. We just need to be sure not to understand the threshold of 70% as a probability here. It would be wrong to interpret the result in a way like “with a probability of 70% we can expect incoming orders of at least $52,000”. The reason is that each lead can be a win or loss independently from the other leads.

So, what could be a more realistic method to estimate the expected value of the leads from above? One idea could be to simulate cases where each lead can be converted in an order or not at the individual probability of the lead. If we run say 100,000 such simulations we can look at the distribution of the results to get a better understanding of the resulting total. This approach is called Monte Carlo method. While we could implement this in T-SQL (for example look at an older blog post of mine about Monte Carlo in T-SQL), it’s easier to do so in R and with the new R Server capabilities in SQL Server 2016 we can better use this to do the calculation (see here for the basics about T-SQL stored procedures in R).

Let’s start with the resulting procedure code before I go into more details:

EXEC sp_execute_external_script
@language = N'R'
, @script = N'
eval<-function() {sum(ifelse(runif(min = 0, max=1, n=nrow(mydata))<=mydata$Probability, mydata$Value,0))}
q<-quantile(r,  probs = c(0.1, 0.5, 1, 2, 5, 10,25,50,100)/100)
, @input_data_1 = N'select  CRMID, Probability, Value from CRM'
, @input_data_1_name=N'mydata'
, @output_data_1_name = N'result'
  [value] float
  ,quantile nvarchar(10)

The R script itself is marked in blue here. I runs 100,000 random experiments on our input data. In each experiment, 7 (the number of rows in our dataset) evenly distributed random values. Only if the random value is below the given probability of the lead (which happens more rarely the smaller the value of the given probability is) the value is accounted. We then calculate quantiles and return the result as a SQL table.

Here is the result of this T-SQL statement:


How do we read this result? Here are some examples:

  • Line 6: In 10% of the cases, the value was below $52,000 and, consequently, in 90% of the cases, the value was above $52,000
  • Line 2:  In 99.5% of the cases that value was above $15,000
  • Line 5: In 95% of all cases the value was above $37,000

Or, in other words, at a confidence level of 90% we can assume to result in a value of at least $52,000 here. So this approach does not only give a single value but allows you to understand the expected result based on a given confidence.

Of course, T-SQL would not be a good choice to develop and test even a small R script as the one above. Usually when working with R you’re following a more interactive approach. I suggest developing the script in an interactive R tool like RStudio. In order to do so, I’m using same simple wrapper code to provide the data set from SQL Server as shown below:

db_connection <-odbcDriverConnect(
  paste("Driver={SQL Server Native Client 11.0}",
         "Trusted_Connection=yes", sep=";")

mydata<-sqlQuery(db_connection, "select CRMID, [Probability], [Value] from CRM")


eval<-function() {sum(ifelse(runif(min = 0, max=1, n=nrow(mydata))<=mydata$Probability, mydata$Value,0))}
q<-quantile(r,  probs = c(0.1, 0.5, 1, 2, 5, 10,25,50,100)/100)



Again, the code in blue is the final R code which we can copy over to our T-SQL function for production. The RStudio environment allows us to interactively develop the script. The surrounding code loads the data into a data frame with the same result as in our T-SQL sp_execute_external_script call. The method used by SQL Server is much more efficient, however for testing purposes the ODBC call is sufficient.

For example, we can plot a histogram in R:



This shows the distribution of our cases. For example, there are no cases that end with a value of between 500,000 and 1 million. Or we could plot a density function for the distribution:

h<-hist(r, plot = F, breaks=1000)
plot(x=h$breaks[-1], y=(100000-cumsum(h$counts))/100000, type="l", ylim=c(0,1))




While  R is mostly known for machine learning and advanced statistical calculations it may also be useful for simple simulations like the one above where we analyzed the distribution of leads in CRM and calculated an expected value based on a confidence. Doing the same in T-SQL would result in quite a lot of SQL code which in turn makes it more difficult to read and understand the procedure (compared to our short R script). Another option would be to put the same code in a CLR library but then we would have to deploy the library separately instead of keeping the code in the database. However, developing the R code with SQL Server tools like Management Studio is not much fun. Instead, we used a short wrapper around the code to develop and test the code in an interactive R GUI like RStudio.