

Sunday, April 24, 2016

Why doesn’t my decision tree recognize a simple pattern?

Data mining algorithms are sometimes perceived as extremely intelligent, equipped with super powers humans don't have. In today's post I'd like to show a simple example in which a mining model fails to detect a very simple rule in the data. This illustrates the need for feature engineering, not only to enrich the data with new associations but also to include simple calculations, for example deltas instead of absolute values.

I’m starting with a very simple pattern of two variables (a, b) and a single output (y):


As you can see, “Y” equals TRUE if “A” and “B” are identical (the boolean operation is “not xor”). So this is a very simple pattern which should be easily detected by a data mining algorithm. In order to use this data as the source for an algorithm, I replicated this pattern 50 times, resulting in 200 rows of data. In R this reads as follows:

> newdata <- data.frame(a=c(1,0,1,0), b=c(0,0,1,1))
> data <-, replicate(50, newdata, simplify = FALSE))
> data$y <- as.factor(data$a==data$b)

Let’s start with a simple decision tree based on this data:

> library(rpart)
> fit <- rpart(y~., data)

The result shows that no split has been detected in the tree or – in other words – the rpart model didn’t understand our simple rule:

n= 200

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 200 100 FALSE (0.5000000 0.5000000) *

Instead, all observations are treated as “FALSE” in our root node. Consequently, if we ask the model to do the prediction, it always responds with the answer “false”:

> cbind(newdata,predict(fit,newdata, type="class"))

  a b predict(fit, newdata, type = "class")
1 1 0                                 FALSE
2 0 0                                 FALSE
3 1 1                                 FALSE
4 0 1                                 FALSE

The reason for this behavior is easy to understand if you look at the decision the model has to make for the first split. Since our data cases are perfectly balanced, neither “A” nor “B” seems to have any effect on the output. For example, if “A” equals zero you find “Y” having 50% true and 50% false (see the table from above). The same situation is seen with “A” equals one, and the same is true for “B”. So the feature selection algorithm decides that none of our input variables has a univariate influence on the output variable.

> table(data$a, data$y)

    FALSE TRUE
  0    50   50
  1    50   50
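The balanced counts are easy to verify with a quick sketch (Python here, mirroring the R table() call; the data is the same four-row pattern replicated 50 times):

```python
from collections import Counter

# Rebuild the 200-row data set: the 4-row pattern (a, b) replicated 50 times,
# with y = (a == b), i.e. the boolean "not xor".
pattern = [(1, 0), (0, 0), (1, 1), (0, 1)]
data = [(a, b, a == b) for a, b in pattern for _ in range(50)]

# Count y conditional on a alone: every cell is 50, so "a" carries
# no univariate information about y -- and the same holds for "b".
table = Counter((a, y) for a, b, y in data)
print(table)  # each of the four (a, y) combinations occurs 50 times
```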

So the reason for our decision tree not detecting the pattern is the unnaturally perfect balance of the data. If we use a random forest instead, the bootstrap draws wouldn't be perfectly balanced and the algorithm would detect the pattern:

> library(randomForest)
> fit <- randomForest(y~., data)
> plot(fit)
> cbind(newdata, predict(fit, newdata, type="class"), predict(fit, newdata, type="prob"))

  a b predict(fit, newdata, type = "class") FALSE  TRUE
1 1 0                                 FALSE 0.728 0.272
2 0 0                                  TRUE 0.264 0.736
3 1 1                                  TRUE 0.218 0.782
4 0 1                                 FALSE 0.752 0.248

Ok, this looks much better. And the same holds for our decision tree: if we take a random sample (75% of the data rows, i.e. 150 rows in this example), the result looks much better. Here's the resulting tree:


As you can see, the leaf nodes are “perfect” with no error. So our pattern has been detected in this case. Of course you wouldn't expect such a perfect distribution as I constructed for this demonstration, but this effect could still be an issue for your model. While it is unlikely to have such a balanced distribution in real life, it is likely that you're dealing with more than just two input variables. So, let's add a third variable which approximates “Y” with a success rate of 90%:

> data$c <- as.factor(ifelse(data$y==TRUE,
+                     ifelse(runif(n=nrow(data))<=.9, 1, 0),
+                     ifelse(runif(n=nrow(data))<=.9, 0, 1)))

Now “C” is a good but not perfect approximation of “Y”. Let’s retrain the tree using “A”, “B” and “C” as input variables. As you can see from the tree visualization below, the algorithm chose the “easy” way and used our approximation as the only split condition, instead of detecting the pattern (as in the case where “C” was missing):


As a consequence our true prediction rate goes down from 100% (pattern detected) to about 90% (approximation variable used): 14 out of 150 cases failed. Ok, this is still good, but the point is that we get better results if we think about the way splits are calculated and add additional attributes (in this case we could have added “A==B” as a helper attribute) to support the decisions. There are many more examples:

  • Instead of absolute values (temperature 26°, 28°, 32°) it might be better to add relative (delta) values (temperature delta 2°, 4°)
  • Instead of two absolute variables (energy consumption and workload) it might be a good idea to also add the ratio (energy consumption / workload)
  • Difference/Change of a value compared to the average (or a regression) of a given time range before (for example, relative change of a value compared to the linear regression of the last 3 hours).
  • As shown in the example above: for two variables it might be good to add a measure of similarity (equal or almost equal)
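A sketch of the last bullet: adding an equality measure as a derived attribute turns the invisible pattern into a single perfect split (Python; the helper field name `a_eq_b` is just an illustration):

```python
# The same balanced "not xor" data as above.
pattern = [(1, 0), (0, 0), (1, 1), (0, 1)]
rows = [{"a": a, "b": b, "y": a == b} for a, b in pattern for _ in range(50)]

# Derived helper attribute: a == b. Conditioned on this single feature,
# the classes separate perfectly, so even a one-split tree finds the rule.
for r in rows:
    r["a_eq_b"] = int(r["a"] == r["b"])

purity = {
    0: {r["y"] for r in rows if r["a_eq_b"] == 0},
    1: {r["y"] for r in rows if r["a_eq_b"] == 1},
}
print(purity)  # {0: {False}, 1: {True}} -- each branch is pure
```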


Data Mining algorithms like the decision tree sometimes struggle with simple patterns in data. Instead of feeding the data “as it is” into the algorithm, it’s often a good idea to add some derived information (delta, ratio, correlation, measure of equality etc.) to support the model.

Sunday, April 3, 2016

Using R Server in SQL Server 2016 to calculate the expected value for CRM leads (Monte Carlo method)

SQL Server 2016

In this post I’m demonstrating a use case for the new R Server (formerly known as Revolution R Enterprise) in SQL Server 2016 to calculate the expected value for CRM leads using a Monte Carlo approach.

To keep things simple, let’s assume we have the following leads in our CRM:


For example, the chance of winning lead number 7 with $1,000,000 is 10%. So what is the amount of incoming orders we can plan with (assuming the probability for the individual lead is correct)? A common approach is to use a weighted sum (sum over probability times value), which is easy to calculate in T-SQL:

select sum([Probability]*[Value]) ExpectedValue from CRM


While this approach works well with a large number of leads of similar size, for the example above we have to realize that $100,000 of the $256,000 results from the relatively unlikely win of lead number 7. And in fact, we could win or lose this lead, which means a value of 0 or 1 million but nothing in between. So this approach may be misleading with skewed data.
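The point is easy to check with a small sketch (Python; the lead list is hypothetical except for lead 7, which the post states as $1,000,000 at 10%):

```python
# Hypothetical CRM leads as (probability, value); only lead 7 is from the post.
leads = [(0.9, 40_000), (0.8, 50_000), (0.7, 60_000), (0.1, 1_000_000)]

# Weighted sum (sum of probability * value), as in the T-SQL query.
expected = sum(p * v for p, v in leads)
print(expected)

# Lead 7 alone contributes 0.1 * 1,000,000 = $100,000 to the weighted sum,
# although it will actually end at either $0 or $1,000,000 -- nothing in
# between -- which is why the single number can mislead.
print(0.1 * 1_000_000)
```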

Another approach is to use a threshold and only count the leads with a probability above the threshold. The query would look somewhat like this:

select sum([Value]) ExpectedValue from CRM where [Probability]>=.7


Here we're only counting leads with a probability of at least 70%. We just need to be careful not to misread the threshold of 70% as a probability here. It would be wrong to interpret the result as “with a probability of 70% we can expect incoming orders of at least $52,000”. The reason is that each lead can be won or lost independently of the other leads.

So, what could be a more realistic method to estimate the expected value of the leads from above? One idea is to simulate cases in which each lead is converted into an order (or not) at the individual probability of the lead. If we run, say, 100,000 such simulations we can look at the distribution of the results to get a better understanding of the resulting total. This approach is called the Monte Carlo method. While we could implement this in T-SQL (for example, look at an older blog post of mine about Monte Carlo in T-SQL), it's easier to do in R, and with the new R Server capabilities in SQL Server 2016 we can run the calculation directly in the database (see here for the basics about T-SQL stored procedures in R).

Let’s start with the resulting procedure code before I go into more details:

EXEC sp_execute_external_script
@language = N'R'
, @script = N'
eval<-function() {sum(ifelse(runif(min = 0, max = 1, n = nrow(mydata)) <= mydata$Probability, mydata$Value, 0))}
r<-replicate(100000, eval())
q<-quantile(r, probs = c(0.1, 0.5, 1, 2, 5, 10, 25, 50, 100)/100)
result<-data.frame(value = q, quantile = names(q))
'
, @input_data_1 = N'select CRMID, Probability, Value from CRM'
, @input_data_1_name = N'mydata'
, @output_data_1_name = N'result'
WITH RESULT SETS ((
  [value] float
  ,quantile nvarchar(10)
))
The R script itself is marked in blue here. It runs 100,000 random experiments on our input data. In each experiment, 7 (the number of rows in our dataset) uniformly distributed random values are drawn. Only if the random value is below the given probability of the lead (which happens more rarely the smaller the given probability is) is the lead's value counted. We then calculate quantiles and return the result as a SQL table.

Here is the result of this T-SQL statement:


How do we read this result? Here are some examples:

  • Line 6: In 10% of the cases, the value was below $52,000 and, consequently, in 90% of the cases, the value was above $52,000
  • Line 2: In 99.5% of the cases the value was above $15,000
  • Line 5: In 95% of all cases the value was above $37,000

Or, in other words, at a confidence level of 90% we can assume to result in a value of at least $52,000 here. So this approach does not only give a single value but allows you to understand the expected result based on a given confidence.
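The same simulation is easy to sketch outside SQL Server (Python here instead of R; the lead data is hypothetical except for lead 7, and the quantile levels follow the post):

```python
import random

random.seed(42)

# Hypothetical leads as (probability, value); only lead 7 ($1M at 10%)
# is taken from the post.
leads = [(0.9, 40_000), (0.8, 50_000), (0.7, 60_000), (0.1, 1_000_000)]

def one_run():
    # Each lead converts independently with its own probability.
    return sum(v for p, v in leads if random.random() <= p)

results = sorted(one_run() for _ in range(100_000))

# Empirical lower quantiles: the value below which a fraction q of runs
# falls (mirrors R's quantile(r, probs = ...)).
for q in (0.01, 0.05, 0.10, 0.50):
    print(q, results[int(q * len(results))])
```

Reading the 0.10 line gives the value we can exceed with 90% confidence, exactly as in the T-SQL result above.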

Of course, T-SQL would not be a good choice to develop and test even a small R script like the one above. Usually when working with R you follow a more interactive approach. I suggest developing the script in an interactive R tool like RStudio. In order to do so, I'm using some simple wrapper code to provide the data set from SQL Server, as shown below:

library(RODBC)

db_connection <- odbcDriverConnect(
  paste("Driver={SQL Server Native Client 11.0}",
        "Trusted_Connection=yes", sep=";"))

mydata <- sqlQuery(db_connection, "select CRMID, [Probability], [Value] from CRM")

eval<-function() {sum(ifelse(runif(min = 0, max = 1, n = nrow(mydata)) <= mydata$Probability, mydata$Value, 0))}
r<-replicate(100000, eval())
q<-quantile(r, probs = c(0.1, 0.5, 1, 2, 5, 10, 25, 50, 100)/100)



Again, the code in blue is the final R code which we can copy over to our T-SQL procedure for production. The RStudio environment allows us to develop the script interactively. The surrounding code loads the data into a data frame with the same result as in our T-SQL sp_execute_external_script call. The method used by SQL Server is much more efficient; however, for testing purposes the ODBC call is sufficient.

For example, we can plot a histogram in R:



This shows the distribution of our cases. For example, there are no cases that end with a value of between 500,000 and 1 million. Or we could plot a density function for the distribution:

h<-hist(r, plot = F, breaks=1000)
plot(x=h$breaks[-1], y=(100000-cumsum(h$counts))/100000, type="l", ylim=c(0,1))




While R is mostly known for machine learning and advanced statistical calculations, it may also be useful for simple simulations like the one above, where we analyzed the distribution of leads in CRM and calculated an expected value based on a confidence level. Doing the same in T-SQL would require quite a lot of SQL code, which in turn makes the procedure more difficult to read and understand (compared to our short R script). Another option would be to put the same code in a CLR library, but then we would have to deploy the library separately instead of keeping the code in the database. However, developing the R code with SQL Server tools like Management Studio is not much fun. Instead, we used a short wrapper around the code to develop and test it in an interactive R GUI like RStudio.

Saturday, March 12, 2016

A common pitfall with correlations in timeseries

Correlations are often considered an important measure to understand the underlying (probably hidden) patterns in data sets. But as with other statistical measures, a complex situation (many variables, many rows of data) is reduced to a simple numeric value, which might be problematic. For example, Anscombe's quartet shows four totally different patterns of data with the same mean, variance, correlation and linear regression.


Source: Wikipedia (Anscombe's quartet)

Many articles about correlations also focus on the difference between correlations with and without causality. For example, there is a correlation between the sales of ice cream and the number of cases of sunburn. Having such a correlation allows us to estimate one fact (e.g. cases of sunburn) from the observations of another fact (e.g. sales of ice cream). However, there is no direct causality between the two, which means that you cannot control one fact with the other. For example, if you prohibit selling ice cream, you will not influence the cases of sunburn. The correlation however is still valid because the two facts share the same cause (hot temperature in summer because of high solar radiation). So, correlations can be divided into at least three classes:

  • causality (A => B)
  • correlation with a common reason (there exists a C with C=>A and C=>B and as a result A correlates with B)
  • correlation without a common reason / accidental correlations (there doesn’t exist any C with C=>A and C=>B and the correlation between A and B is purely random and not reliable)

While the first two are a reliable source for detecting patterns in data sets, accidental correlations might lead to wrong results. You can find a lot of strange and even funny correlations, for example on this website (the divorce rate in Maine perfectly correlates with the consumption of margarine). Another very prominent example is the correlation between media mentions of Jennifer Lawrence and the Dow Jones, which was explained by Tom Fawcett in his article “Avoiding common mistakes with time series”:



As a consequence, we need to avoid these false correlations. But are such correlations likely to appear, or are they a rare observation? Well, the sheer amount of internet sources showing funny examples like the one above could indicate that false correlations are not unlikely to happen.

In order to investigate this, I generated random walk processes based on normally distributed random values. Random walk processes are processes where the random variable is the delta between two points (not the absolute value of the point). Many real life processes can be described as random walk processes, for example temperature (for machine sensors or for weather data), birth rates, even sales (at least to some extent), so these processes are likely to be seen when analyzing time series data.
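A random walk in this sense is just the cumulative sum of random deltas; a minimal sketch (Python here instead of R):

```python
import random

random.seed(1)

def random_walk(steps):
    # The random variable is the delta between consecutive points,
    # not the point itself; the level is the running sum of deltas.
    level, path = 0.0, []
    for _ in range(steps):
        level += random.gauss(0, 1)
        path.append(level)
    return path

walk = random_walk(100)
print(walk[-1])  # the level after 100 unit-variance steps
```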

The setup for the experiment was to generate 10 variables with 100 observations each. Here is an example:



Next, I calculated the correlation matrix for the 10 variables to find strong positive/negative correlation:


As you can see we have some series with a strong correlation, especially with a strong positive correlation (deep blue = series 6 and series 10 with a correlation of .83) in this example. Plotting the series with the strong correlation looks like this:


Of course, not every combination correlates well, for example series 6 and 8 have a correlation near zero. But in this example with just 10 variables we found 2 with a good correlation although the series were created randomly. Since this was just a single random experiment with no statistical relevance, I repeated the experiment 100,000 times (Monte Carlo). The random number generator was Mersenne Twister, normally distributed random variables have been computed using Box-Muller. Here is the distribution of the best absolute correlation between two variables in each experiment:


It may be surprising that such a majority of our random experiments ended with at least two series with a good correlation. The result shows that in 99% of the experiments the best correlation was at least .7, in 88% it was at least .8 and in 34% it was at least .9.

This means that with a confidence of almost 90% we can expect a correlation of .8 in this random experiment, or in other words, it is very likely to find a strong correlation among 10 random walk processes (with 100 observations each). This also means that when analyzing data you should not blindly celebrate any correlation being found, but on the contrary be very skeptical about correlations in your data. You might argue that this effect is a result of our relatively short time series with only 100 values (“rows”) for 10 given variables. However, repeating the test with 10,000 values per variable gives exactly the same result. The most important influence is the number of variables. While the correlation is rather poor with 2 or 3 variables, you can be almost absolutely sure to find a correlation above .8 with 25 variables, as shown in the table and chart below (black line: correlation of .8 and above; the shaded area shows the range from correlation .7 to .9).
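The experiment is easy to reproduce with a small sketch (Python, pure standard library; far fewer repetitions than the 100,000 in the post, just enough to make the effect visible):

```python
import random
from itertools import combinations

random.seed(7)

def walk(n):
    # Random walk: cumulative sum of normally distributed deltas.
    level, out = 0.0, []
    for _ in range(n):
        level += random.gauss(0, 1)
        out.append(level)
    return out

def corr(x, y):
    # Pearson correlation, written out to avoid external dependencies.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def best_abs_corr(n_vars=10, n_obs=100):
    series = [walk(n_obs) for _ in range(n_vars)]
    return max(abs(corr(s, t)) for s, t in combinations(series, 2))

# 200 repetitions instead of 100,000; the effect is already obvious.
best = [best_abs_corr() for _ in range(200)]
share = sum(b >= 0.7 for b in best) / len(best)
print(share)  # close to the 99% reported in the post
```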




Correlations of variables in data sets are often put on the same level with meaning. One may think to have found a hidden pattern when a correlation is detected. However, this post shows that correlations are not rare, but are very likely to be discovered even in randomized data sets with only 10 variables. As a result, we need to carefully examine each correlation we find in a data set. A first step can be to use a holdout and to test the discovered correlations on the holdout data set (much like we use training and testing data sets in data mining).

Sunday, January 10, 2016

Fill down table in T-SQL (last-non-empty value)

SQL Server 2012-2016

A common task in data processing is to fill null values in a table with the latest existing value. For example, look at the following inventory table:


Let’s assume we only get an inventory record, if the inventory changes but we want to fill the missing values (for example to create a snapshot fact table).

The final result should look like this with null values being filled with the last existing value:


In Power Query (or “Get and Transform” if you're using Excel 2016) this is an easy task using the fill-down function, which does exactly what we want:


So, how can we get the same effect in T-SQL? We could use a sub query, but in this post I’m using a window function. The idea is to generate row groups for which each row group starts with a not-null value and has a unique identifier. This allows us to use the first_value function for each group to get the existing value for that group.

In order to build the row groups I start with a change indicator for the table values. The indicator is 1 if we have a balance, 0 if not. The resulting query is quite simple:

select *, case when UnitsBalance is null then 0 else 1 end ChangeIndicator from [dbo].[Inventory]


The next step is to create a running total over the change indicator. In fact, this running total already defines our row groups. I’m using the sum() over window function here. The query from above is converted into a common table expression (cte) as this is easier to read:

with help1 as (
select *, case when UnitsBalance is null then 0 else 1 end ChangeIndicator
from [dbo].[Inventory]
)
select *, Sum(ChangeIndicator) over (order by DateKey) RowGroup from help1
order by DateKey


As you see, the number for the row group increments with each existing balance value, which means that the first row of each group always contains the balance value (with the exception of the very first group, for which we don't have a value at all). In order to finalize the query, we just need the first_value function to get the value of the first row per row group:

with help1 as (
select *, case when UnitsBalance is null then 0 else 1 end ChangeIndicator
from [dbo].[Inventory]
)
, help2 as (
select *, Sum(ChangeIndicator) over (order by DateKey) RowGroup from help1
)
select *,

case when UnitsBalance is not null then UnitsBalance
else first_value(UnitsBalance) over (partition by RowGroup order by DateKey)
end UnitsBalanceFillDown

from help2
order by DateKey



So this is exactly what we wanted (if you remove the columns that we just needed during the process from the final output, you get exactly the result from above).
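For comparison, the same last-non-empty logic takes only a few lines in a procedural language (Python sketch; the inventory values are hypothetical):

```python
# Hypothetical inventory rows: (date_key, units_balance); None marks
# days without an inventory record.
rows = [(1, 100), (2, None), (3, None), (4, 80), (5, None), (6, 90)]

def fill_down(rows):
    # Carry the last existing balance forward, exactly like the
    # fill-down function in Power Query.
    last = None
    out = []
    for key, balance in rows:
        if balance is not None:
            last = balance  # a new "row group" starts here
        out.append((key, last))
    return out

print(fill_down(rows))
# [(1, 100), (2, 100), (3, 100), (4, 80), (5, 80), (6, 90)]
```

The window-function solution above does the same thing declaratively, set-based, inside the database.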



Windows functions in T-SQL are a good solution to fill gaps in your data set similar to last-non-empty.
As I'm using the first_value function here, just a reminder to avoid frustration: first_value ordered by date descending and last_value ordered by date ascending do not give the same result, because last_value is evaluated against the default window frame, which ends at the current row.

Friday, November 6, 2015

Using R in SQL Server 2016 CTP 3 to train and query a predictive model

SQL Server 2016 CTP 3
One of the exciting features of SQL Server 2016 is the R integration based on Revolution R Enterprise. The online documentation contains detailed information about the installation steps but only a few examples about how to use this new feature.
The installation process is explained here:
  • Install Revolution R Open for R Enterprise
  • Install Revolution R Enterprise
  • Register components in SQL Server and configure SQL Server to allow external scripts
The online documentation also contains two examples of how to call an R function from within SQL Server. The first example just returns a data frame from R (in this case the pre-defined iris data set):
CREATE PROC get_iris_dataset
AS
EXEC sp_execute_external_script
@language = N'R'
, @script = N'iris_data <- iris;'
, @input_data_1 = N''
, @output_data_1_name = N'iris_data'
WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100)));

You can see the call to the external script here. I changed the text color of the R statement to blue so it's easier to find within the SQL code. We have no input data for this simple case but return a table using the WITH RESULT SETS clause of the stored procedure.
Executing the procedure get_iris_dataset just shows the data from the R script in SQL Server. Here are the first rows of the result set.
I want to use this data as my training data, so I load the data into a SQL Server table. Usually, your training data would already be in a table so you wouldn’t need the procedure from above.
create table iris_data (
"Sepal.Length" float not null,
"Sepal.Width" float not null,
"Petal.Length" float not null,
"Petal.Width" float not null,
"Species" varchar(100)
)

INSERT INTO iris_data Exec get_iris_dataset

So, this copies the data from R into a SQL Server table. Next, I’d like to train an R model based on this data. Again, this code can be found in the documentation:
CREATE PROC generate_iris_model
AS
EXEC sp_execute_external_script
@language = N'R'
, @script = N'
library(e1071);
irismodel <- naiveBayes(iris_data[,1:4], iris_data[,5]);
trained_model <- data.frame(payload = as.raw(serialize(irismodel, connection=NULL)));
'
, @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
, @input_data_1_name = N'iris_data'
, @output_data_1_name = N'trained_model'
WITH RESULT SETS ((model varbinary(max)));

Again, the R code is in blue. If you haven't installed the library e1071 in R you have to do so before running this procedure. To do so, open the R console and run install.packages("e1071") as described here.
The interesting thing about this procedure is that we actually return the trained model as a varbinary object. In R the object class is “naiveBayes”, but here it is serialized to a raw data frame that is returned from the function. You could also save the model in R (using the ‘save’ command) but it's still interesting to see how we can store the model in SQL Server. To do so, I create a simple table and persist the trained model there:
CREATE TABLE [dbo].[model](
[model] [varbinary](max) NULL
)

insert into model exec generate_iris_model
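The serialize/store/unserialize round trip has a direct analogue in other languages; a Python sketch with pickle standing in for R's serialize() (the model object here is just an illustrative stand-in, not the naiveBayes model from the post):

```python
import pickle

# Stand-in for a trained model object; R serializes its naiveBayes model
# to a raw vector the same way before storing it in the varbinary column.
model = {"classes": ["setosa", "versicolor", "virginica"], "priors": [1/3] * 3}

blob = pickle.dumps(model)      # raw bytes, ready for a varbinary(max) column
restored = pickle.loads(blob)   # de-serialize before prediction

assert restored == model
print(len(blob), "bytes stored")
```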

So now we're getting to the more interesting part of the game, which took me some time to figure out. If we want to query the existing model, we have to de-serialize it and use a prediction function. For simplicity I'm using the same data set (iris) again to see how well the model performs on known data. First we load the model into a variable (again varbinary) and then we pass the model to the external script together with the data on which we'd like to do our prediction. Here is the final code:
declare @model varbinary(MAX)
select @model=model from model

EXEC sp_execute_external_script
@language = N'R'
, @script = N'
model <- unserialize(as.raw(iris_model))
pred <- predict(model, iris_data)
result <- cbind(iris_data, pred)
'
, @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
, @input_data_1_name = N'iris_data'
, @output_data_1_name = N'result'
, @params = N'@iris_model varbinary(MAX)'
, @iris_model= @model
WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null
,"Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100), "SpeciesPredicted" varchar(100)))

As you can see, I’m passing the data to the function using the @input_data_1 and @input_data_1_name parameters. Then I’m passing the model using the @params parameter of the script.
Here is a subset of the resulting rows:
As you can see, the model predicted the correct species for most of the cases although there are still some cases that haven’t been predicted correctly.
Actually the R script we executed here was quite simple, as shown here:
iris_data <- iris;
irismodel <-naiveBayes(iris_data[,1:4], iris_data[,5]);
result<-predict(irismodel, iris_data[,1:4])
cbind(iris_data, result)

However, the examples from above show how you can train a model, persist the trained model in SQL Server and then use this model with whatever data you like. So this makes it much easier to deploy R scripts into production than calling rscript.exe and use the RODBC interface to communicate with SQL Server.
Just a warning though: the examples from above were run on SQL Server 2016 CTP 3, which is not the final product. So when SQL Server 2016 is released, some of the functionality might still change.

Sunday, October 18, 2015

Trend in time series analysis


Time series analysis is widely used to forecast logistics, production or other business processes. Usually you want to understand if there is a trend or a seasonality in the time series. This could support forecasting and planning. However, there are different approaches to understanding trend. While trend often refers to historical changes of data, for me, trend is nothing that happens in the past (this is more like a historical drift), but trend implies a prediction of future behavior. Or, in other words, a positive trend means that it is likely that the growth continues.

Let’s illustrate this with a simple example:


Hmm, this clearly looks like there is a trend. In order to build up confidence, let’s add a linear regression for this graph:


Quite impressive. We could also train an auto arima model and do some forecasting of our data:


We clearly see that even a sophisticated arima model found a trend (drift) and forecasted the chart accordingly.

Trend detected! Case solved?

Talking about a trend in data always implies that it is likely that the process follows the trend (at least for the near future). Up till now, we only did some basic chart analysis, but I wrote nothing about the source, driver or generator of the data. Without knowing details about the nature of our time series we are on dangerous ground when it comes to forecasting. Please don't be disappointed, but in my case I used a so called random walk model to plot the graph from above. In a random walk model, the value for the next period is calculated relative to the value of the current period. The difference between one point and the next in my case is just a normally distributed random number. Random walk processes often look like they have trend or even seasonality. But for this example, the chart may go in either direction from here, it's purely random. Or in other words, it's dangerous to speak of a trend in this case. So whenever we look at trend detection we have to understand the reason why the trend is likely to extend into the future.

Ok, no trend! Case solved?

Obviously, since the chart was generated using a random number generator, we cannot really declare this a trend. But how likely is it, that we get such a drift in our random walk process? In order to understand this, I plotted 1,000 random walk processes (each with 100 steps). The line from above is actually one of them. Here’s the result:


As you can see, most of them are concentrated on a relatively small window indicated by the red lines (which are just plotted based on a square root function). Some of them went up to about +30, some down to about –30 but that’s not the major part.

If we take a look at the distribution of final values (the values we see after 100 steps), the histogram looks like this:


This looks pretty much like a normal distribution with a mean of 0 and a standard deviation of 10 which is again the square root of 100, so this also explains the red lines from the chart above (one standard deviation). What can we do with this information? Well, first we could take a look at the cumulated distribution (and since I’m not much interested in positive or negative trend I used absolute values):


How do we read this chart? It just gives us the percentage of cases which end outside a given tolerance from zero. For example, only 5% of all cases (50 out of 1,000) ended at a value bigger than 20.5 or smaller than -20.5 (the .95 confidence marker) and only 1% of all cases (10 out of 1,000) ended at a distance of more than 24.5 from zero (the .99 confidence marker). Our chart from above actually ended at 29.7, which corresponds to the 0.1% level. In other words, I used the single worst outlier (1 out of 1,000) of my set of lines to plot the chart at the beginning of this post.
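Both observations from above (standard deviation of the endpoint ≈ sqrt(100) = 10, and how rarely the walk ends far from zero) are easy to verify by simulation; a Python sketch:

```python
import random

random.seed(3)

def endpoint(steps=100):
    # Endpoint of a random walk: the sum of 100 independent
    # unit-variance normally distributed deltas.
    return sum(random.gauss(0, 1) for _ in range(steps))

ends = [endpoint() for _ in range(5_000)]

mean = sum(ends) / len(ends)
std = (sum((e - mean) ** 2 for e in ends) / len(ends)) ** 0.5
print(round(std, 1))  # close to sqrt(100) = 10

# Share of walks ending more than 20.5 away from zero -- roughly the
# 5% reported for the .95 confidence marker.
print(sum(abs(e) > 20.5 for e in ends) / len(ends))
```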

I’m … eh … confused …

So, what does this all lead up to? Trend or no trend? Well, since we know that I used a random number generator, frankly there’s no reason to believe that the trend continues and therefore it makes no sense to speak of a trend. However, if the chart was generated using real life data and we know that the underlying process follows a random walk model, it would be extremely unlikely that there is no drift or trend in the data. In our case, only 1 out of 1000 cases would behave in such an abnormal way otherwise. So, in this case I’d be pretty sure there is a trend in the series of data.



In order to detect a trend in a time series, make sure that you fully understand where the data comes from, how it is generated and what the characteristics of the time series are. Is it oscillating, is it a random walk process (the driver being the first-order difference) or is the driver at an even higher order (you could think of a randomly accelerating process)? Only if you understand the characteristics and the driver (the changing component) does it make sense to look for a trend and analyze the likelihood that you can rely on the trend for future development.

Sunday, June 21, 2015

Importing OpenStreetMap data or other large XML files with Integration Services (SSIS)

SQL Server 2005-2016

Last year I wrote about importing data from OpenStreetMap using PowerQuery. You can find the blog post here. In that post I loaded a relatively small area as an example. The reason was that the XML DOM parser in PowerQuery loads the full XML document into memory before being able to access it. If you need to process larger areas, the approach with the DOM parser won't work. For example, in a recent project I had to load all addresses in Berlin from an OSM file. The file is bzip-compressed down to 68 MB. Once downloaded it expands to about 940 MB (yes, XML is very talkative and compresses very well…). At first, I tried to load the file using PowerQuery and the script from my blog post. But since the DOM parser creates a memory-consuming object model, it failed at about 30% of the load after 20 minutes with an out-of-memory error (using 25 GB).

So, if you need to load a large XML file, you're well advised to use a different parsing approach, and this blog post is about using a lightweight XML parser to read the file from above. A SAX parser, for example, reads the file exactly once from beginning to end (forward only), firing events as it goes. In C#, the XmlReader class follows a similar approach. Neither parser allows you to search the XML file randomly (for example with XPath), to insert elements, or to go back and forth. What they do instead is read the file in a very memory-efficient way.
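As a quick illustration of the forward-only idea (sketched here in Python rather than the C# XmlReader used later in this post), `xml.etree.ElementTree.iterparse` fires an event whenever an element is completed; clearing each element after processing keeps memory consumption flat no matter how large the file is:

```python
import io
import xml.etree.ElementTree as ET

def count_nodes(xml_stream, tag="node"):
    """Stream through an XML document forward-only, counting elements
    with the given tag. Each element is cleared after use, so memory
    stays roughly constant regardless of file size."""
    count = 0
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # drop the element's subtree to free memory
    return count

sample = io.BytesIO(b'<osm><node id="1"/><node id="2"/><way id="3"/></osm>')
n = count_nodes(sample)
```

Like a SAX parser, this cannot search randomly or go back, but it never materializes the whole document in memory.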

Let’s have a look at a typical node element from the OpenStreetMap OSM file:

<node id="256922190" visible="true" version="7" changeset="29687333" timestamp="2015-03-23T20:13:41Z" user="atpl_pilot"
uid="881429" lat="52.5379749" lon="13.2888659">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:country" v="DE"/>
  <tag k="addr:housenumber" v="24"/>
  <tag k="addr:postcode" v="13627"/>
  <tag k="addr:street" v="Halemweg"/>
  <tag k="addr:suburb" v="Charlottenburg-Nord"/>
  <tag k="amenity" v="library"/>
  <tag k="layer" v="1"/>
  <tag k="name" v="Stadtteilbibliothek Halemweg"/>
  <tag k="ref:isil" v="none"/>
</node>

You can clearly see the geo coordinates (latitude and longitude) as well as the address (in the key/value pairs below the node). I wasn't interested in points of interest (POIs) here, but you can also see that the amenity key carries that information: in this case, the node is a library.
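To show the extraction logic in isolation, here is a small Python sketch (the real import below is a C# script component; the node is an abridged copy of the sample above): the child tag elements simply map to a key/value dictionary, with the coordinates taken from the node attributes.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample <node> from above, with its closing tag
node_xml = """
<node id="256922190" lat="52.5379749" lon="13.2888659">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:street" v="Halemweg"/>
  <tag k="addr:housenumber" v="24"/>
  <tag k="amenity" v="library"/>
</node>
"""

def parse_node(xml_text):
    """Turn an OSM <node> into (lat, lon, tags), where tags is the
    key/value dictionary built from the child <tag> elements."""
    node = ET.fromstring(xml_text)
    tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
    return float(node.get("lat")), float(node.get("lon")), tags

lat, lon, tags = parse_node(node_xml)
```

Filtering for POIs would then be a simple dictionary lookup, e.g. checking `tags.get("amenity")`.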

Since PowerQuery uses the DOM parser, and because I wanted the import process to run on a schedule, I used Integration Services (SSIS) to load the file. First I created a database table like this:

CREATE TABLE [dbo].[OSMAddress](
    [latitude] [real] NULL,
    [longitude] [real] NULL,
    [street] [nvarchar](255) NULL,
    [housenumber] [nvarchar](20) NULL,
    [postcode] [nvarchar](20) NULL,
    [city] [nvarchar](255) NULL,
    [country] [nvarchar](2) NULL
)

Next, I used a very simple data flow to populate the table:


The main logic is contained in the script component. This is the code for the CreateNewOutputRows method in the script component (please note that this code is without any error handling, for simplicity):

public override void CreateNewOutputRows()
{
    float latitude = -1;
    float longitude = -1;

    String city = null;
    String country = null;
    String street = null;
    String housenumber = null;
    String postcode = null;

    using (XmlReader reader = XmlReader.Create(Variables.OpenStreetmapFile))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name.Equals("node"))
                    {
                        if (reader.HasAttributes)
                        {
                            String lt = reader.GetAttribute("lat");
                            String lg = reader.GetAttribute("lon");

                            if (lt != null && lg != null)
                            {
                                // if the coordinates cannot be parsed, mark the node as invalid
                                if (!(float.TryParse(lt, out latitude) && float.TryParse(lg, out longitude)))
                                    latitude = longitude = -1;
                            }
                        }
                    }
                    else if (reader.Name.Equals("tag"))
                    {
                        if (latitude > -1 && longitude > -1)
                        {
                            String k = reader.GetAttribute("k");
                            String v = reader.GetAttribute("v");
                            if (k != null && v != null)
                            {
                                switch (k)
                                {
                                    case "addr:city":        city = v; break;
                                    case "addr:country":     country = v; break;
                                    case "addr:housenumber": housenumber = v; break;
                                    case "addr:postcode":    postcode = v; break;
                                    case "addr:street":      street = v; break;
                                }
                            }
                        }
                    }
                    break;

                case XmlNodeType.EndElement:
                    if (reader.Name.Equals("node"))
                    {
                        // only emit nodes that carry a usable address
                        if (latitude > -1 && longitude > -1 && street != null && city != null && housenumber != null)
                        {
                            Output0Buffer.AddRow();
                            Output0Buffer.city = city.Substring(0, Math.Min(city.Length, 255));
                            Output0Buffer.country = (country == null) ? "" : country.Substring(0, Math.Min(country.Length, 2));
                            Output0Buffer.housenumber = housenumber.Substring(0, Math.Min(housenumber.Length, 20));
                            Output0Buffer.latitude = latitude;
                            Output0Buffer.longitude = longitude;
                            Output0Buffer.postcode = (postcode == null) ? "" : postcode.Substring(0, Math.Min(postcode.Length, 20));
                            Output0Buffer.street = street.Substring(0, Math.Min(street.Length, 255));
                        }
                        // reset the state for the next node
                        latitude = longitude = -1;
                        street = postcode = housenumber = country = city = null;
                    }
                    break;
            }
        }
    }
}
The package took about 10 seconds to extract all of the addresses from the OSM file into the database table: a quite impressive result compared to the 20 unsuccessful minutes from above. This clearly shows the advantage of streaming XML parsers like SAX or XmlReader when it comes to reading larger files. If you go for larger areas, it's better to stream directly from the bzip2-compressed file instead of decompressing it first. For example, the OSM file for Germany (complete) is about 4 GB in size (bzip2-compressed) and expands to a single 48 GB XML file. I used SharpZipLib to decompress the file on the fly, which saves a lot of disk space and IO. Using this approach I created the following visualization showing the concentration of fuel stations in Germany:
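The on-the-fly decompression idea is not tied to SharpZipLib or C#. As a sketch (in Python, for illustration; the function name iter_osm_nodes is mine), the standard bz2 module can feed the compressed file straight into a streaming parser, so the expanded XML never has to exist on disk:

```python
import bz2
import xml.etree.ElementTree as ET

def iter_osm_nodes(path):
    """Stream node coordinates straight out of a .osm.bz2 file.
    bz2.open returns a file-like object that decompresses on the fly,
    and iterparse consumes it incrementally, so neither the expanded
    XML nor the DOM is ever held in memory."""
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "node":
                yield elem.get("lat"), elem.get("lon")
            elem.clear()  # keep memory flat on large files
```

For the 48 GB Germany extract, this kind of pipeline only ever holds a small decompression buffer plus the current element in memory.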


Of course, you could retrieve much more information from the OSM file than I did here. For example, you can read borders (city, state etc.), points of interest (restaurants, airports etc.) and sometimes even the location of individual trees. The file format is described here: