Search This Blog

Friday, November 6, 2015

Using R in SQL Server 2016 CTP 3 to train and query a predictive model

SQL Server 2016 CTP 3
One of the exciting features of SQL Server 2016 is the R integration based on Revolution R Enterprise. The online documentation contains detailed information about the installation steps but only a few examples about how to use this new feature.
The installation process is explained here:
  • Install Revolution R Open for R Enterprise
  • Install Revolution R Enterprise
  • Register components in SQL Server and configure SQL Server to allow external scripts
The online documentation also contains two examples of how to call an R function from within SQL Server. The first example just returns a data frame from R (in this case the pre-defined iris data set):
CREATE PROC get_iris_dataset
AS
BEGIN
EXEC sp_execute_external_script
@language = N'R'
, @script = N'iris_data <- iris;'
, @input_data_1 = N''
, @output_data_1_name = N'iris_data'
WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100)));
END

You can see the call to the external script here. I changed the text color of the R statement to blue so it’s easier to find within the SQL code. We have no input data for this simple case, but we return a table using the WITH RESULT SETS clause of the stored procedure.
Executing the procedure get_iris_dataset just shows the data from the R script in SQL Server. Here are the first rows of the result set.
I want to use this data as my training data, so I load the data into a SQL Server table. Usually, your training data would already be in a table so you wouldn’t need the procedure from above.
CREATE TABLE iris_data (
"Sepal.Length" float not null,
"Sepal.Width" float not null,
"Petal.Length" float not null,
"Petal.Width" float not null,
"Species" varchar(100)
);
INSERT INTO iris_data EXEC get_iris_dataset;

So, this copies the data from R into a SQL Server table. Next, I’d like to train an R model based on this data. Again, this code can be found in the documentation:
CREATE PROC generate_iris_model
AS
BEGIN
EXEC sp_execute_external_script
@language = N'R'
, @script = N'
library(e1071);
irismodel <- naiveBayes(iris_data[,1:4], iris_data[,5]);
trained_model <- data.frame(payload = as.raw(serialize(irismodel, connection=NULL)));
'
, @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
, @input_data_1_name = N'iris_data'
, @output_data_1_name = N'trained_model'
WITH RESULT SETS ((model varbinary(max)));
END

Again, the R code is in blue. If you haven’t installed the package e1071 in R, you have to do so before running this procedure: open the R console and run install.packages("e1071") as described here.
The interesting thing about this procedure is that we actually return the trained model as a varbinary object. In R the object class is “naiveBayes”, but here it is serialized to a raw data frame that is returned from the function. You could also save the model in R (using the ‘save’ command), but it’s still interesting to see how we can store the model in SQL Server. To do so, I create a simple table and persist the trained model there:
CREATE TABLE [dbo].[model](
[model] [varbinary](max) NULL
);
INSERT INTO model EXEC generate_iris_model;

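The serialize/store/deserialize round trip is the key pattern in this procedure. Just to illustrate the idea (this is not the author’s R code), the same pattern looks like this in Python, with pickle standing in for R’s serialize()/unserialize() and a made-up dictionary standing in for the naiveBayes model object:

```python
import pickle

# Hypothetical stand-in for a trained model; in the blog post this is an
# R "naiveBayes" object, serialized with R's serialize() instead of pickle.
model = {"classes": ["setosa", "versicolor", "virginica"], "priors": [1/3, 1/3, 1/3]}

# Serialize to raw bytes, the equivalent of serialize(..., connection=NULL).
# These bytes are what ends up in the varbinary(max) column.
payload = pickle.dumps(model)

# Later (possibly in another session): read the bytes back from the table
# and deserialize, the equivalent of unserialize(as.raw(...)).
restored = pickle.loads(payload)

print(restored["classes"])
```

The important point is that the stored bytes are opaque to the database; only the runtime that produced them (R here, pickle in the sketch) can turn them back into a usable model.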
So now we’re getting to the more interesting part of the game, which took me some time to figure out. If we want to query the existing model, we have to de-serialize it and use a prediction function. For simplicity I’m using the same data set (iris) again to see how well the model performs on known data. First we load the model into a variable (again varbinary) and then we pass the model to the external script together with the data on which we’d like to do our prediction. Here is the final code:
declare @model varbinary(MAX)
select @model=model from model

EXEC sp_execute_external_script
@language = N'R'
, @script = N'
model <- unserialize(as.raw(iris_model));
pred <- predict(model, iris_data);
result <- cbind(iris_data, pred);
'
, @input_data_1 = N'select "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species" from iris_data'
, @input_data_1_name = N'iris_data'
, @output_data_1_name = N'result'
, @params = N'@iris_model varbinary(MAX)'
, @iris_model= @model
WITH RESULT SETS (("Sepal.Length" float not null, "Sepal.Width" float not null
,"Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100), "SpeciesPredicted" varchar(100)))

As you can see, I’m passing the data to the function using the @input_data_1 and @input_data_1_name parameters. Then I’m passing the model using the @params parameter of the script.
Here is a subset of the resulting rows:
As you can see, the model predicted the correct species for most of the cases although there are still some cases that haven’t been predicted correctly.
Actually, the R script we executed here was quite simple, as shown here:
iris_data <- iris;
irismodel <-naiveBayes(iris_data[,1:4], iris_data[,5]);
result<-predict(irismodel, iris_data[,1:4])
cbind(iris_data, result)

However, the examples above show how you can train a model, persist the trained model in SQL Server and then use this model with whatever data you like. This makes it much easier to deploy R scripts into production than calling rscript.exe and using the RODBC interface to communicate with SQL Server.
Just a warning though: the examples above were run on SQL Server 2016 CTP 3, which is not the final product. Some of this functionality might still change before SQL Server 2016 is released to market.

Sunday, October 18, 2015

Trend in time series analysis


Time series analysis is widely used to forecast logistics, production or other business processes. Usually you want to understand whether there is a trend or a seasonality in the time series, since this supports forecasting and planning. However, there are different ways to understand trend. While trend often refers to historical changes in the data, for me trend is not something that happened in the past (that would be more like a historical drift); trend implies a prediction of future behavior. Or, in other words, a positive trend means that it is likely that the growth continues.

Let’s illustrate this with a simple example:


Hmm, this clearly looks like there is a trend. In order to build up confidence, let’s add a linear regression for this graph:


Quite impressive. We could also train an auto arima model and do some forecasting of our data:


We clearly see that even a sophisticated arima model found a trend (drift) and forecasted the chart accordingly.

Trend detected! Case solved?

Talking about a trend in data always implies that it is likely that the process will follow the trend (at least for the near future). Up till now we only did some basic chart analysis, but I wrote nothing about the source, driver or generator of the data. Without knowing details about the nature of our time series, we are on dangerous ground when it comes to forecasting. Please don’t be disappointed, but in my case I used a so-called random walk model to plot the graph above. In a random walk model, the value for the next period is calculated relative to the value of the current period; in my case, the difference between one point and the next is just a normally distributed random number. Random walk processes often look like they have trend or even seasonality. But in this example, the chart may go in either direction from here: it’s purely random. In other words, it’s dangerous to speak of a trend in this case. So whenever we look at trend detection, we have to understand the reason why the trend is likely to extend into the future.
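A random walk of this kind takes only a few lines to generate. Here is a minimal Python sketch of the process just described (the step distribution N(0, 1) is an assumption; the post only says "normally distributed"):

```python
import random

def random_walk(steps, seed=None):
    """Generate a random walk: each value is the previous value
    plus a normally distributed random number."""
    rnd = random.Random(seed)
    value = 0.0
    path = [value]
    for _ in range(steps):
        value += rnd.gauss(0, 1)   # next value = current value + N(0, 1) noise
        path.append(value)
    return path

walk = random_walk(100, seed=42)
print(walk[-1])
```

Plotting a few of these paths quickly shows how often pure randomness produces convincing-looking "trends".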

Ok, no trend! Case solved?

Obviously, since the chart was generated using a random number generator, we cannot really declare this a trend. But how likely is it that we get such a drift in our random walk process? To understand this, I plotted 1,000 random walk processes (each with 100 steps). The line from above is actually one of them. Here’s the result:


As you can see, most of them are concentrated in a relatively small window indicated by the red lines (which are just plotted based on a square root function). Some of them went up to about +30, some down to about –30, but those are the minority.

If we take a look at the distribution of final values (the values we see after 100 steps), the histogram looks like this:


This looks pretty much like a normal distribution with a mean of 0 and a standard deviation of 10, which is the square root of 100; this also explains the red lines in the chart above (one standard deviation). What can we do with this information? Well, first we could take a look at the cumulative distribution (and since I’m not particularly interested in the sign of the trend, I used absolute values):


How do we read this chart? It gives us the percentage of cases which end outside a given tolerance from zero. For example, only 5% of all cases (50 out of 1,000) ended at a value bigger than 20.5 or smaller than -20.5 (the .95 confidence marker), and only 1% of all cases (10 out of 1,000) ended at a distance of more than 24.5 from zero (the .99 confidence marker). Our chart from above actually ended at 29.7, which hits the 0.001 confidence level. In other words, I used the single worst outlier (1 out of 1,000) of my set of lines to plot the chart at the beginning of this post.
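The experiment with 1,000 walks is easy to reproduce. The following Python sketch (seed and thresholds chosen purely for illustration) confirms the square-root scaling of the endpoint spread:

```python
import random
import statistics

def final_value(steps, rnd):
    """Endpoint of one random walk with N(0, 1) steps."""
    v = 0.0
    for _ in range(steps):
        v += rnd.gauss(0, 1)
    return v

rnd = random.Random(1)
finals = [final_value(100, rnd) for _ in range(1000)]

# The standard deviation of the endpoints should be close to sqrt(100) = 10.
sd = statistics.pstdev(finals)

# Share of walks ending further than 20 from zero (roughly the 2-sigma band).
outside = sum(1 for v in finals if abs(v) > 20) / len(finals)
print(round(sd, 1), outside)
```

With 1,000 simulated walks, only a few percent of endpoints land outside the ±20 band, matching the chart’s reading.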

I’m … eh … confused …

So, what does this all lead up to? Trend or no trend? Well, since we know that I used a random number generator, frankly there’s no reason to believe that the trend continues and therefore it makes no sense to speak of a trend. However, if the chart was generated using real life data and we know that the underlying process follows a random walk model, it would be extremely unlikely that there is no drift or trend in the data. In our case, only 1 out of 1000 cases would behave in such an abnormal way otherwise. So, in this case I’d be pretty sure there is a trend in the series of data.



In order to detect a trend in a time series, make sure that you fully understand where the data comes from, how it is generated and what the characteristics of the time series are. Is it oscillating, is it a random walk process (randomness in the first derivative) or does the randomness enter at an even higher derivative (think of a randomly accelerating process)? Only if you understand the characteristics and the driver (the changing component) does it make sense to look for a trend and to analyze how far you can rely on that trend for future development.

Sunday, June 21, 2015

Importing OpenStreetMap data or other large XML files with Integration Services (SSIS)

SQL Server 2005-2016

Last year I wrote about importing data from OpenStreetMap using PowerQuery. You can find the blog post here. In that post I loaded a relatively small area as an example. The reason was that the XML DOM parser in PowerQuery loads the full XML document into memory before being able to access it. If you need to process larger areas, the approach with the DOM parser won’t work. For example, in a recent project I had to load all addresses in Berlin. The OSM file for Berlin is bzip-compressed down to 68 MB. Once downloaded it expands to about 940 MB (yes, XML is very talkative and compresses very well…). At first I tried to load the file using PowerQuery and the script from my blog post. But since the DOM parser creates a memory-consuming object model, it failed after about 20 minutes, at roughly 30% of the load, with an out-of-memory error (using 25 GB of memory).

So, if you need to load a large XML file, you’re well advised to use a different parsing approach. This blog post is therefore about using a lightweight XML parser for reading the file from above. For example, a SAX parser reads the file once from beginning to end (forward only), firing events as it goes. In C#, the XmlReader follows a similar approach. Neither parser allows you to search the XML file randomly (for example with XPath), to insert elements or to go back and forth. What they do instead is read the file in a very efficient way.

Let’s have a look at a typical node element from the OpenStreetMap OSM file:

<node id="256922190" visible="true" version="7" changeset="29687333" timestamp="2015-03-23T20:13:41Z" user="atpl_pilot"
uid="881429" lat="52.5379749" lon="13.2888659">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:country" v="DE"/>
  <tag k="addr:housenumber" v="24"/>
  <tag k="addr:postcode" v="13627"/>
  <tag k="addr:street" v="Halemweg"/>
  <tag k="addr:suburb" v="Charlottenburg-Nord"/>
  <tag k="amenity" v="library"/>
  <tag k="layer" v="1"/>
  <tag k="name" v="Stadtteilbibliothek Halemweg"/>
  <tag k="ref:isil" v="none"/>
</node>

You can clearly see the geo coordinates (latitude and longitude) as well as the address (in key/value pairs below the node). I wasn’t interested in points of interest (POIs), but you can also see that the amenity key carries that information; in this case, the node is a library.
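The forward-only, event-based reading pattern can be sketched in Python with xml.etree.ElementTree.iterparse (used here only to illustrate the idea; the actual package uses C#’s XmlReader, shown below). The tiny inline fragment is a made-up stand-in for the real OSM file:

```python
import io
import xml.etree.ElementTree as ET

# A tiny OSM-like fragment standing in for the 940 MB Berlin file.
osm = io.BytesIO(b"""<osm>
  <node id="1" lat="52.5379749" lon="13.2888659">
    <tag k="addr:city" v="Berlin"/>
    <tag k="addr:street" v="Halemweg"/>
  </node>
  <node id="2" lat="52.5" lon="13.4"/>
</osm>""")

addresses = []
# iterparse streams the document forward-only, firing an event per element,
# much like a SAX parser or C#'s XmlReader.
for event, elem in ET.iterparse(osm, events=("end",)):
    if elem.tag == "node":
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        if "addr:street" in tags:
            addresses.append((float(elem.get("lat")), float(elem.get("lon")),
                              tags.get("addr:city"), tags["addr:street"]))
        elem.clear()  # free the finished node so memory use stays flat

print(addresses)
```

The elem.clear() call is the crucial part: without it, the element tree would slowly accumulate all parsed nodes and we would be back to DOM-like memory consumption.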

Since PowerQuery uses the DOM parser and because I wanted the import process to run scheduled I used Integration Services (SSIS) to load the file. First I had to create a database table like this:

CREATE TABLE [dbo].[OSMAddress](
    [latitude] [real] NULL,
    [longitude] [real] NULL,
    [street] [nvarchar](255) NULL,
    [housenumber] [nvarchar](20) NULL,
    [postcode] [nvarchar](20) NULL,
    [city] [nvarchar](255) NULL,
    [country] [nvarchar](2) NULL
)

Next, I used a very simple data flow to populate the table:


The main logic is contained in the script component. This is the code for the CreateNewOutputRows method of the script component (please note that this code is without any error handling, for simplicity):

public override void CreateNewOutputRows()
{
    // Requires "using System.Xml;" in the script's namespace imports.
    float latitude = -1;
    float longitude = -1;

    String city = null;
    String country = null;
    String street = null;
    String housenumber = null;
    String postcode = null;

    using (XmlReader reader = XmlReader.Create(Variables.OpenStreetmapFile))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name.Equals("node"))
                    {
                        if (reader.HasAttributes)
                        {
                            String lt = reader.GetAttribute("lat");
                            String lg = reader.GetAttribute("lon");

                            if (lt != null && lg != null)
                            {
                                // Reset the coordinates if they cannot be parsed
                                if (!(float.TryParse(lt, out latitude) && float.TryParse(lg, out longitude)))
                                    latitude = longitude = -1;
                            }
                        }
                    }
                    else if (reader.Name.Equals("tag"))
                    {
                        if (latitude > -1 && longitude > -1)
                        {
                            String k = reader.GetAttribute("k");
                            String v = reader.GetAttribute("v");
                            if (k != null && v != null)
                            {
                                switch (k)
                                {
                                    case "addr:city":        city = v; break;
                                    case "addr:country":     country = v; break;
                                    case "addr:housenumber": housenumber = v; break;
                                    case "addr:postcode":    postcode = v; break;
                                    case "addr:street":      street = v; break;
                                }
                            }
                        }
                    }
                    break;

                case XmlNodeType.EndElement:
                    if (reader.Name.Equals("node"))
                    {
                        if (latitude > -1 && longitude > -1 && street != null && city != null && housenumber != null)
                        {
                            // Emit one output row per complete address
                            Output0Buffer.AddRow();
                            Output0Buffer.city = city.Substring(0, Math.Min(city.Length, 255));
                            Output0Buffer.country = (country == null) ? "" : country.Substring(0, Math.Min(country.Length, 2));
                            Output0Buffer.housenumber = housenumber.Substring(0, Math.Min(housenumber.Length, 20));
                            Output0Buffer.latitude = latitude;
                            Output0Buffer.longitude = longitude;
                            Output0Buffer.postcode = (postcode == null) ? "" : postcode.Substring(0, Math.Min(postcode.Length, 20));
                            Output0Buffer.street = street.Substring(0, Math.Min(street.Length, 255));
                        }
                        latitude = longitude = -1;
                        street = postcode = housenumber = country = city = null;
                    }
                    break;
            }
        }
    }
}
The package took about 10 seconds to extract all of the addresses from the OSM file into the database table: quite an impressive result compared to the 20 minutes without success from above. This clearly shows the advantage of streaming XML parsers like SAX or XmlReader when it comes to reading larger files. If you go for larger areas, it’s better to stream directly from the bzip2-compressed file instead of decompressing the file first. For example, the OSM file for Germany (complete) is about 4 GB in size (bzip2-compressed) and expands to a single 48 GB XML file. I used SharpZipLib to decompress the file on the fly, which saves a lot of disk space and IO. Using this approach I created the following visualization showing the concentration of fuel stations in Germany:
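Decompressing on the fly is straightforward in most languages. The post used SharpZipLib in C#; a Python equivalent with the standard bz2 module (run here on a tiny made-up payload instead of the 4 GB extract) looks like this:

```python
import bz2
import io

# Simulate a bzip2-compressed XML file (the real Germany extract is ~4 GB).
original = b"<osm><node id='1'/></osm>"
compressed = bz2.compress(original)

# BZ2File wraps the compressed stream and decompresses as you read,
# so the huge uncompressed XML never has to exist on disk.
with bz2.BZ2File(io.BytesIO(compressed)) as stream:
    data = stream.read()

print(data == original)
```

In practice you would pass such a decompressing stream straight into the streaming XML parser, so neither the full compressed nor the full decompressed document is ever held in memory.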


Of course you could retrieve much more information from the OSM file than I did here. For example, you can read borders (city, state etc.), points of interests (restaurants, airports etc.), sometimes even the location of trees. The file format is described here:

Friday, May 29, 2015

Polybase in SQL Server 2016 CTP 2

SQL Server 2016 CTP2

One of the exciting new features in SQL Server 2016 that is already available in the public preview version (CTP 2) is Polybase, the transparent access bridge from SQL to Hadoop. This functionality has been available in the Microsoft Analytics Platform System (APS) since version 2012 and has now made its way into SQL Server. You can find a lot more information in this blog post for example:
This blog post covers the first steps for using Polybase in SQL Server 2016 community preview 2 (CTP 2).
First of all, you need to install Polybase functionality during the installation process by checking the appropriate option:

This installs two services for Polybase as shown below:

Also, when opening a database in SQL Server Management Studio you’ll see some new objects:

External tables are your gateway to Hadoop. Later in this post we will create an external table for a Hadoop file, which can then be used mostly like a normal SQL Server table.
External data sources define the connection to the Hadoop cluster (more precisely to the Hadoop file system, HDFS), while external file formats are used to define the structure of the file on the Hadoop cluster.
Before we can start, we need to configure SQL Server for Hadoop connectivity. The following code is an example:

EXEC sp_configure 'hadoop connectivity', 5;
RECONFIGURE;

You can use the following configuration values:
  • 0: no Hadoop connectivity (default)
  • 3: Enable connectivity to Cloudera CDH for Linux
  • 4: Enable connectivity to Hortonworks Data Platform for Windows Server (HDP), HDInsight on Analytics Platform System, or HDInsight’s Microsoft Azure blob storage
  • 5: Enable connectivity to Hortonworks Data Platform (HDP) for Linux

In my case, I’m using a Hortonworks Sandbox on Linux, so I’m using the configuration value 5.
Also, the user that is used to access Hadoop has to be configured in the Polybase configuration file which can be found in the Binn\Polybase\Settings sub directory of the instance, in my case

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

The default user is pdw_user which I changed to ‘hue’ here:

To make all these changes active, I had to restart SQL Server and the two Polybase services.

Now, let’s create an external data source. You can create a code snippet from the context menu of the external data sources folder. Here is the code I’m using for my sandbox:

    LOCATION = 'hdfs://'

Next, we need to create an external file format. I’d like to access the tab delimited file sample_07 from Hive which is shipped with the sandbox. Therefore the external file format looks like this:

        FIELD_TERMINATOR = '\t',
        DATE_FORMAT = 'MM/dd/yyyy'

The date format is not needed for my case (there is no date column in this file) but just there to show the functionality.

The two new server objects now appear in the “External Resources” section of the database.

Next, we can create the external table itself using the defined data source and file format.

create external table sample_07 (
    code nvarchar(255),
    description nvarchar(255),
    total_emp int,
    salary nvarchar(255)
)
WITH (
    LOCATION = '/apps/hive/warehouse/sample_07',
    REJECT_TYPE = value,

The new external table appears in the “External Tables” folder of the database.

We can now query the table like any other table in the database, for example:

select * from Sample_07

Here is another example

select left(code,2) code2, sum(total_emp) total_emp, count(*) Count
from Sample_07
group by left(code,2)

And of course you can also write queries with joins between Polybase external tables and local SQL Server tables, which gives a transparent SQL query experience on Hadoop data.

Final notes
  • The functionality and syntax may change in the final product, so this only applies to the technical preview 2 (CTP2)
  • Currently, writing to a Hadoop file is not supported (there should be something like the “create external table as select” syntax known from the APS)
  • This post is just a brief overview. There are many more options (for example more file formats, compression etc.) that can be used. A good orientation can be found in the Microsoft APS support tools download which can be found here:

Friday, May 15, 2015

Incremental data loads in Microsoft Power Query for Excel

Power Query

Power Query is a great and flexible tool for getting and transforming data from different sources into Excel or Power Pivot. The standard procedure for Power Query is to read a full table and to replace the destination with the result of the current query. This is exactly what we need for most use cases of Power Query. However, if you’d like to add new data to existing data, you can still use Power Query, but you have to follow a slightly different approach.

For example, let’s say we want to create a list of blog posts while the RSS feed only delivers the most recent posts. In this case we would need to add the query results to the existing data.

Let’s start with my OLAP blog. The RSS feed delivers only the last 25 entries of my blog. In order to load this with Power Query, I’m using the load-from-web function:


The dialog asks for the URL (I’m using my blog’s mirror here at


If Power Query does not automatically detect this as an XML table you can insert the XML.Tables(…) function as shown below:


The items are stored in the last column (the entry table), which can be expanded by clicking the expand button right beside “entry”. For my example I’m only interested in the publication date and the title.


By expanding this table we get one row per blog post. In order to get the title, I’m expanding the last column (in my case title.1, since there is already a column title for the title of the blog) to its text value:


Finally, I changed the type of the first column to DateTime and I renamed the columns as shown below:


The full script generated by Power Query so far looks like this:

let
    Source = Xml.Tables(Web.Contents("")),
    #"Expand entry" = Table.ExpandTableColumn(Source, "entry", {"published", "title"}, {"published", "title.1"}),
    #"Expand title.1" = Table.ExpandTableColumn(#"Expand entry", "title.1", {"Element:Text"}, {"Element:Text"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expand title.1",{"id", "updated", "category", "title", "subtitle", "link", "author", "generator", ""}),
    #"Split Column by Position" = Table.SplitColumn(#"Removed Columns","published",Splitter.SplitTextByPositions({0, 19}, false),{"published.1", "published.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Position",{{"published.1", type datetime}, {"published.2", type text}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"published.2"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"published.1", "Date"}, {"Element:Text", "Title"}})
in
    #"Renamed Columns"

We can now load this result to the Workbook by clicking the “Close & Load” button. Now, here comes the interesting part. Up till now, the RSS feed only delivers the last 25 rows, so whenever there is a new post, the oldest post would be removed from the table. Let’s say we’d like to add new posts to the table.

To do so, I’m creating a new query named “Existing data” from the data we just loaded into the workbook:


Using the “Save & Load To …” function in the ribbon, we can choose to only create a connection (no data is actually moved here):


The workbook queries pane now looks like this:


We can now edit the first query (“BlogPosts”) again. In the home tab of the Power Query ribbon you can find the options for combining queries.


In my case, I simply decide to append the new data to the existing data. If you have a key to match the same rows, you could also do a merge based on the key column (for example if there are updated rows).


The result is that we now have the 25 rows from the existing table plus the 25 rows recently read from the blog. At this point in time the entries are identical, so we have 25 duplicates. I remove the duplicated rows here using the “Remove duplicates” function.
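The append-then-deduplicate logic that Power Query performs here can be expressed in a few lines of Python (the rows are made up; only the pattern matters):

```python
# Rows already in the workbook and the rows just fetched from the feed
# (dates/titles are invented for the illustration).
existing = [("2015-05-15", "Incremental data loads"), ("2015-05-29", "Polybase in CTP 2")]
fetched  = [("2015-05-29", "Polybase in CTP 2"),      ("2015-11-06", "Using R in SQL Server")]

# Append, then remove duplicates while keeping the first occurrence --
# the same effect as Power Query's Append followed by "Remove duplicates".
seen = set()
combined = []
for row in existing + fetched:
    if row not in seen:
        seen.add(row)
        combined.append(row)

print(len(combined))
```

Note that this relies on whole rows being identical; if a post can be edited, you would match on a key column and merge instead, as mentioned above.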


We can now save & load the query again. In order to demonstrate the effect, I’m pasting the remaining blog posts (which cannot be pulled by the RSS feed) into the Excel table:


While full loads are usually the most useful approach for loading data with Power Query, you can also build incremental load procedures by merging or appending the new data to the existing data.

Sunday, April 12, 2015

Importing data from SAP using the SAP BW Delta Queue and Theobald Software Xtract IS

SQL Server 2005-2014


Importing data from SAP is a common task in many BI projects. Since Integration Services (SSIS) doesn’t have out-of-the-box support for SAP as a data source, third-party components should be considered to simplify this task. The components we’re using, and which I am writing about here, are supplied by Theobald Software and are called Xtract IS. A fully functional trial version is available, and the examples in this blog post can be reproduced using this trial version. Xtract IS contains many different options to access a SAP system, for example:

  • Read tables (through rfc_read_table or a custom function included in Xtract IS that also supports compression) or SAP queries
  • Call remote functions (RFCs) and BAPIs
  • Retrieve data from legacy reports
  • Connect to BW via OHS
  • Connect to SAP extractors (aka “delta queue” or “generic data source”)
  • Much more…

For this short blog series I’d like to concentrate on the extractor interface, which is optimal for reading large tables from SAP. Especially the support for delta detection is important when loading tables or structures with potentially billions of rows of data. To avoid misunderstandings: Theobald’s delta queue component does not require SAP BW to be in place; it basically acts like BW and uses the same interface to push the data to SSIS in a very efficient way.

Since this post got a little longer, let’s start with an agenda:



  • How do I get access to a SAP system?
  • Required customizations in SAP to use BW extractors (Delta Queue)
  • How to use existing (pre-defined) extractors in Integration Services
  • Create and use a simple custom extractor based on SAP query
  • Create and use a delta-enabled custom extractor based on SAP query


How do I get access to a SAP system?

Before we start, of course we need some access to an existing SAP system. If you have a development system in house the administrators might give you the required access. If not, there are some other options to get started and to create a test environment for connecting SAP to Integration Services. 

Most of the examples I’m using in this blog post work fine with the SAP NetWeaver ABAP trial version (the so-called “mini-SAP”). This edition runs for 90 days and can easily be extended for whatever time is needed. Since the download link changes from time to time, it’s best to search for the download using your favorite search engine or to go to the SAP Community website (registration needed) and search for “SAP NetWeaver Trial Version ABAP”. There is an excellent blog series by Manfred Lutz about the ABAP trial version covering the installation process as well as many aspects of configuration. In addition, if you don’t want to install the trial version, I can also recommend having a look at the Consolut website, as they offer free-of-charge access to a mini-SAP. Until recently they also offered free-of-charge access to a full SAP IDES system, which is somewhat like the AdventureWorks for Microsoft SQL Server (however, IDES covers a lot more real-life business processes). Unfortunately they cannot offer this service any longer, so you might want to take a look at the SAP offerings:


Required customizations in SAP to use BW extractors (Delta Queue)

Theobald Software has very detailed instructions on its web site about how to use and configure the SAP system. You can find the detailed step-by-step instructions here. For this post, I exactly followed those steps naming all of the created objects exactly as shown in the documentation.


How to use existing (pre-defined) extractors in Integration Services

In SSIS we first need to create a connection to the SAP system. With Theobald Software’s Xtract IS installed, you can add a new connection of type “XTRACT”:


SAP has some specific settings that need to be entered for making a connection (for example the client or the system number). The following dialog box is used to enter these settings which are usually provided by your administrator.


In order to use the source component in a data flow, we first have to create a data flow (obviously) and then drag the Xtract DeltaQ component to this data flow:


When opening the component’s settings you first start filling out the right four boxes. The search buttons help you to find the correct information.


If you’re using the Xtract IS components for the first time or if you are not sure if the customization of the SAP system was correct, it is a good idea to use the “Customization Check” link (marked with the red arrow in the screenshot). This function checks the security settings and customizations of the SAP system and finally gets back with a checklist:


If any of the tests fail, the message usually gives administrators enough information to fix the issue. The next task is to find an extractor to deliver data to the data flow. If you’ve installed the SAP trial system, not much content is available. But as mentioned above, you could also connect to the Consolut IDES demo system containing a lot of pre-defined extractors. Now, how do you find the extractor for your data? The easiest way is to search the web for “sap extractor for <<your topic>>”, for example “sap extractor for profit centers”. You could also check the content available on the Consolut website.

For this demonstration, I’m using one of the profit center transaction extractors. The search function makes it easy to find the extractor by name or description. Please note that if you are using the Mini SAP installation, this extractor is not available so you need to go for one of the existing extractors there.


I’m using the first of the extractors in the list (0EC_PCA_1). After hitting ok, we’re back in the main configuration dialog now showing all the fields provided by the extractor:


If filters are defined for the columns in the data source/extractor, you will notice an “Edit” link right beside the checkbox that includes the column in the output. Filters can be defined in the typical SAP way (include/exclude, cascaded), and package variables may be used for filter values.

Some important settings can be found in the “Settings” dialog (link right below the dialog window header). Pay special attention to the “automatic data type conversion” setting:


With automatic type conversion enabled, Xtract IS derives the correct data type for many columns. You can see this effect in the data flow metadata:

(screenshots: data flow metadata without automatic type conversion on the left, with automatic type conversion on the right)

A detailed explanation of all the settings in this dialog can be found on the Theobald Software website. The settings dialog also allows you to name a package variable for passing the delta mode, as shown below:


Another great feature is the ability to generate documentation for all fields used in the extractor (button “Generate Documentation”). The output is in Excel format and looks like this:



Create and use a simple custom extractor based on SAP query

If the data you want to load is not covered by an existing extractor, you can still use the extractor functions by creating a custom extractor, based on an SAP view, an SAP Query, or custom ABAP code. For this example, I’m using SAP Query to create a query based on the simple flight model data that is included in SAP systems (also in the SAP trial version) for educational purposes.

First we need to create an InfoSet for the query using transaction code SQ02:


After clicking “Create” further information has to be provided:


We reference one of the tables of the flight model here (SBOOK, the table for flight bookings). The other tables to join are defined in the next step. If you are unsure which tables are needed, a good starting point is the table definition in transaction SE11 (ABAP Dictionary). The “Entry help/check” tab shows which tables are used to validate entry values, as in the following dialog:


Double-clicking the entries in the domain or search help column also shows more information about the given structure. For this example, we will need at least the following additional tables:

  • SCARR (airline)
  • SFLIGHT (flight details)
  • SPFLI (flight schedule)


From the info set editor you have to click the icon marked in red below in order to add more tables:


For tables that are required more than once (in this case the airport table, which is needed as both departure and arrival airport), you may create aliases before adding the table to the diagram:




SAP Query tries to detect the links between the tables automatically. In my case, however, this didn’t work well, so some of the links have to be deleted and recreated manually:



After adjusting the table links, we click the “-> InfoSet” button to continue with the field selection, where the available fields of the query are defined. As a default, we choose to start with the key fields:


Next, you can choose the fields you like by placing them in the right pane of the InfoSet editor (via drag and drop). Field groups are mainly used to give the fields some structure:


Clicking “Check” validates the structure. In my case, a wrong link had been created in the diagram:


Here is how the full diagram looks after correcting it:

The next thing to do is to activate the InfoSet:


We can now create a new extractor (or “generic data source”) using transaction code RSO2. For my example, I used “ZTESTHB” as the name of the new extractor. The name will be needed later on to refer to the extractor:


The entry box for the InfoSet becomes available once you hit the “Extraction by Query” button. If our InfoSet is not in the list, it might be a visibility issue. In that case, go back to transaction SQ02 and click the “Role/User Group Assignment” button for the InfoSet.


You can now assign the InfoSet to all required user groups. Back in RSO2, you may also need an application component, which can be created using transaction code RSA6:


You may now save the extractor (aka generic data source). Depending on the fields you’ve chosen, you might get a warning about text fields:


Especially with large tables (the main reason why we’re using the extractor), it’s good advice to transfer only the keys, not the text fields. Text fields and other descriptive information can be loaded separately (for example using the Xtract Table component). To remove text fields from the extractor, you can use this menu entry:
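To illustrate the idea: once the texts have been loaded separately, they can be joined back to the keys in SQL Server. This is only a sketch with made-up table names; the carrier key and name columns follow the SCARR table of the flight model:

```sql
-- Hypothetical staging tables: facts come from the extractor,
-- texts are loaded separately (e.g. via the Xtract Table component).
SELECT f.carrid,          -- key transferred by the extractor
       t.carrname,        -- descriptive text loaded separately
       f.fldate,
       f.loccuram
FROM dbo.flight_facts AS f
LEFT JOIN dbo.carrier_texts AS t
       ON t.carrid = f.carrid;
```

A LEFT JOIN is used here so that fact rows survive even when a text row has not been loaded yet.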


When saving the extractor you need to provide a package (or local object) and a workbench request (which you can also create from this dialog).

We’re now ready to use the new extractor from within SSIS. As shown above, we need an Xtract connection to the SAP system and an Xtract IS DeltaQ component in the data flow. We can then configure this component:


The data source/extractor should be easy to find using the search function:


I simply select all fields here and activate the source:


A good way to check if everything works fine is the “Preview” functionality:


From within SAP you can use the gateway monitor (transaction code smgw) to see all active connections:


In the data flow, the Xtract DeltaQ component has two outgoing connections: one for the retrieved data, the other for the log. Writing the log data to a table can also be helpful for tracking errors:
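If you decide to persist the log output, a simple table is enough. The columns below are purely an assumption for illustration, not the component’s actual log schema; check the log output’s metadata in the data flow and adjust the columns accordingly:

```sql
-- Hypothetical log table; align the columns with the actual
-- log output of the Xtract IS DeltaQ component.
CREATE TABLE dbo.xtract_deltaq_log (
    log_id    int IDENTITY(1,1) PRIMARY KEY,
    load_time datetime NOT NULL DEFAULT (getdate()),
    message   nvarchar(max) NULL
);
```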




Create and use a delta-enabled custom extractor based on SAP query

So far, the generated data source/extractor does not support any delta mechanism. To change this, we open the InfoSet again (using transaction code SQ02) and add the booking date to the output (if it is not already there).


After re-generating the InfoSet we can edit the data source/extractor again (using transaction code rso2). Clicking on “Generic Delta” opens the following dialog to define the delta mechanism:


For the field, we select the order_date field that we have just added to the field list. The safety interval results in an overlapping data delivery, so we can be sure not to miss any rows. However, this also means each delta contains some rows from the previous batch, so we need to take care of the duplicates in our data flow.
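One way to handle the overlap is to land each delta in a staging table and merge it into the target on the business key. The following is only a sketch with hypothetical table names; the key columns correspond to the primary key of SBOOK (CARRID, CONNID, FLDATE, BOOKID):

```sql
-- Sketch: overlapping delta rows update existing target rows
-- instead of producing duplicates.
MERGE dbo.flight_bookings AS tgt
USING dbo.flight_bookings_staging AS src
   ON  tgt.carrid = src.carrid
   AND tgt.connid = src.connid
   AND tgt.fldate = src.fldate
   AND tgt.bookid = src.bookid
WHEN MATCHED THEN
    UPDATE SET tgt.order_date = src.order_date,
               tgt.loccuram   = src.loccuram
WHEN NOT MATCHED BY TARGET THEN
    INSERT (carrid, connid, fldate, bookid, order_date, loccuram)
    VALUES (src.carrid, src.connid, src.fldate, src.bookid,
            src.order_date, src.loccuram);

TRUNCATE TABLE dbo.flight_bookings_staging;
```

The staging table is truncated after the merge so that the next delta starts from an empty table.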

We can now switch the update mode to “C – Delta Initialization” (I will show how this is done using a package variable in a minute). The next run of the package fetches all rows from the table.


Now switching the update mode to “D – Delta Update” gives no new rows on the next run:


The request log shows the detailed extraction:


To control the update mode from a package variable, we first need to create the variable and then enter its name in the “Update Mode Variable” box of the settings dialog. Also, don’t forget to change the update mode of the Xtract IS DeltaQ component to “V – SSIS variable” for the magic to happen:


At the end of this post, I’d like to add some more tips on using the delta queue. First, you may test each extractor using the Extractor Checker (transaction code RSA3).


Delta queue maintenance (transaction code RSA7) may be used to check the current value of the delta field. If something fails because of missing rights, a trace (transaction code ST01) may help (look for failed authority checks), and transaction code SU53 displays the most recent failed authority check. Finally, the Delta Queue Monitor (transaction code ODQMON) can be helpful for monitoring the state of delta-enabled extractors.