
Sunday, April 12, 2015

Importing data from SAP using the SAP BW Delta Queue and Theobald Software Xtract IS

SQL Server 2005-2014

 

Importing data from SAP is a common task in many BI projects. Since Integration Services (SSIS) doesn’t have out-of-the-box support for SAP as a data source, third-party components should be considered to simplify this task. The components we’re using, and which I am writing about here, are Xtract IS by Theobald Software. A fully functional trial version is available, and the examples in this blog post can be reproduced with it. Xtract IS offers many different options to access a SAP system, for example

  • reading tables (through RFC_READ_TABLE or a custom function included in Xtract IS that also supports compression) or SAP queries
  • calling remote functions (RFCs) and BAPIs
  • retrieving data from legacy reports
  • connecting to BW via the Open Hub Service (OHS)
  • connecting to SAP extractors (aka “delta queue” or “generic data source”)
  • much more…

For this short blog series I’d like to concentrate on the extractor interface, which is well suited for reading large tables from SAP. Especially the support for delta detection is important when loading tables or structures with potentially billions of rows. To avoid misunderstandings: Theobald’s delta queue component does not require SAP BW to be in place; it basically acts like BW and uses the same interface to push the data to SSIS in a very efficient way.

Since this post got a little longer, let’s start with an agenda:

 

Agenda

  • How do I get access to a SAP system?
  • Required customizations in SAP to use BW extractors (Delta Queue)
  • How to use existing (pre-defined) extractors in Integration Services
  • Create and use a simple custom extractor based on SAP query
  • Create and use a delta-enabled custom extractor based on SAP query

 

How do I get access to a SAP system?

Before we start, we of course need access to an existing SAP system. If you have a development system in house, the administrators might give you the required access. If not, there are some other options to get started and to build a test environment for connecting SAP to Integration Services.

Most of the examples I’m using in this blog post work fine with the SAP NetWeaver ABAP trial version (the so-called “mini-SAP”). This edition runs for 90 days and can easily be extended for whatever time is needed. Since the download link changes from time to time, it’s best to search for the download using your favorite search engine or to go to the SAP Community website (registration needed) and search for “SAP NetWeaver Trial Version ABAP”. There is an excellent blog series by Manfred Lutz about the ABAP trial version covering the installation process as well as many aspects of configuration. If you don’t want to install the trial version, I can also recommend having a look at the Consolut website, as they offer free-of-charge access to a mini-SAP. Until recently they also offered free access to a full SAP IDES system, which is somewhat like the AdventureWorks of Microsoft SQL Server (although IDES covers a lot more real-life business processes). Unfortunately they cannot offer this service any longer, so you might want to take a look at the SAP offerings instead: https://training.sap.com/shop/live-access.

 

Required customizations in SAP to use BW extractors (Delta Queue)

Theobald Software provides very detailed instructions on its website about how to configure and customize the SAP system. You can find the detailed step-by-step instructions here. For this post, I followed those steps exactly, naming all of the created objects as shown in the documentation.

 

How to use existing (pre-defined) extractors in Integration Services

In SSIS we first need to create a connection to the SAP system. With Theobald Software’s Xtract IS installed, you can add a new connection of type “XTRACT”:

image

SAP has some specific settings that need to be entered for making a connection (for example the client or the system number). The following dialog box is used to enter these settings which are usually provided by your administrator.

t2

In order to use the source component in a data flow, we first have to create a data flow (obviously) and then drag the Xtract DeltaQ component to this data flow:

image

When opening the component’s settings, you start by filling out the four boxes on the right. The search buttons help you to find the correct information.

image

If you’re using the Xtract IS components for the first time, or if you are not sure whether the customization of the SAP system is correct, it is a good idea to use the “Customization Check” link (marked with the red arrow in the screenshot). This function checks the security settings and customizations of the SAP system and returns a checklist:

image

If any of the tests fail, the message usually gives administrators enough information to fix the issue. The next task is to find the extractor that delivers the data to the data flow. If you’ve installed the SAP trial system, not much content is available, but as mentioned above, you could also connect to the Consolut IDES demo system, which contains a lot of pre-defined extractors. Now, how do you find the extractor for your data? The easiest approach is to search the web for “sap extractor for <<your topic>>”, for example “sap extractor for profit centers”. You could also go to http://help.sap.com directly or check the content available on the Consolut website.

For this demonstration, I’m using one of the profit center transaction extractors. The search function makes it easy to find the extractor by name or description. Please note that if you are using the mini-SAP installation, this extractor is not available, so you need to pick one of the extractors that exist there.

image

I’m using the first of the extractors in the list (0EC_PCA_1). After hitting ok, we’re back in the main configuration dialog now showing all the fields provided by the extractor:

image

If filters are defined for the columns in the data source/extractor you will notice an “Edit” link right beside the checkbox for including the column in the output. Filters can be defined following the typical way SAP handles filters (include/exclude, cascaded) and package variables may be used for filter values.

Some important settings can be found in the “Settings” dialog (link right below the dialog window header). Pay special attention to the “automatic data type conversion” setting:

image

With automatic type conversion enabled, Xtract IS derives the correct type for many different columns. You can see this effect in the data flow metadata:

Without automatic type conversion:

image

With automatic type conversion:

image

A detailed explanation of all the settings in this dialog can be found on the Theobald Software website, for example here. The settings dialog also allows you to name a package variable for passing the delta mode, as shown below:

image

Another great feature is the ability to generate documentation for all fields that are used in the extractor (button “generate documentation”). The output is in Excel format and looks like this:

image

 

Create and use a simple custom extractor based on SAP query

If the data you want to load is not covered by an existing extractor, you can still use the extractor functions by creating a custom extractor. You can do so by using a SAP view, a SAP query, or custom ABAP code. For this example, I’m using SAP Query to create a query based on the simple flight model data that is included in SAP systems (also in the SAP trial version) for educational purposes.

First we need to create an InfoSet for the query using transaction code SQ02:

t3

After clicking “Create” further information has to be provided:

t4

We reference one of the tables of the flight model here (SBOOK, the table for flight bookings). The other tables to join are defined in the next step. If you are unsure which tables are needed, a good starting point is to view the table definition using transaction SE11 (ABAP Dictionary). On the “Entry help/check” tab you can see which tables are used to validate entry values. This is shown in the following dialog:

t5

Double-clicking the entries in the domain or search help column also shows more information about the given structure. For this example, we will need at least the following additional tables:

  • SCARR (airline)
  • SFLIGHT (flight details)
  • SPFLI (flight schedule)

 

From the InfoSet editor you have to click the icon marked in red below in order to add more tables:

t6

For tables that are required more than once (in this case the airport table, which is needed for both the departure and the arrival airport), you may create aliases before adding the table to the diagram:

t7

t8

t9

SAP Query tries to detect the links between the tables automatically. In my case, however, this didn’t work well, so some of the links had to be deleted and recreated manually:

ta

tb

After adjusting the table links, we click on the “-> InfoSet” button to continue with the field selection. Here, the available fields of the query are defined. As a default we choose to start with the key fields here:

td

Next you can choose the fields you like by placing them in the right pane of the InfoSet editor (via drag and drop). Field groups are used mainly to structure fields a little bit:

te

Clicking on “Check” validates the structure. In my case, a wrong link had been created in the diagram:

tf

So here is how the full diagram looks after correcting it:

tg

Next thing to do is to activate the InfoSet:

th

We can now create a new extractor (or “generic data source”) using transaction code RSO2. For my example I used “ZTESTHB” as the name for the new extractor. The name will be needed later on to refer to the extractor:

ti

The entry box for the InfoSet becomes available once you select “Extraction by Query”. If our InfoSet is not in the list, this might be due to its visibility. In this case, go back to transaction SQ02 and click the “Role/User Group Assignment” button for the InfoSet.

tj

You can now assign the InfoSet to all required user groups. Back in RSO2, you may also need an application component, which can be created using transaction code RSA6:

tk

You may now save the extractor (aka generic data source). Depending on the fields you’ve chosen, you might get a warning about text fields:

tl

Especially with large tables (the main reason why we’re using the extractor in the first place), it is good advice to transfer only the keys, not the text fields. Text fields and other descriptive information can be loaded separately (for example using the Xtract Table component) and joined back to the keys later on, as sketched after the next screenshot. To remove text fields from the extractor, you can use this menu entry:

tm
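To illustrate the pattern: once the descriptive texts have been loaded separately, they can be re-attached to the keys with a simple join in the relational staging area. The following is just a sketch with made-up staging table names (stg_bookings for the DeltaQ output, stg_scarr for the airline texts loaded with the Table component):

-- Re-attach the airline name to the booking keys after both loads have finished.
SELECT b.CARRID,
       c.CARRNAME,   -- descriptive text loaded separately from SCARR
       b.CONNID,
       b.FLDATE,
       b.BOOKID
FROM dbo.stg_bookings AS b
LEFT JOIN dbo.stg_scarr AS c
       ON c.CARRID = b.CARRID;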

When saving the extractor you need to provide a package (or local object) and a workbench request (which you can also create from this dialog).

We’re now ready to use the new extractor from within SSIS. As shown above, we need an XTRACT connection to the SAP system and an Xtract IS DeltaQ component in the data flow. We can then configure this component:

tq

The data source/extractor should be easy to find using the search function:

tr

I simply select all fields here and activate the source:

ts

A good way to check if everything works fine is the “Preview” functionality:

tt

From within SAP you can use the gateway monitor (transaction code smgw) to see all active connections:

tu

In the data flow, the Xtract DeltaQ component has two outputs: one for the retrieved data, the other for the log. Writing the log data to a table can also be helpful for tracking errors:

tv

tw

 

Create and use a delta-enabled custom extractor based on SAP query

So far, the generated data source/extractor does not support any delta mechanism. In order to change this, we open the InfoSet again (using transaction code SQ02) and add the booking date to the output (if it is not already there).

tx

After re-generating the InfoSet we can edit the data source/extractor again (using transaction code RSO2). Clicking on “Generic Delta” opens the following dialog to define the delta mechanism:

ty

As the delta-relevant field we select the order_date field that we have just added to the field list. The safety interval results in an overlapping data delivery, so we can be sure not to miss any rows. However, this also means that each delta contains some rows from the previous batch, so we need to take care of the overlap in our data flow.
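One common way to handle this overlap is a keyed delete-and-insert in the relational staging area. Here is a minimal sketch, assuming the DeltaQ output is first landed in a staging table; the table names stg_bookings_delta and fact_bookings as well as the column list are made up, with the key following the SBOOK table key:

-- Remove rows that were already delivered by a previous, overlapping delta...
DELETE f
FROM dbo.fact_bookings AS f
INNER JOIN dbo.stg_bookings_delta AS d
    ON  d.CARRID = f.CARRID
    AND d.CONNID = f.CONNID
    AND d.FLDATE = f.FLDATE
    AND d.BOOKID = f.BOOKID;

-- ...then insert the complete delta batch.
INSERT INTO dbo.fact_bookings (CARRID, CONNID, FLDATE, BOOKID, ORDER_DATE, LOCCURAM)
SELECT CARRID, CONNID, FLDATE, BOOKID, ORDER_DATE, LOCCURAM
FROM dbo.stg_bookings_delta;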

We can now switch the update mode to “C – Delta Initialization” (and I will show how this is done using a package variable in a minute). The next run of the package fetches all rows from the table.

tz

Now switching the update mode to “D – Delta Update” gives no new rows on the next run:

tz1

The request log shows the detailed extraction:

tz2

In order to control the update mode from a package variable, we first need to create the variable and then enter its name in the “Update Mode Variable” box of the settings dialog. Also, don’t forget to change the update mode of the Xtract IS DeltaQ component to “V – SSIS variable” for the magic to happen:

tz4

At the end of this post I’d like to add some more tips on using the delta queue. First, you may test each extractor using the Extractor Checker (transaction code RSA3).

tz5

The delta queue maintenance (transaction code RSA7) may be used to check the current value of the delta field. If something fails because of missing rights, a trace (transaction code ST01) may help (look for failed authority checks), and transaction code SU53 may be used to display the most recent failed authority check. Finally, the Delta Queue Monitor (transaction code ODQMON) can be helpful for monitoring the state of delta-enabled extractors.

Monday, March 30, 2015

Un-pivoting plan or forecast data in Excel with PowerQuery

PowerBI | PowerQuery

PowerQuery enhances Microsoft Excel by providing a self-service data integration toolset to combine external data with local data. But PowerQuery may also be used just to do useful transformation tasks within Excel. In this example, we start with a simple Excel workbook containing planning data for 5 products and 12 periods and a regional allocation key for 4 regions.

Products:

image

Regions and allocation:

image

For further processing a table of the following format is required with the plan values being un-pivoted and the allocation key being applied for each product, month and region:

image
                           …
image

So, how can we do this in Excel? Before PowerQuery we might have used some Excel tricks with functions like INDEX or MOD, but this approach is not very flexible. What about adding an additional product or another region? What we really need is a simple transformation that allows us to refresh the resulting table whenever the underlying tables change.
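Just to make the target structure explicit before turning to PowerQuery: expressed in T-SQL, the reshaping would roughly look like the following sketch. The table and column names (PlanData with one column per month, Allocation with a share per region) are made up for illustration only.

-- Un-pivot the twelve month columns and fan each row out to all regions.
SELECT u.Product,
       u.[Month],
       a.Region,
       u.PlanValue * a.AllocationShare AS PlanValue
FROM PlanData
UNPIVOT (PlanValue FOR [Month] IN (M01, M02, M03, M04, M05, M06,
                                   M07, M08, M09, M10, M11, M12)) AS u
CROSS JOIN Allocation AS a;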

The following video shows my step-by-step approach to this task.

Unpivoting Excel Data in PowerQuery

You can also download the final Excel file here.

Tuesday, February 17, 2015

Querying Hadoop from SQL Server

SQL Server 2012 | SQL Server 2014

Microsoft’s Analytics Platform System (APS) offers built-in transparent access to Hadoop data sources through the Polybase technology. This includes bidirectional access not only to Hadoop but also to cloud services. The SMP SQL Server currently doesn’t contain Polybase, so access to Hadoop needs to be handled differently. Will Polybase be available in an upcoming SMP SQL Server? In the past we have seen some technology make its way from the PDW to the SMP SQL Server, for example the clustered columnstore index, the new cardinality estimation or the batch mode table operations. So let’s hope that Polybase makes it into the SMP SQL Server soon. Until then, one option is to use the Hortonworks ODBC driver and a linked server. To be honest, Polybase is a much more powerful technology since it uses cost-based cross-platform query optimization, which includes the ability to push down tasks to the Hadoop cluster when it makes sense. Also, Polybase doesn’t rely on Hive but accesses the files directly in parallel, thus giving great performance. Linked servers are less powerful but may still be useful in some cases.

So, here we go. First, you need to download the ODBC driver from the Hortonworks add-ons page: http://hortonworks.com/hdp/addons/.

Make sure you pick the right version (32 bit/64 bit) for your operating system. After the installation completes, we need to set up an ODBC connection. To do so, start the ODBC Data Source Administrator (Windows+S, then type ‘ODBC’). Again, make sure to start the correct version (32 bit/64 bit). The installer has already created a connection, but you still need to supply the connection properties. I created a new connection instead:

p1

I’m connecting to the Hortonworks Sandbox here (HDP 2.1; I had problems connecting to HDP 2.2 with the current version of the ODBC driver). Instead of the host name you can also enter the IP address (usually 127.0.0.1 for the sandbox), but in order to get other tools running (like the Redgate HDFS Explorer) I configured the sandbox virtual machine to run on a bridged network and put the bridged IP address of the sandbox (console command “ip addr”) into my local hosts file.

You should now click on Test to verify that the connection actually works:

p2

In SQL Server Management Studio we can now create a linked server connection to the Hadoop system using the following command:

EXEC master.dbo.sp_addlinkedserver
    @server     = N'Hadoop',  -- name used to refer to the linked server later
    @srvproduct = N'HIVE',
    @provider   = N'MSDASQL', -- Microsoft OLE DB provider for ODBC
    @datasrc    = N'HDP',     -- name of the ODBC data source created above
    @provstr    = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hue;'

Depending on your Hadoop security settings, you might need to provide a password in the provider string as well. The @server name is used to refer to the linked server later, while @datasrc names the ODBC connection (see “Data Source Name” in the configuration dialog of the connection above).

With the new linked server, we can now explore the Hive database in Management Studio:

p3

In order to run a query on, for example, the table “sample_07”, you can use one of the following commands:

select * from openquery (Hadoop, 'select * from Sample_07')

or

select * from [Hadoop].[HIVE].[default].[sample_07]

For both queries, “Hadoop” refers to the name of the linked server (@server parameter in the SQL statement from above).

If you get the following error message, this means that you are not allowed to query the table:

OLE DB provider "MSDASQL" for linked server "Hadoop" returned message "[Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: HiveAccessControlException Permission denied. Principal [name=hue, type=USER] does not have following privileges on Object [type=TABLE_OR_VIEW, name=default.sample_07] : [SELECT]'.".
Msg 7306, Level 16, State 2, Line 1
Cannot open the table ""HIVE"."default"."sample_08"" from OLE DB provider "MSDASQL" for linked server "Hadoop".

In this case, you should simply grant the user from your ODBC connection the SELECT privilege. To do so, run the following query in Hive:

grant select on sample_07 to user hue;

That’s it. You should now get the contents of the table in SQL Server:

p4

You might want to set the length of string columns manually, because Hive does not return the size of a string column (in Hive, the column type is simply “string”). The size returned by the query results from the advanced ODBC settings of our connection. I left everything at the defaults here, so this is how it looks:

image

So, the default string column length is 255 here. Let’s check and copy the data over to SQL Server:

select * into sample_07 from [Hadoop].[HIVE].[default].[sample_07]

The resulting table looks like this:

image

For more precise control over the column lengths, you should use the convert function, for example:

select
convert(nvarchar(50),[code]) [code],
convert(nvarchar(80),[description]) [description],
total_emp,
salary
from [Hadoop].[HIVE].[default].[sample_07]

Be careful with the remaining settings in the advanced options dialog. For example, checking “Use native query” means that you pass the query (openquery syntax) as-is to Hive. This can be intended in order to fully leverage specific features of Hive, but it can also lead to errors if you’re not familiar with the HiveQL query syntax. Also, to get better performance with larger tables, you might want to increase the “Rows fetched per block” option.

With HDP 2.2 you should also be able to write to a table (create a new table, grant all permissions and run an insert into), but I couldn’t get this to work on my HDP 2.1 machine.
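For reference, such a write-back attempt could look like the following sketch (untested on my side since it requires HDP 2.2; the table name sample_07_copy is made up, and the column types follow the sample_07 table):

-- In Hive: create the target table and grant the ODBC user full permissions on it.
create table sample_07_copy (code string, description string, total_emp int, salary int);
grant all on sample_07_copy to user hue;

-- In SQL Server: push the local copy back through the linked server.
insert into [Hadoop].[HIVE].[default].[sample_07_copy]
select [code], [description], total_emp, salary from sample_07;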

Summary

Until Polybase makes it into the SMP SQL Server product, Hadoop data may be queried from SQL Server using the ODBC driver and a linked server object. This could also be an option for Analysis Services to connect to Hadoop (by using SQL Server views on top of the linked server), since Analysis Services doesn’t support ODBC in multidimensional mode. However, Polybase on the APS gives much better performance because of its intelligent cross-platform query optimizer, and Polybase can also be used to write data to Hadoop, so I hope we’ll see this technology in the SMP SQL Server soon.