Sunday, November 24, 2013

Querying SSIS package pipeline output from T-SQL

SQL Server 2012

This post is about the SSIS Data Streaming Destination. I must admit I wasn’t aware that it existed until I saw it in a presentation some weeks ago. Reason enough to write a short blog post about it. You can find a more detailed explanation here:

The Data Streaming Destination comes with the Microsoft SQL Server 2012 Integration Services Data Feed Publishing Components, which can be downloaded using this link. To demonstrate the functionality, I created a simple SSIS package that reads data from the DimCustomer table of the AdventureWorks database. The output is then routed to the Data Streaming Destination.


The project containing the package is then deployed to the Integration Services Catalog on a SQL Server 2012 (project deployment model). Here is the path and package name that I used for this example:


Next we need to create a linked server to SSIS. In order to do so, two prerequisites have to be checked:

  • SSISOleDb Provider has to be installed
  • The “Allow inprocess” option of the SSISOleDb provider has to be enabled


You can check this in Management Studio using the Object Explorer by expanding the tree Server Objects/Linked Servers/Providers:


Double clicking the SSISOLEDB entry brings up the configuration dialog:


Make sure the option “Allow inprocess” is enabled. You may use this dialog or execute the following script:

USE [master]
EXEC master.dbo.sp_MSset_oledb_prop N'SSISOLEDB', N'AllowInProcess', 1


We now create a linked server to SSIS:

USE [master]
EXEC sp_addlinkedserver
@server = N'SSISFeedServer',
@srvproduct = N'Microsoft',
@provider = N'SSISOLEDB',
@datasrc = N'.'


And finally we can query the package result using the following T-SQL command:

SELECT * FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo.dtsx')
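Because OPENQUERY returns an ordinary rowset, the streamed package output can be joined, filtered or persisted like any other table. For example, a sketch (the temporary table name #CustomerFeed is made up for illustration):

-- Persist the streamed package output in a temporary table
SELECT *
INTO #CustomerFeed
FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo.dtsx')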


Within the 2nd parameter of the OPENQUERY command you may also specify

  • Parameters
    Syntax is …;Parameters="<par_name1>=<par_val1>;<par_name2>=<par_val2>";…
  • Timeout
    Syntax is …;Timeout=<Number of seconds>;…
  • Usage of 32bit Runtime
    Syntax is …;Use32BitRuntime=<true | false>;…
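Putting these options together, a call could look like the following sketch (the parameter name MaxRows is made up for illustration and would have to match a parameter actually defined in your package):

SELECT * FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo.dtsx;Parameters="MaxRows=1000";Timeout=60;Use32BitRuntime=false')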


Ok, the customer table is quite small, so let’s try with a bigger table, for example the FactProductInventory table:

  Query                                                            direct SQL   Streaming
  count(*)                                                         0 sec        8 sec
  sum(UnitsIn)                                                     0 sec        8 sec
  join to dimension and filter by dimension property, count(*) 1)  0 sec        8 sec
  join to other fact table by date/product, group by date 2)       1 sec        103 sec (output started at 17 sec)

1) Query was: SELECT count(*) FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo2.dtsx') PIN
inner join DimProduct P on PIN.ProductKey = P.ProductKey
where P.StandardCost>10.0

2) Query was: SELECT PIN.DateKey, count(*) FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo2.dtsx') PIN
inner join FactInternetSales FIS on PIN.DateKey=FIS.OrderDateKey and PIN.ProductKey=FIS.ProductKey
group by PIN.DateKey


Some final remarks:

  • Instead of waiting for the package to end, rows are transferred to the SQL query as they are returned from the SSIS package pipeline.
  • Of course, this approach has some latency, as the SSIS package needs to be validated and prepared.
  • Not all column types are supported; for example, varbinary(max) is not supported.
  • Make sure to have only one streaming destination in your package. I ran into problems even when the package contained two destinations with one of them disabled.

Saturday, November 2, 2013

Using Power Query to combine files from Hadoop into a single table (part 2)

Hadoop | HDInsight | Power Query Preview


As in my previous post, I’d like to import a bunch of csv files from a Hadoop cluster into a single Power Pivot table using Power Query. In the last post I used the basic built-in functionality for this task. In this post, I will create a custom function to perform the load. Using a custom function gives you more control over the import process; for example, you could easily import different file formats this way.

The steps below involve a little bit of work on the code generated by Power Query. The coding is done in the Power Query formula language (formerly known as ‘M’), so this example is targeted more at power users who want to dig deeper into the functions of Power Query.

Also, in order to be able to work with the Power Query functions directly, you need to turn on the “Enable Advanced Query Editing” option in the options dialog of Power Query (click “Options” on the Power Query ribbon).

The basic idea follows these steps:

  1. Use Power Query to load a single file from hdfs
  2. Wrap the code from step 1 into a function
  3. Get a list of all files in the hdfs directory and invoke the function for each of the files

The basic idea for joining the tables and the inspiration comes from a great presentation by Christoph Seck which I had the chance to see recently.


1. Use Power Query to load a single file from hdfs

We start exactly as in the post before. First we need to connect to the Hadoop file system and get a list of all the csv files:


In the last post we took the Binary column to retrieve the file contents directly. For this post, however, I will use a different approach that allows me to control the loading of each file in detail.

For my example, let’s start with a single file load. We could use any file here, so I took the first one (20130916.csv). Loading the file contents can be done by clicking the link “Binary” in the desired row. This results in Power Query fetching only the contents of this single file:


Note that Power Query automatically detected the column types. This can be changed if needed, but for my purpose I leave it that way. The only thing I do is rename the columns by right clicking the column header and choosing Rename. The result is shown in the next step.

2. Wrap the code from step 1 into a function

In order to read several files, we need to write a function that basically takes the filepath and the filename and returns a table with the file content. The code generated in step 1 is a good starting point.


What we will do here is to view the generated M code by clicking on the code icon (marked red in the screenshot):

    let
        Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
        #"http://hdp01:50070/webhdfs/v1/user/demo/csv/_20130916.csv" = Source{[#"Folder Path"="http://hdp01:50070/webhdfs/v1/user/demo/csv/",Name="20130916.csv"]}[Content],
        ImportedCSV = Csv.Document(#"http://hdp01:50070/webhdfs/v1/user/demo/csv/_20130916.csv"),
        ChangedType = Table.TransformColumnTypes(ImportedCSV,{{"Column1", type date}, {"Column2", type number}, {"Column3", type number}}),
        RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}, {"Column2", "Product"}, {"Column3", "Sales"}})
    in
        RenamedColumns

I’m not getting into too much details about the Power Query language here. Basically, you see each transformation step as a named assignment in the section that is preceded by “let”. The output (final result) is defined by the in clause at the end. I marked those parts of the query in red that need to be replaced by variables. Don’t be confused by the generated variable names (like #"http://…"). You may replace them directly in the code editor by a comprehensive name.

In order to turn this code into a function we wrap it in a let/in construct as shown below:

    let
        Hdfs_FetchFileContent = (folderPath, fileName) =>
        let
            Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
            fileContent = Source{[#"Folder Path"=folderPath,Name=fileName]}[Content],
            ImportedCSV = Csv.Document(fileContent),
            ChangedType = Table.TransformColumnTypes(ImportedCSV,{{"Column1", type date}, {"Column2", type number}, {"Column3", type number}}),
            RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}, {"Column2", "Product"}, {"Column3", "Sales"}})
        in
            RenamedColumns
    in
        Hdfs_FetchFileContent

Click on Ok to submit the formula. The outer let/in pair is used to send the variables to the inner let/in pair which does the work. By clicking ok, the code is parsed and the result should look like this:


At this point, it’s a good idea to test the new function by clicking “Invoke”. For my example I’m entering exactly the values I had in the formula before replacing them with variables:


The result shows exactly the values from above but this time being pulled by a function:


The next step is to rename the query (right click on “Query1”) to “Hdfs_FetchSingleFile” (this is the name we will use to call the function later on) and to delete the Invoke step of my test (otherwise a table is returned, not a function). Do not forget this step; otherwise the function will not work:



3. Get a list of all files in the hdfs directory and invoke the function for each of the files

In order to get a list of all files in my csv folder, I follow exactly the procedure from step 1. We start with the list of all the csv files in the given directory. I’m only interested in the columns Name and Folder Path, so I delete all the other columns here (right click on the column header->Remove; multiple columns may be selected). This is what the result looks like:


Please note that we could still apply filters on this file list (for example by creation date, file name, file type or other attributes of the file).

I now right click on one of the column headers and choose Insert Column/Custom.

This brings up a query editor for my new column. I’m going to use the function created in step 2 here. Since I named the function “Hdfs_FetchSingleFile” it may be invoked as:

Hdfs_FetchSingleFile([Folder Path],[Name])
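Behind the scenes, Power Query records this custom column as a Table.AddColumn step. A sketch of the generated code (the name of the preceding step, RemovedOtherColumns, is an assumption; your script will reference whatever the previous step is actually called):

    AddedCustom = Table.AddColumn(RemovedOtherColumns, "Custom", each Hdfs_FetchSingleFile([Folder Path],[Name]))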


The columns may be inserted by double clicking on the column name in the “Available Columns” box. You may wonder how you can show the results of a full file (table) for each row. Here’s how Power Query handles it:


As you can see, we created a new column (“Custom”) which just shows “Table” as its entries. In order to retrieve the table data, click on the arrows in the column header:


This allows us to select the columns we like to retrieve from each table. Note that the correct column headers are shown in the selection.


I use all columns here and now, by clicking Ok, all files/tables are expanded properly:
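In the generated script, this expand operation corresponds to a Table.ExpandTableColumn step, roughly like the following sketch (step names such as AddedCustom and ExpandedCustom are assumptions; the column list matches the names set by the function):

    ExpandedCustom = Table.ExpandTableColumn(AddedCustom, "Custom", {"Date", "Product", "Sales"})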


How cool is that!? Being able to work with table structures within a table and expanding them to details makes processing very easy for us here.

The remaining steps are to delete the first two columns (we’re not interested in them), rename the remaining columns accordingly and set the column data types correctly (date for Date, numeric for Product and Sales):
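Expressed in M, these cleanup steps correspond to something like the following sketch (step names and the “Custom.” column prefixes are assumptions based on Power Query’s default naming; the generated script may differ):

    RemovedColumns = Table.RemoveColumns(ExpandedCustom,{"Folder Path", "Name"}),
    RenamedColumns = Table.RenameColumns(RemovedColumns,{{"Custom.Date", "Date"}, {"Custom.Product", "Product"}, {"Custom.Sales", "Sales"}}),
    ChangedType = Table.TransformColumnTypes(RenamedColumns,{{"Date", type date}, {"Product", type number}, {"Sales", type number}})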


Power Query generates the script for us while we do all those changes. We may now rename the script and return the result to Excel by clicking Ok.

And here is the final result in Excel.


Please note that you can refresh the table at any time by simply clicking Refresh in the context menu.  This causes the full script from above to be executed again, so if new files are stored on the hdfs file system, the query brings them all back here.

Last step shown here is to load the data in my Power Pivot data model by clicking the “Load to data model” button/link in the query settings pane:


This takes the data in the Power Pivot data model allowing us to create calculations or to blend this with other data sources:



Querying big data sources like Hadoop might not be a common use case. As I mentioned before, in general it would be better to combine/filter/transform the files on the Hadoop cluster before loading them into a Power Pivot model. But there may still be some cases where the approach above is useful. And of course we could filter down the file list (for example to read only files starting with a given date) or the file content (with the reader function from above).

So, Power Query and the embedded formula language offer very powerful tools for retrieving and transforming (reshaping) data. In most cases no or only little coding is required, and the step-by-step engine allows you to get back to a specific point in the transformation at any time. The use cases for Power Query are almost endless, and this tool is definitely a big step ahead for bringing Business Intelligence power to end users.