
Saturday, December 14, 2013

T-SQL 2012 Window Functions in ELT (part 1)

SQL Server 2012

One of the improvements in SQL Server 2012 is the set of advanced window functions, which allow us, for example, to refer to a record before or after the current record in a result set. Since databases are getting more and more powerful, a standard ETL process may not be the highest-performing approach when it comes to complex transformations involving a lot of data. In these cases, SQL may execute faster, especially on a massively parallel processing platform like the Parallel Data Warehouse.

To illustrate this point, let’s assume you have some sensor data which looks like this:

image

Our sensors report information about temperature and throughput at an interval of approximately 15 minutes. The sensors also count errors and report them when they occur.
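The screenshot of the data is not reproduced here, but judging from the columns used in the queries below, the underlying table could look roughly like this minimal sketch (the column names follow the queries; the data types are my assumption):

create table dbo.SensorData (
    SensorId    int          not null,  -- identifies the sensor
    [DateTime]  datetime2(0) not null,  -- time of the measurement (roughly every 15 minutes)
    Temperature decimal(9,2) null,      -- reported temperature
    Throughput  decimal(9,2) null,      -- reported throughput
    FailCount   int          not null   -- number of errors reported with this reading
);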

Now, imagine you are a data scientist and your intuition tells you that a certain problem may depend on the time since the last failure occurred. In order to prove this, you want to set up a mining model. What you need for that is the time difference to the latest failure. I don’t want to concentrate on the data mining part here, but rather on the time-based calculation.

You could imagine creating an ETL process (for example a script task) to compute this, but in this post I will show an approach that uses T-SQL. One obvious solution is to use cursors, but since they are slow and we have a large number of sensor rows, we need to find a different approach.

In the first step I’m adding a column that shows the timestamp of the previous measurement; this can be done using the lag function. In the following T-SQL query I also compute the number of seconds that have passed between the two measurements:

select
    sensorid
    , [datetime]
    , lag([DateTime],1) over (partition by sensorid order by [DateTime]) LastDateTime
    , datediff(
        second
        , lag([DateTime],1) over (partition by sensorid order by [DateTime])
        , [DateTime]
    ) SecondsPassed
from [dbo].[SensorData]

Here is what the result looks like:

image

The partition clause of the lag function is used here to make sure that we start over whenever a new sensor is reached. Since there is no prior record in that case, we get a NULL result.
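As a side note (not part of the original query): lag accepts an optional third argument that is returned instead of NULL when there is no preceding row. If you prefer a delta of 0 for the first reading of each sensor, you could fall back to the row’s own timestamp:

-- Hedged variant: use the current row's timestamp as the lag default,
-- so datediff returns 0 instead of NULL for the first reading per sensor.
select
    sensorid
    , [datetime]
    , datediff(
        second
        , lag([DateTime],1,[DateTime]) over (partition by sensorid order by [DateTime])
        , [DateTime]
    ) SecondsPassed
from [dbo].[SensorData]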

So you can easily create time deltas using this technique. But of course, this alone doesn’t solve our problem. What we need is a running total over the seconds. This is still not very complicated, but the running total has to go back to 0 whenever a failure happens. One approach to solving this is to first create a running total over the error count (the total number of errors per sensor so far). Since this number changes only when an error occurs, we can use this total error number as the partition for a window function calculating the running total of the seconds. Sounds confusing? Well, let’s do it step by step. The first step is to include the total number of errors per sensor. For simplicity I removed the other columns from my query, so this is how we could compute the total errors:

select
    sensorid
    , [datetime]       
    , datediff(
        second
        , lag([DateTime],1) over (partition by sensorid order by [DateTime])
        , [DateTime]
        ) SecondsPassed
    , [FailCount]
    , sum(FailCount) over (partition by sensorid order by [DateTime]) TotalFailCount
--into #Tmp
from [dbo].[SensorData]

 

image

This also shows the power of the window functions, as this is a very elegant way to create a running total. The partition clause (partition by sensorid) makes sure that the running total starts over for every new sensor. Now, in order to compute the number of seconds since the last error, we can use this running total as the partition (together with the sensorid). To show this, I’m going to write the result of the query above to a temporary table by removing the -- in front of “into #Tmp”.
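One remark on the running total before we continue: without an explicit frame clause, SQL Server uses RANGE UNBOUNDED PRECEDING by default for an ordered window aggregate. Spelling out a ROWS frame returns the same result here (as long as the timestamps are unique per sensor) and usually performs better, because it avoids the on-disk worktable that the RANGE frame needs. A hedged sketch:

-- Hedged sketch: the same running total with an explicit ROWS frame.
select
    sensorid
    , [datetime]
    , [FailCount]
    , sum(FailCount) over (
        partition by sensorid
        order by [DateTime]
        rows between unbounded preceding and current row
      ) TotalFailCount
from [dbo].[SensorData]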

I can now refer to the computed running total in the partition expression when calculating the running total over the SecondsPassed column:

select
    sensorid
    , [datetime]
    , [FailCount]
    , TotalFailCount
    , SecondsPassed
    , sum(SecondsPassed)
        over (
            partition by sensorid, TotalFailCount order by [DateTime]
        ) SecondsUntilLastFailure
from #Tmp

And here is the result:

image

So, whenever an error occurred, the running total starts all over again. Isn’t this cool? Well, you might say it would be better if SecondsUntilLastFailure started with 0 at the row where the error occurs. In order for this to happen, we have to subtract the first value in each partition. Sounds complicated? Well, we’re lucky, since there is a first_value function! So, here is the final code for our query:

select
    sensorid
    , [datetime]
    , [FailCount]
    , TotalFailCount
    , SecondsPassed
    , sum(SecondsPassed)
        over (
            partition by sensorid, TotalFailCount order by [DateTime]
        )
      -
      first_value(SecondsPassed)
        over (
            partition by sensorid, TotalFailCount order by [DateTime]
        ) SecondsUntilLastFailure
from #Tmp

And here is the result, now starting with a delta of zero seconds on an error:

image

As you can see, the column SecondsUntilLastFailure is computed correctly now. Whenever a failure occurs, the counter goes down to zero and then increments until the next error happens. Of course, you could also compute the next (instead of the prior) timestamp to avoid the need for the first_value function, but I wanted to show first_value here too; a sketch of that variant follows below.
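Here is a hedged sketch of how that lead-based variant could look (this is my interpretation of the remark, not code from the original post): compute the gap from each reading to the next one and sum only the rows before the current one within each failure partition, so the counter is already 0 at the failure row without first_value:

-- Hedged sketch, assuming the same #Tmp table as above
-- (sensorid, [DateTime], TotalFailCount).
;with Gaps as (
    select
        sensorid
        , [datetime]
        , TotalFailCount
        -- gap from the current reading to the next one of the same sensor
        , datediff(
            second
            , [DateTime]
            , lead([DateTime],1) over (partition by sensorid order by [DateTime])
        ) SecondsToNext
    from #Tmp
)
select
    sensorid
    , [datetime]
    , TotalFailCount
    -- sum the gaps of all earlier rows within the failure partition;
    -- the empty frame at the failure row itself becomes 0 via isnull
    , isnull(
        sum(SecondsToNext) over (
            partition by sensorid, TotalFailCount
            order by [DateTime]
            rows between unbounded preceding and 1 preceding
        )
      , 0) SecondsUntilLastFailure
from Gaps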

So, T-SQL windowing functions are a very powerful addition to SQL and can be used in quite a lot of scenarios. I’ll get back to an interesting scenario in my next post. And there is still a lot more to discover with functions like row_number, rank, dense_rank, ntile and the common aggregate functions like sum, max, min or avg.

Sunday, November 24, 2013

Querying SSIS package pipeline output from T-SQL

SQL Server 2012

This post is about the SSIS Data Streaming Destination. I must admit I wasn’t aware that it existed until I saw it in a presentation some weeks ago. Reason enough to write a short blog post about it. You can find a more detailed explanation here: http://office.microsoft.com/en-us/office365-sharepoint-online-enterprise-help/publish-ssis-packages-as-odata-feed-sources-HA104079177.aspx

The Data Streaming Destination comes with the Microsoft SQL Server 2012 Integration Services Data Feed Publishing Components, which can be downloaded using this link. To demonstrate the functionality, I created a simple SSIS package that reads data from the DimCustomer table of the AdventureWorks database. The output is then routed to the Data Streaming Destination.

image

The project containing the package is then deployed to the Integration Services Catalog on a SQL Server 2012 (project deployment model). Here is the path and package name that I used for this example:

image

Next we need to create a linked server to SSIS. In order to do so, two prerequisites have to be checked:

  • SSISOleDb Provider has to be installed
  • The “Allow inprocess” option of the SSISOleDb provider has to be enabled

 

You can check this in Management Studio using the Object Explorer by expanding the tree Server Objects/Linked Servers/Providers:

image

Double-clicking the SSISOLEDB entry brings up the configuration dialog:

image

Make sure the option “Allow inprocess” is enabled. You may use this dialog or execute the following script:

USE [master]
GO
EXEC master.dbo.sp_MSset_oledb_prop N'SSISOLEDB', N'AllowInProcess', 1
GO

 

We now create a linked server to SSIS:

USE [master]
GO
EXEC sp_addlinkedserver
@server = N'SSISFeedServer',
@srvproduct = N'Microsoft',
@provider = N'SSISOLEDB',
@datasrc = N'.'
GO

image

And finally we can query the package result using the following T-SQL command:

SELECT * FROM OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo.dtsx')

image

Within the second parameter of the OPENQUERY command you may also specify the following options (a combined example follows the list):

  • Parameters
    Syntax is …;Parameters="<par_name1>=<par_val1>;<par_name2>=<par_val2>";…
  • Timeout
    Syntax is …;Timeout=<Number of seconds>;…
  • Usage of 32bit Runtime
    Syntax is …;Use32BitRuntime=<true | false>;…
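Combined, this could look like the hedged example below; the parameter name MyParam is purely illustrative and depends on which parameters the package actually exposes:

-- Hedged example: "MyParam" is an illustrative parameter name, not from the demo package.
SELECT *
FROM OPENQUERY(SSISFeedServer,
    N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo.dtsx;Parameters="MyParam=42";Timeout=60;Use32BitRuntime=false')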

 

Ok, the customer table is quite small, so let’s try with a bigger table, for example the FactProductInventory table:

Query                                                          Direct SQL   Streaming
count(*)                                                       0 sec        8 sec
sum(UnitsIn)                                                   0 sec        8 sec
join to dimension, filter by dimension property, count(*) 1)   0 sec        8 sec
join to other fact table by date/product, group by date 2)     1 sec        103 sec (output started at 17 sec)

1) Query was: SELECT count(*) From OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo2.dtsx')  PIN
inner join DimProduct P on PIN.ProductKey = P.ProductKey
where P.StandardCost>10.0

2) Query was: SELECT PIN.DateKey, count(*) From OPENQUERY(SSISFeedServer,N'Folder=test;Project=SSISDataFeed;Package=DSD_Demo2.dtsx') PIN
inner join FactInternetSales FIS on PIN.DateKey=FIS.OrderDateKey and PIN.ProductKey=FIS.ProductKey
group by PIN.DateKey

 

Some final remarks:

  • Instead of waiting for the package to finish, rows are transferred to the SQL query as they are returned by the SSIS package pipeline.
  • Of course, this approach has some latency, as the SSIS package needs to be validated and prepared.
  • Not all column types are supported; for example, varbinary(max) is not supported.
  • Make sure to have only one streaming destination in your package. I ran into problems even when I had only two destinations with one of them disabled.

Saturday, November 2, 2013

Using Power Query to combine files from Hadoop into a single table (part 2)

Hadoop | HDInsight | Power Query Preview

 

As in my previous post, I’d like to import a bunch of csv files from a Hadoop cluster into a single Power Pivot table using Power Query. In the last post I used the basic built-in functionality for this task. In this post, I will create a custom function to perform the load. Using a custom function gives you more control over the import process; for example, this way you could easily handle different file formats.

The steps below involve a little bit of work on the code generated by Power Query. The Power Query coding is done using the Power Query formula language (formerly known as ‘M’). So this example is targeted more at power users who want to dig deeper into the functions of Power Query.

Also, in order to be able to work with the Power Query functions directly, you need to turn on the “Enable Advanced Query Editing” option in the options dialog of Power Query (click “Options” on the Power Query ribbon).

The basic idea follows these steps:

  1. Use Power Query to load a single file from hdfs
  2. Wrap the code from step 1 into a function
  3. Get a list of all files in the hdfs directory and invoke the function for each of the files

The basic idea for combining the tables, and the inspiration for this post, comes from a great presentation by Christoph Seck which I recently had the chance to see.

 

1. Use Power Query to load a single file from hdfs

We start exactly as in the post before. First we need to connect to the Hadoop file system and get a list of all the csv files:

image

In the last post we used the Binary column to retrieve the file contents directly. However, for this post I will use a different approach that allows me to control the loading of each file in detail.

For my example, let’s start with a single file load. We could use any file here, so I took the first one (20130916.csv). Loading the file contents can be done by clicking the link “Binary” in the desired row. This results in Power Query fetching only the contents of this single file:

image

Note that Power Query automatically detected the column types. This can be changed if needed, but for my purpose I leave it that way. The only thing I do is rename the columns by right-clicking the column header and choosing Rename. The result is shown in the next step.

2. Wrap the code from step 1 into a function

In order to read several files, we need to write a function that basically takes the filepath and the filename and returns a table with the file content. The code generated in step 1 is a good starting point.

image

What we will do here is to view the generated M code by clicking on the code icon (marked red in the screenshot):

let
    Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
    #"http://hdp01:50070/webhdfs/v1/user/demo/csv/_20130916.csv" = Source{[#"Folder Path"="http://hdp01:50070/webhdfs/v1/user/demo/csv/",Name="20130916.csv"]}[Content],
    ImportedCSV = Csv.Document(#"http://hdp01:50070/webhdfs/v1/user/demo/csv/_20130916.csv"),
    ChangedType = Table.TransformColumnTypes(ImportedCSV,{{"Column1", type date}, {"Column2", type number}, {"Column3", type number}}),
    RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}, {"Column2", "Product"}, {"Column3", "Sales"}})
in
    RenamedColumns

I’m not getting into too much detail about the Power Query language here. Basically, you see each transformation step as a named assignment in the section introduced by “let”. The output (final result) is defined by the in clause at the end. The parts of the query that need to be replaced by variables are the hard-coded folder path and file name. Don’t be confused by the generated variable names (like #"http://…"); you may replace them directly in the code editor with a more comprehensible name.

In order to turn this code into a function we wrap it in a let/in construct as shown below:

let
    Hdfs_FetchFileContent = (folderPath, fileName) =>
let
    Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
    fileContent = Source{[#"Folder Path"=folderPath,Name=fileName]}[Content],
    ImportedCSV = Csv.Document(fileContent),
    ChangedType = Table.TransformColumnTypes(ImportedCSV,{{"Column1", type date}, {"Column2", type number}, {"Column3", type number}}),
    RenamedColumns = Table.RenameColumns(ChangedType,{{"Column1", "Date"}, {"Column2", "Product"}, {"Column3", "Sales"}})
in
    RenamedColumns
in
    Hdfs_FetchFileContent

Click OK to submit the formula. The outer let/in pair is used to pass the parameters to the inner let/in pair, which does the actual work. After clicking OK, the code is parsed and the result should look like this:

image

At this point, it’s a good idea to test the new function by clicking “Invoke”. For my example I’m entering exactly the values I had in the formula before replacing them with variables:

image

The result shows exactly the values from above, but this time they are pulled through the function:

image

The next thing is to rename the query (right click on “Query1”) to “Hdfs_FetchSingleFile” (this is the name we will use to call the function later on) and to delete the invoke step of my test (otherwise a table is returned, not a function). Do not forget this step, because otherwise the function will not work:

image

 

3. Get a list of all files in the hdfs directory and invoke the function for each of the files

In order to get a list of all files in my csv folder, I follow exactly the procedure from step 1. We start with the list of all the csv files in the given directory. I’m only interested in the columns Name and Folder Path, so I delete all the other columns here (right click on the column header -> remove; multiple columns may be selected). This is what the result looks like:

image

Please note that we could still apply filters to this file list (for example by creation date, file name, file type or other file attributes).

I now right click on one of the column headers and choose Insert Column/Custom.

This brings up a query editor for my new column. I’m going to use the function created in step 2 here. Since I named the function “Hdfs_FetchSingleFile” it may be invoked as:

Hdfs_FetchSingleFile([Folder Path],[Name])

image

The columns may be inserted by double clicking the column name in the “Available Columns” box. You may wonder how you can show the results of a full file (a table) for each row. Here’s how Power Query handles it:

image

As you can see, we created a new column (“Custom”) which just shows “Table” as its entries. In order to retrieve the table data, click on the arrow icon in the column header:

image

This allows us to select the columns we’d like to retrieve from each table. Note that the correct column headers are shown in the selection.

image

I use all columns here, and by clicking OK, all files/tables are expanded properly:

image

How cool is that!? Being able to work with table structures within a table and expanding them to details makes processing very easy for us here.

The remaining steps are to delete the first two columns (we’re not interested in them), rename the columns accordingly and set the column data types correctly (date for Date, numeric for Product and Sales):

image

Power Query generates the script for us while we make all these changes. We may now rename the query and return the result to Excel by clicking Ok.
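Roughly, the generated script could look like the sketch below (the step names and the exact functions Power Query emits may differ; the WebHDFS address and column names are the ones used earlier in this post, and no separate rename step is needed because the function already renames the columns):

let
    Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
    // keep only the two columns needed to call the function
    KeptColumns = Table.SelectColumns(Source, {"Folder Path", "Name"}),
    // call Hdfs_FetchSingleFile for every file; the result is a nested table per row
    InsertedCustom = Table.AddColumn(KeptColumns, "Custom", each Hdfs_FetchSingleFile([Folder Path], [Name])),
    // expand the nested tables into regular columns
    Expanded = Table.ExpandTableColumn(InsertedCustom, "Custom", {"Date", "Product", "Sales"}),
    // drop the file list columns and fix the data types
    RemovedColumns = Table.RemoveColumns(Expanded, {"Folder Path", "Name"}),
    ChangedType = Table.TransformColumnTypes(RemovedColumns, {{"Date", type date}, {"Product", type number}, {"Sales", type number}})
in
    ChangedType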

And here is the final result in Excel.

image

Please note that you can refresh the table at any time by simply clicking Refresh in the context menu. This causes the full script from above to be executed again, so if new files are stored on the hdfs file system, the query picks them up as well.

The last step shown here is to load the data into my Power Pivot data model by clicking the “Load to data model” button/link in the query settings pane:

image

This takes the data into the Power Pivot data model, allowing us to create calculations or to blend it with other data sources:

image

 

Querying big data sources like Hadoop might not be a common use case. As I mentioned before, in general it would be better to combine/filter/transform the files on the Hadoop cluster before loading them into a Power Pivot model. But there may still be some cases where the approach above is useful. And of course we could filter down the file list (for example to read only files starting with a given date) or the file content (with the reader function from above).

So, Power Query and the embedded formula language offer very powerful tools for retrieving and transforming (reshaping) data. For most cases no or only little coding is required, and the step-by-step engine allows you to get back to any specific point in the transformation at any time. The use cases for Power Query are literally endless, and this tool is definitely a big step ahead in bringing business intelligence power to end users.

Sunday, October 20, 2013

Using Power Query to combine files from Hadoop into a single table (part 1)

Hadoop | HDInsight | Power Query Preview

For today’s post, I’m showing a scenario where I want to load some text files from my Hadoop cluster into a single Excel Power Pivot table. For this purpose I created some files with meaningless content and three columns (date, product, sales). We will see that Power Query offers built-in support for this use case from its GUI (the ability to combine multiple files), which is what this first part uses; in part 2 we will write a custom import function for each of the files, which gives us maximum flexibility to control the import process.

I stored those files on hdfs in the directory /user/demo/csv. The goal is to load all the files from this directory into a single table.

One more thing before we get started. This post is mainly intended to demonstrate the functionality of Power Query functions. Of course, if you have many small files on your Hadoop cluster, the better solution would be to assemble the files on the cluster itself. In my case I could launch the following Pig Latin script to create a combined file:

a = LOAD '/user/demo/csv/*.csv' using PigStorage(',') AS (Date:chararray, Product:int, Sales:float);
STORE a INTO '/user/demo/result.csv';

Here are the first output rows of this script (when using dump a):

image

But let’s try to achieve the same result using Power Query. This post will be separated in two parts:

part 1: Combine/import the files using Power BI basic functionality

part 2: Write a custom function to combine/import the csv files

 

For part 1 I will first connect to the Hadoop cluster. So, I open a blank Excel file, go to the Power Query ribbon and choose “From Other Sources/From Hadoop File (HDFS)”:

image

In the following dialog, you need to enter the address of your WebHDFS service:

image

Since Power Query gets back a recursive list of all files, I used a specific directory (/user/demo/csv) here in order to reduce the number of files that are returned.

image

Please note the arrows icon in the first column. This is used to get the content of all listed files (a binary concatenation of the files). Here is the result:

image

As you can see, there are some lines with errors. The reason is that we did a binary concatenation of the files here. As the last line of each of my csv files does not end with a line break, the last line of one file and the first line of the next are combined into a single line, causing the error. You can see the binary combine operation when checking the query steps: expand the pane on the right and choose the “CombinedBinaries” step.

image

So, a binary combination might not be the best idea here. We will need to use another approach. First, let’s remove all the steps starting with the binary combine:

image

The ability to delete steps, or to reorder them within the query, is one of the great features of Power Query: you can interactively design your query by trying certain operations, going back to every single step if necessary and trying a different approach. And this process is fully visual: clicking on each step shows the query result at that specific step.

So after deleting our first approach for the combined binaries, we’re back at the file list. For the next approach I’m only using the binary column, so I can remove all other columns here:

image

Then I insert a Custom column:

image

In the formula editor, we use this formula:

Csv.Document([Content])

image

Be aware that the function names are case sensitive. The full list of available functions can be found in the Power Query formula library specification. This and other resources for Power Query are available here:

http://office.microsoft.com/en-us/excel-help/learn-about-power-query-formulas-HA104003958.aspx

Quite impressive: the formula library specification already has 257 pages with lots of useful functions. And remember, the product is still in preview.

The result shows our new column, having “Table” as content:

image

Now, we don’t need the binary data anymore, so we can right-click / remove this column.

By clicking the arrow symbol in the column header, we can now expand the tables that resulted from our function call. Let’s include all columns:

image

After clicking “Ok”, we get the file contents, this time without the error lines.

image

The only task left is to rename the columns properly. In my case, I name them Date, Product and Sales (right click/rename). You should also check the data type of each column. For example, we could convert the first column to the data type “Date”:

image

The last step here is to rename the query (choose a comprehensible name) and to finish the query operation by clicking Ok in the query window.
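For reference, here is a hedged sketch of roughly what the generated script could look like at this point (the step names and details may differ from what Power Query actually produced; the WebHDFS address is the one shown in part 2 above):

let
    Source = Hdfs.Files("http://hdp01:50070/webhdfs/v1/user/demo/csv"),
    // keep only the file contents
    KeptColumns = Table.SelectColumns(Source, {"Content"}),
    // parse every file as a csv document into a nested table
    InsertedCustom = Table.AddColumn(KeptColumns, "Custom", each Csv.Document([Content])),
    // the binary column is no longer needed
    RemovedContent = Table.RemoveColumns(InsertedCustom, {"Content"}),
    // expand the nested tables; the csv files have no header row, so the columns are Column1..Column3
    Expanded = Table.ExpandTableColumn(RemovedContent, "Custom", {"Column1", "Column2", "Column3"}),
    RenamedColumns = Table.RenameColumns(Expanded, {{"Column1", "Date"}, {"Column2", "Product"}, {"Column3", "Sales"}}),
    ChangedType = Table.TransformColumnTypes(RenamedColumns, {{"Date", type date}, {"Product", type number}, {"Sales", type number}})
in
    ChangedType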

Back in Excel you may adjust the query behavior using the query settings pane:

image

For example, you could disable “Load to worksheet” and then click “Load to data model” in order to load the data directly into a Power Pivot model (bypassing Excel and therefore also bypassing Excel’s limit of roughly 1 million rows).

Querying big data sources like Hadoop might not be a common use case, but there may still be some cases where this is useful. And of course we could filter down the file list (for example to read only files starting with a given date) or the file content (with the reader function from above).

So, Power Query and the embedded M formula language offer very powerful tools for retrieving and transforming (reshaping) data. For most cases no or only little coding is required, and the step-by-step engine allows you to get back to any specific point in the transformation at any time. The use cases for Power Query are literally endless, and this tool is definitely a big step ahead in bringing business intelligence power to end users.

Thursday, October 10, 2013

OECD PIAAC Study 2013 data available for analysis

Data Analysis | R | Tableau

The recent PIAAC study by the OECD covers the skills of adults in 24 countries. You can find the study results together with background information here: http://www.oecd.org/site/piaac/. The study is very interesting as it is the first study targeting adult skills in these countries. Fortunately, the OECD has made the case data of the study available for download in the ‘public use files’ section. Here is the direct link to the files: http://vs-web-fs-1.oecd.org/piaac/puf-data

Data is available in SPSS and SAS format. I went for the SPSS format, downloaded all files (all countries except Australia) and processed the files using R. Here is the script I created for this purpose:

# OECD PIAAC Study 2013
# Interesting links:
# http://www.oecd.org/site/piaac/
# http://www.oecd.org/site/piaac/publicdataandanalysis.htm
#  SPSS files: http://vs-web-fs-1.oecd.org/piaac/puf-data

library(foreign)
library(stringr)
library(reshape2)

setwd("C:\\temp\\OECD_PIAAC_Study_2013_SPSS")
file_list <- list.files(path=".")
# read all SPSS files and append them into one data frame
for (file in file_list){
  if (!exists("piaac")){
    piaac <- read.spss(file, to.data.frame=T)
  } else {
    temp_dataset <- read.spss(file, to.data.frame=T)
    piaac <- rbind(piaac, temp_dataset)
    rm(temp_dataset)
  }
}

# only include this set of variables:
collist_meta<-c('CNTRYID_E','GENDER_R','AGEG5LFS','B_Q01a','YEARLYINCPR')

# also include all score columns (10 per domain)
collist_lit<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVLIT"]
collist_num<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVNUM"]
collist_psl<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVPSL"]
collist<-c(collist_meta,collist_lit,collist_num,collist_psl)
piaac2<-piaac[collist]

# combine score values to single values
p<-data.frame(piaac2[collist_meta], "Score_Lit"=rowMeans(piaac2[collist_lit]), "Score_Num"=rowMeans(piaac2[collist_num]), "Score_Psl"=rowMeans(piaac2[collist_psl]))

# remove incomplete cases
p<-melt(p, id=collist_meta)
p<-p[!is.na(p$value) & !is.na(p$CNTRYID_E),]

# write output to csv-file
write.table(p, col.names=T, row.names=F, file="c:\\temp\\piaac.csv", sep=",", dec=".", na="", qmethod="double")

Of course, we could include additional variables, but for the moment, let’s focus on the variables from the script above. A full list of all variables and details is available via the “International Codebook” link in the “public use files” section.

The values can then be imported to Tableau desktop for further analysis.
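Before switching to Tableau, a quick sanity check in R doesn’t hurt. Here is a hedged sketch (not part of the original script) that averages the reading score per country from the data frame p created above:

# Quick check of the exported data: mean reading score per country.
# Uses the melted data frame p from the script above.
p_lit <- p[p$variable == "Score_Lit", ]
head(aggregate(value ~ CNTRYID_E, data = p_lit, FUN = mean))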

image

The official charts look more dramatic (click here for an example) because the value axis does not start at zero. This is a common method of making results look more impressive or – as Prof. Hichert says – it adds a certain “lie factor” to the visualization, where it’s up to the creator of the chart to scale this factor to whatever effect is desired.

A result that somewhat surprised me was that the skills decrease with higher age, as shown below for the reading skill (Lit) in Germany:

image

Here is the map visualization (from red through grey to green, again for the reading skills) for the European region; Tableau automatically maps country names to geographic regions.

image

Another interesting chart shows the relationship between income (shown as quantile here) and test score (again for the reading skills, showing some selected countries):

image

And finally the comparison between highest education (according to the International Standard Classification of Education) and score (again shown for reading skills) for Germany:

image

I’m not going into interpretations here. There is a lot of material available on the OECD website. The study is also a great source for demographic data. So if you’d like to discover more, maybe the R script above can help you get started.

Sunday, October 6, 2013

Power Query: How to unpivot a dynamic number of columns

Power Query

A powerful feature of Power Query is the ability to unpivot a given data set, which means rotating data from columns to rows. This is useful for a lot of statistical data sets that you will find on the web, because those data sets usually have the time (for example the year) on the columns. In order to process the data, you need to unpivot it first.

For my example, in order to be able to modify the table, I’m using a simple Excel table which looks like this:

image

I’m using this table as the source for a new query in Power Query:

image

In order to transform the columns into rows, I select all columns with years and choose unpivot from the context menu:

image

Here is the result:

image

This was quite easy. For my example, I’m leaving the query like this (usually you would go ahead and rename the columns etc.).

Now, what happens if a new column is added to my source table? Let’s assume we’re adding a column for the year 2014.

By clicking Refresh in the context menu of the Excel table created by the query, the query is executed again. The result looks like this:

image

As you can see, the year 2014 is not included in the unpivot operation but became an additional column. This is clearly understandable if we look at the generated M script:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value")
in
    Unpivot

Since the Table.Unpivot function names each column that is to be included in the unpivot operation, the new column is not reflected by the query script.

In the analytical language R, this task would be easier, since the melt function (the corresponding function for unpivoting data in R, from the reshape2 package) takes the columns that are to be kept fixed when unpivoting. So, assuming the table above has been loaded in R as a data frame df, the unpivot operation would look like this:

df_unpivot<-melt(df, id=c("Product"))
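For illustration, here is a hedged, self-contained mini example of the melt call (the tiny data frame is made up for this purpose):

library(reshape2)
# toy table shaped like the Excel example: one fixed column, years on the columns
df <- data.frame(Product = c("A", "B"), "2008" = c(1, 2), "2009" = c(3, 4), check.names = FALSE)
melt(df, id = c("Product"))  # years end up in the columns "variable" and "value"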

But let’s get back to Power Query. In order to make our query aware of a different number of columns, we need to replace the constant column list with a variable one. Let’s do it step by step by modifying the script above.

First, we need a list of all the columns from our input table (Source):

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value"),
    Cols = Table.ColumnNames(Source)
in
    Cols

The modification is the newly added line Cols = Table.ColumnNames(Source); please also be aware of the comma that now ends the line starting with Unpivot =.

Also, quite interestingly, we still have the Unpivot transformation from above in the M script; we’re just not showing it, as we use our newly created variable Cols as the output (in clause). So the result is the list of column names from our table:

image

Usually the first columns are the ones to be kept fixed for the unpivot operation. So here the function List.Skip is useful: we just skip the first column in order to get all the columns with the year values:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value"),
    Cols = Table.ColumnNames(Source),
    ColsUnPivot=List.Skip(Cols, 1)
in
    ColsUnPivot

Again, the modification is the newly added ColsUnPivot line (and again, take care of the comma at the end of the preceding line). This gives us the desired list of columns:

image

Now, all we have to do is replace the constant list in the Unpivot function with the newly generated list ColsUnPivot. I’m moving the Unpivot operation to the end of the let block and also use it as the query output. Here’s the resulting script:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],   
    Cols = Table.ColumnNames(Source),
    ColsUnPivot=List.Skip(Cols, 1),
    Unpivot = Table.Unpivot(Source,ColsUnPivot,"Attribute","Value")
in
    Unpivot

Not surprisingly, the query returns exactly the same output as the query before. In order to see the difference, let’s add the year 2014 again to our source table and refresh the query from Power Query:

image

As you can see, the result now contains the year 2014 on the rows, not in an extra column. This is exactly what we were trying to achieve.

Update Feb. 13, 2014: In the current version of Power Query, there is a new function UnpivotOtherColumns available which works much like the melt approach in R and simplifies the process a lot here. I’ll still leave this post online for your information.
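With that function, only the columns that should stay fixed have to be named; everything else is unpivoted, so newly added year columns are picked up automatically. A hedged sketch for the example above (the full function name is Table.UnpivotOtherColumns):

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    // only the fixed column is listed; all other columns are unpivoted
    Unpivot = Table.UnpivotOtherColumns(Source, {"Product"}, "Attribute", "Value")
in
    Unpivot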