
Sunday, October 20, 2013

Using Power Query to combine files from Hadoop into a single table (part 1)

Hadoop | HDInsight | Power Query Preview

For today’s post, I’m showing a scenario where I want to load some text files from my Hadoop cluster into a single Excel Power Pivot table. For this purpose I created some files with meaningless content and three columns (date, product, sales). We will see that Power Query offers built-in support for this use case from its GUI (the ability to combine multiple files), but in part 2 we will also use a custom import function for each of the files. This gives us maximum flexibility to control the import process.

I stored those files on HDFS in the directory /user/demo/csv. The goal is to load all files from this directory into a single table.

One more thing before we get started: this post is mainly intended to demonstrate the functionality of Power Query functions. Of course, if you have many small files on your Hadoop cluster, the better solution would be to combine the files on the cluster itself. In my case I could launch the following Pig Latin script to create a single combined file:

a = LOAD '/user/demo/csv/*.csv' using PigStorage(',') AS (Date:chararray, Product:int, Sales:float);
STORE a INTO '/user/demo/result.csv';

Here are the first output rows of this script (when using dump a):

[screenshot]

But let’s try to achieve the same result using Power Query. This post is split into two parts:

part 1: Combine/import the files using Power Query’s built-in functionality

part 2: Write a custom function to combine/import the csv files


For part 1 I will first connect to the Hadoop cluster. So, I open a blank Excel file, go to the Power Query ribbon and choose “From Other Sources/From Hadoop File (HDFS)”:

[screenshot]

In the following dialog, you need to enter the address of your WebHDFS service:

[screenshot]

Since Power Query returns a recursive listing of all files, I used a specific directory (/user/demo/csv) here in order to reduce the number of files that are returned.
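
For reference, the generated source step behind this connection looks roughly like the following sketch (the cluster address is only a placeholder for your own WebHDFS endpoint):

    // placeholder address: replace with your own cluster/WebHDFS endpoint
    Source = Hdfs.Files("hdfs://myhadoopcluster/user/demo/csv")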

[screenshot]

Please note the arrows icon in the first column. It is used to fetch the content of all listed files (a binary concatenation of the files). Here is the result:

[screenshot]

As you can see, there are some lines with errors. The reason is that we did a binary concatenation of the files here: since the last line of each of my csv files does not end with a line break, the last line of one file and the first line of the next file are merged into a single line, causing the error. You can see the binary combine operation when checking the query steps. Therefore, expand the pane on the right and choose the “CombinedBinaries” step.

[screenshot]
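
In M, this combine step essentially boils down to a Binary.Combine call over the Content column, roughly like this sketch (the step names are illustrative and may differ in the generated script):

    // concatenate all file contents into one binary, then parse it as csv
    CombinedBinaries = Binary.Combine(Source[Content]),
    ImportedCsv = Csv.Document(CombinedBinaries)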

So, a binary combination might not be the best idea here. We will need to use another approach. First, let’s remove all the steps starting with the binary combine:

[screenshot]

The ability to delete steps, or to move them around within the query, is one of the great features of Power Query: you can interactively design your query by trying certain operations, going back to any single step if necessary, and trying a different approach. And this process is fully visual: clicking on each step shows the query result at that specific step.

So after deleting our first approach for the combined binaries, we’re back at the file list. For the next approach I’m only using the binary column, so I can remove all other columns here:

[screenshot]

Then I insert a Custom column:

[screenshot]

In the formula editor, we use this formula:

Csv.Document([Content])

[screenshot]
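
Behind the scenes, this dialog produces a Table.AddColumn step, roughly like the sketch below (the previous step name RemovedOtherColumns and the column name Custom are illustrative defaults, not fixed names):

    // parse each file's binary content into a table, one table per row
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Custom", each Csv.Document([Content]))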

Be aware that the function names are case-sensitive. The full list of available functions is documented in the Power Query formula library specification. This and other resources for Power Query can be found here:

http://office.microsoft.com/en-us/excel-help/learn-about-power-query-formulas-HA104003958.aspx

Quite impressive: the formula library specification already has 257 pages with lots of useful functions. And remember, we’re still in the preview phase of this product.

The result shows our new column, with “Table” as its content:

[screenshot]

Now, we don’t need the binary data anymore, so we can right-click / remove this column.

By clicking the arrow symbol in the column header, we can now expand the tables that resulted from our function call. Let’s include all columns:

[screenshot]

After clicking “Ok”, we get the file contents, this time without the error line.

[screenshot]
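
In M, this expansion corresponds to a Table.ExpandTableColumn call, along these lines (a sketch; Column1 to Column3 are the default names Csv.Document assigns when the files have no header row):

    // expand the nested tables into regular columns
    ExpandedCustom = Table.ExpandTableColumn(InsertedCustom, "Custom",
        {"Column1", "Column2", "Column3"},
        {"Custom.Column1", "Custom.Column2", "Custom.Column3"})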

The only task left is to rename the columns properly. In my case, I name them Date, Product and Sales (right click/rename). You should also check the data type of each column. For example, we could convert the first column to data type “Date”:

[screenshot]
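
In M, the renaming and the type conversion translate into steps like these (again a sketch, continuing the illustrative names from above):

    // assign proper column names and convert the first column to date
    RenamedColumns = Table.RenameColumns(ExpandedCustom,
        {{"Custom.Column1", "Date"}, {"Custom.Column2", "Product"}, {"Custom.Column3", "Sales"}}),
    ChangedType = Table.TransformColumnTypes(RenamedColumns, {{"Date", type date}})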

The last step here is to rename the query (choose a descriptive name) and to finish the query operation by clicking Ok in the query window.

Back in Excel you may adjust the query behavior using the query settings pane:

[screenshot]

For example, you could disable “Load to worksheet” and then click “Load to data model” in order to load the data directly into a Power Pivot model (bypassing Excel and therefore also bypassing Excel’s limitation of 1 million rows).

Querying big data sources like Hadoop might not be a common use case, but there may still be cases where this is useful. And of course we could filter down the file list (for example, to read only files starting with a given date) or the file content (with the reader function from above).
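
As a sketch in M, such a filter on the file list could look like this (assuming the listing step is called Source and the file names begin with a date):

    // keep only files whose name starts with the given prefix
    FilteredFiles = Table.SelectRows(Source, each Text.StartsWith([Name], "2013"))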

So, Power Query and the embedded M formula language offer very powerful tools for retrieving and transforming (reshaping) data. In most cases no or only little coding is required, and the step-by-step engine allows you to go back to any specific point in the transformation at any time. The use cases for Power Query are almost endless, and this tool is definitely a big step ahead in bringing Business Intelligence power to end users.

Thursday, October 10, 2013

OECD PIAAC Study 2013 data available for analysis

Data Analysis | R | Tableau

The recent PIAAC study by the OECD covers the skills of adults in 24 countries. You can find the study results together with background information here: http://www.oecd.org/site/piaac/. The study is very interesting as it is the first study targeting adult skills in these countries. Fortunately, the OECD has made the case data of the study available for download in the ‘public use files’ section. Here is the direct link to the files: http://vs-web-fs-1.oecd.org/piaac/puf-data

Data is available in SPSS and SAS format. I went for the SPSS format, downloaded all files (all countries except Australia) and processed the files using R. Here is the script I created for this purpose:

# OECD PIAAC Study 2013
# Interesting links:
#   http://www.oecd.org/site/piaac/
#   http://www.oecd.org/site/piaac/publicdataandanalysis.htm
#   SPSS files: http://vs-web-fs-1.oecd.org/piaac/puf-data

library(foreign)
library(stringr)
library(reshape2)

setwd("C:\\temp\\OECD_PIAAC_Study_2013_SPSS")
file_list <- list.files(path=".")
# read all SPSS files and append them into a single data frame
# (if/else prevents the first file from being read twice)
for (file in file_list){
  if (!exists("piaac")){
    piaac <- read.spss(file, to.data.frame=TRUE)
  } else {
    temp_dataset <- read.spss(file, to.data.frame=TRUE)
    piaac <- rbind(piaac, temp_dataset)
    rm(temp_dataset)
  }
}

# only include this set of variables:
collist_meta<-c('CNTRYID_E','GENDER_R','AGEG5LFS','B_Q01a','YEARLYINCPR')

# also include all score columns (10 per domain)
collist_lit<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVLIT"]
collist_num<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVNUM"]
collist_psl<-names(piaac)[str_sub(names(piaac),start=1, end=5)=="PVPSL"]
collist<-c(collist_meta,collist_lit,collist_num,collist_psl)
piaac2<-piaac[collist]

# combine score values to single values
p <- data.frame(piaac2[collist_meta],
                "Score_Lit" = rowMeans(piaac2[collist_lit]),
                "Score_Num" = rowMeans(piaac2[collist_num]),
                "Score_Psl" = rowMeans(piaac2[collist_psl]))

# reshape to long format and remove incomplete cases
p<-melt(p, id=collist_meta)
p<-p[!is.na(p$value) & !is.na(p$CNTRYID_E),]

# write output to csv-file
write.table(p, col.names=T, row.names=F, file="c:\\temp\\piaac.csv", sep=",", dec=".", na="", qmethod="double")

Of course, we could include additional variables, but for the moment let’s focus on the variables from the script above. A full list with all variables and details is available via the “International Codebook” link in the “public use files” section.

The values can then be imported to Tableau desktop for further analysis.

[screenshot]

The official charts look more dramatic (click here for an example) as the value axis does not start at zero. This is a common method of making results look more impressive or – as Prof. Hichert says – it adds a certain “lie factor” to the visualization, where it is up to the creator of the chart to scale this factor to whatever effect is desired.

A result that somewhat surprised me was that skills decrease with higher age, as shown below for the reading skill (Lit) in Germany:

[screenshot]

Here is the map visualization (from red through grey to green, again for the reading skills) for the European region; Tableau automatically maps country names to geographic regions.

[screenshot]

Another interesting chart shows the relationship between income (shown as quantiles here) and test score (again for the reading skills, showing some selected countries):

[screenshot]

And finally the comparison between highest education (according to the International Standard Classification of Education) and score (again shown for reading skills) for Germany:

[screenshot]

I’m not going into interpretations here; there is a lot of material available on the OECD website. The study is also a great source of demographic data. So if you’d like to discover more, maybe the R script above can help you get started.

Sunday, October 6, 2013

Power Query: How to unpivot a dynamic number of columns

Power Query

A powerful feature of Power Query is the ability to unpivot a given data set, which means rotating data from columns to rows. This is useful for a lot of statistical data sets that you will find on the web, because those data sets usually have the time (for example, the year) on the columns. In order to process the data, you need to unpivot it first.

For my example, in order to be able to modify the table, I’m using a simple Excel table which looks like this:

[screenshot]

I’m using this table as the source for a new query in Power Query:

[screenshot]

In order to transform the columns into rows, I select all columns with years and choose unpivot from the context menu:

[screenshot]

Here is the result:

[screenshot]

This was quite easy. For my example, I’m leaving the query like this (usually you would go ahead and rename the columns etc.).

Now, what happens if a new column is added to my source table? Let’s assume we’re adding a column for the year 2014.

By clicking Refresh in the context menu of the Excel table that resulted from my query, the query is executed again. The result looks like this:

[screenshot]

As you can see, the year 2014 is not included in the unpivot operation but became an additional column. This becomes clear if we look at the generated M script:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value")
in
    Unpivot

Since the Table.Unpivot function explicitly names each column that is to be included in the unpivot operation, the new column is not reflected in the query script.

In the analytical language R, this task would be easier, since the melt function, R’s counterpart for unpivoting data, takes the columns that are to be kept fixed when unpivoting. So, assuming the table above has been loaded in R as a data frame df, the unpivot operation would look like this:

df_unpivot<-melt(df, id=c("Product"))

But let’s get back to Power Query. In order to make our query aware of a different number of columns, we need to replace the constant column list with a variable one. Let’s do it step by step by modifying the script above.

First, we need a list of all the columns from our input table (Source):

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value"),
    Cols = Table.ColumnNames(Source)
in
    Cols

The modification here is the added line Cols = Table.ColumnNames(Source) and the changed in-clause. Please be aware of the comma at the end of the line starting with Unpivot=…

Also, quite interestingly, we still have the Unpivot transformation from above in the M script; we’re just not showing it, as we use our newly created variable Cols as the output (in-clause). So the result is the list of column names from our table:

[screenshot]

Usually the first columns are to be kept fixed for the unpivot operation. So here the function List.Skip is useful: we just skip the first column in order to get all the columns with the year values:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    Unpivot = Table.Unpivot(Source,{"2008", "2009", "2010", "2011", "2012", "2013"},"Attribute","Value"),
    Cols = Table.ColumnNames(Source),
    ColsUnPivot=List.Skip(Cols, 1)
in
    ColsUnPivot

Again, the modification is the added line ColsUnPivot=List.Skip(Cols, 1) and the changed in-clause (and again, take care of the comma). This gives us the desired list of columns:

[screenshot]

Now all we have to do is replace the constant list in the Unpivot function with the newly generated list ColsUnPivot. I’m moving the Unpivot operation to the end of the script and also using it as the query output. Here’s the resulting script:

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],   
    Cols = Table.ColumnNames(Source),
    ColsUnPivot=List.Skip(Cols, 1),
    Unpivot = Table.Unpivot(Source,ColsUnPivot,"Attribute","Value")
in
    Unpivot

Not surprisingly, the query returns exactly the same output as before. In order to see the difference, let’s add the year 2014 to our source table again and refresh the query:

[screenshot]

As you can see, the result now contains the year 2014 on the rows, not in an extra column. This is exactly what we were trying to achieve.

Update Feb. 13, 2014: In the current version of Power Query, there is a new function Table.UnpivotOtherColumns available, which works much like the melt approach in R and simplifies the process shown here a lot. I’ll still leave this post online for your information.
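
For reference, here is a minimal sketch using the new function (assuming, as in the example above, that Product is the only fixed column):

let
    Source = Excel.CurrentWorkbook(){[Name="ProductPlan"]}[Content],
    // unpivot everything except the fixed Product column
    Unpivot = Table.UnpivotOtherColumns(Source, {"Product"}, "Attribute", "Value")
in
    Unpivot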