Search This Blog

Friday, May 15, 2015

Incremental data loads in Microsoft Power Query for Excel

Power Query

Power Query is a great and flexible tool for getting and transforming data from different sources into Excel or Power Pivot. The standard procedure for Power Query is to read a full table and to replace the destination with the result of the current query. This is exactly what we need for most use cases of Power Query. However, if you like to add new data to existing data you can still use Power Query but you have to follow a slightly different approach.

For example, let’s say we want to create a list of the blog posts while the RSS feed only delivers the most recent posts. In this case we would need to add the the query results to the existing data.

Let’s start with my olap blog. The RSS feed delivers only the last 25 entries of my blog. In order to load this with Power Query, I’m using the load from web function:

image

The dialog asks for the URL (I’m using my blog’s mirror here at http://ms-olap.blogspot.com/feeds/posts/default)

image

If Power Query does not automatically detect this as an XML table you can insert the XML.Tables(…) function as shown below:

image

The items are stored in the last column (entry table) which can be expanded by clicking on image button right between “entry”. For my example I’m only interested in the publication data and the title.

image

By expanding this table we get one row per blog post. In order the get the title, I’m expanding the last column (in my case title.1 since there is already a column title for the title of the blog) to its text value:

image

Finally, I changed the type of the first column to DateTime and I renamed the columns as shown below:

image

The full script generated by Power Query so far looks like this:

let
    Source = Xml.Tables(Web.Contents("
http://ms-olap.blogspot.com/feeds/posts/default")),
    #"Expand entry" = Table.ExpandTableColumn(Source, "entry", {"published", "title"}, {"published", "title.1"}),
    #"Expand title.1" = Table.ExpandTableColumn(#"Expand entry", "title.1", {"Element:Text"}, {"Element:Text"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expand title.1",{"id", "updated", "category", "title", "subtitle", "link", "author", "generator", "
http://a9.com/-/spec/opensearchrss/1.0/"}),
    #"Split Column by Position" = Table.SplitColumn(#"Removed Columns","published",Splitter.SplitTextByPositions({0, 19}, false),{"published.1", "published.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Position",{{"published.1", type datetime}, {"published.2", type text}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"published.2"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"published.1", "Date"}, {"Element:Text", "Title"}})
in
    #"Renamed Columns"

We can now load this result to the Workbook by clicking the “Close & Load” button. Now, here comes the interesting part. Up till now, the RSS feed only delivers the last 25 rows, so whenever there is a new post, the oldest post would be removed from the table. Let’s say we’d like to add new posts to the table.

To do so, I’m creating a new query named “Existing data” from the data we just loaded into the workbook:

image

Using the “Save & Load To …” function in the ribbon, we can choose to only create a connection (no data is actually moved here):

image

The workbook queries pane now looks like this:

image

We can now edit the first query (“BlogPosts”) again. In the home tab of the Power Query ribbon you can find the options for combining queries.

image

In my case, I simple decide to append the new data to the existing one. If you have a key to match the same rows, you could also do a merge based on the key column (for example if there are updated rows).

image

The result is that we now have 25 rows from the existing table plus 25 rows recently read from the blog. At this point of time, the entries are identical so we have 25 duplicates. I remove the duplicated rows here using the “Remove duplicates” function.

image

We can now save & load the query again. In order to demonstrate the effect, I’m pasting the remaining blog posts (which cannot be pulled by the RSS feed) into the Excel table:

image

Summary
While full loads usually are the most usefull approach for loading data with Power Query, you can also build incremental load procedures by joining or appending the new data to the existing data.

11 comments:

  1. Can Power Query update existing rows as well (identified by a key). I'm thinking of an approach where I get feeds of new data, in addition to changes to some existing data. Thanks!!

    ReplyDelete
  2. Hi Dan, If you like to update existing rows, you would use the "Merge Queries" function (instead of "Append Queries"). Just make sure not the check the "only include matching rows" check box as you still want to preserve your existing rows if no updated Information is available. The merge criteria would be your key column. This would result in a data set with the old/existing columns and the new columns. You can then apply a formula that uses the new value if it exists (i.e. if it isn't empty) or the old value otherwise. This should solve your requirement.

    ReplyDelete
  3. How would you do this in power bi desktop? You can create a table based on the load of the first query, but how do you then create a second query based off of that table?

    ReplyDelete
  4. In order to build a query based on the output of an existing query in Power BI Desktop you just create a new (blank) query and type = in the command line for the first step. You should then get the output of the existing query.

    ReplyDelete
  5. I tried that. If you create a query like this, it doesn't let you use that second query to append it to the first one.

    ReplyDelete
  6. I suppose you can't do this trick in Power BI Desktop. It let's you use an existing query but then you cannot load into the same source. In Excel Power Query I used Excel as an additional table store to separate the processes but this functionality isn't available in Power BI Desktop.

    ReplyDelete
  7. So the "Existing data" itself HAS to be the result of a previous Power Query for this to work, right?

    You can't append any old data to an existing table?

    ReplyDelete
  8. The existing data is a query, however the query returns the existing data from the Excel sheet. So basically, you load the existing data from the Excel table, merge it with the new data and finally write the result to the same Excel table. As mentioned above, this only works in Excel (Get & Transform Data), not in Power BI Desktop.

    ReplyDelete
  9. Hi is it possible to pull the data for the existing data enquiry, without that first enquiry outputting to a table.

    I'm trying to have incremental load, which will directly feed into a connected pivot-table and wish to avoid having to have all the raw data in a table in the same workbook

    ReplyDelete
  10. Unfortunately not (at least with Power BI in Excel): The query needs to return all the data rows. Therefore we had to query the existing rows before to implement delta loads. Basically, Power BI and Excel Get&Transform both are not designed for this scenario. If you're dealing with more sophisticated data management topics (like delta loads), you may want to implement data management in a database (e.g. Azure SQL Database).

    ReplyDelete
  11. I have a created a large dataset with the incremental data loads and incremental updates. Is there a way to filter down the existing dataset first before I do the incremental update and then append the rest of the dataset? (I would prefer not to run a merge across all the rows when I know which rows may have an update)

    ReplyDelete