
Friday, May 29, 2015

Polybase in SQL Server 2016 CTP 2


One of the exciting new features in SQL Server 2016 that is already available in the public preview version (CTP 2) is Polybase, the transparent access bridge from SQL to Hadoop. This functionality has been available in the Microsoft Analytics Platform System (APS) since version 2012 and has now made its way into SQL Server. You can find a lot more information, for example, in this blog post: http://blogs.technet.com/b/dataplatforminsider/archive/2014/06/02/polybase-in-aps-yet-another-sql-over-hadoop-solution.aspx
This blog post covers the first steps for using Polybase in SQL Server 2016 Community Technology Preview 2 (CTP 2).
First of all, you need to install Polybase functionality during the installation process by checking the appropriate option:



This installs two services for Polybase as shown below:



Also, when opening a database in SQL Server Management Studio you’ll see some new objects:



External tables are your gateway to Hadoop. Later in this post we will create an external table over a Hadoop file, which can then be used much like a normal SQL Server table.
External data sources define the connection to the Hadoop cluster (more precisely to the Hadoop file system, HDFS), while external file formats are used to define the structure of the file on the Hadoop cluster.
Before we can start, we need to configure SQL Server for Hadoop connectivity. The following code is an example:

EXEC sp_configure 'hadoop connectivity', 5;
GO
RECONFIGURE;
GO 
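To check which value is currently active, you can call sp_configure without a new value and look at the run_value column:

EXEC sp_configure 'hadoop connectivity';
GO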


You can use the following configuration values:
  • 0: no Hadoop connectivity (default)
  • 3: enable connectivity to Cloudera CDH for Linux
  • 4: enable connectivity to Hortonworks Data Platform for Windows Server (HDP), HDInsight on Analytics Platform System, or HDInsight’s Microsoft Azure blob storage
  • 5: enable connectivity to Hortonworks Data Platform (HDP) for Linux

In my case, I’m using a Hortonworks Sandbox on Linux, so I’m using the configuration value 5.
Also, the user that is used to access Hadoop has to be configured in the Polybase configuration file, which can be found in the Binn\Polybase\Settings subdirectory of the instance, in my case:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

The default user is pdw_user, which I changed to ‘hue’ here:
 



To make all these changes active, I had to restart SQL Server and the two Polybase services.

Now, let’s create an external data source. You can create a code snippet from the context menu of the external data sources folder. Here is the code I’m using for my sandbox:

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)
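
If you also want Polybase to be able to push computation down to the cluster, the external data source can additionally point to the job tracker endpoint. The following is just a sketch for my sandbox; the option name comes from the APS syntax and the port is an assumption, so both may differ in your environment:

CREATE EXTERNAL DATA SOURCE HDP2_Pushdown WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020',
    -- assumption: the sandbox YARN resource manager listens on port 8050
    JOB_TRACKER_LOCATION = 'sandbox.hortonworks.com:8050'
)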


Next, we need to create an external file format. I’d like to access the tab-delimited file sample_07 from Hive, which ships with the sandbox. Therefore the external file format looks like this:

CREATE EXTERNAL FILE FORMAT TSV
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '\t',
        DATE_FORMAT = 'MM/dd/yyyy'
    )
)


The date format is not needed in my case (there is no date column in this file) but is included just to show the functionality.
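
Besides the field terminator and the date format there are more options, for example for compression. The following is just a sketch (not used further in this post; the gzip codec string is an assumption and may not be supported in every CTP 2 scenario):

CREATE EXTERNAL FILE FORMAT TSV_GZ
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '\t'
    ),
    -- assumption: gzip-compressed source files
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
)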

The two new server objects now appear in the “External Resources” section of the database.



Next, we can create the external table itself using the defined data source and file format.

CREATE EXTERNAL TABLE sample_07
(
    code nvarchar(255),
    description nvarchar(255),
    total_emp int,
    salary nvarchar(255)
)
WITH
(
    LOCATION = '/apps/hive/warehouse/sample_07',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TSV,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
)


The new external table appears in the “External Tables” folder of the database.
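
You can also verify the new objects from T-SQL via the catalog views that ship with Polybase (names as of CTP 2, so they may still change):

SELECT * FROM sys.external_tables;
SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats;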



We can now query the table like any other table in the database, for example:

SELECT * FROM Sample_07




Here is another example:

SELECT LEFT(code, 2) code2, SUM(total_emp) total_emp, COUNT(*) [Count]
FROM Sample_07
GROUP BY LEFT(code, 2)


And of course you can also write queries with joins between Polybase external tables and local SQL Server tables, which gives a transparent SQL query experience over Hadoop data.
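
For illustration, such a join could look like this; dbo.OccupationGroups is a hypothetical local table that maps the two-digit code prefix to a group name:

SELECT g.GroupName, SUM(s.total_emp) total_emp
FROM sample_07 s
INNER JOIN dbo.OccupationGroups g ON LEFT(s.code, 2) = g.CodePrefix
GROUP BY g.GroupName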

Final notes
  • The functionality and syntax may change in the final product, so this only applies to CTP 2
  • Currently, writing to a Hadoop file is not supported (APS offers the CREATE EXTERNAL TABLE AS SELECT syntax for this)
  • This post is just a brief overview. There are many more options (for example more file formats, compression, etc.) that can be used. A good orientation is the Microsoft APS support tools download, which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=45294

Friday, May 15, 2015

Incremental data loads in Microsoft Power Query for Excel


Power Query is a great and flexible tool for getting and transforming data from different sources into Excel or Power Pivot. The standard procedure for Power Query is to read a full table and to replace the destination with the result of the current query. This is exactly what we need for most use cases of Power Query. However, if you’d like to add new data to existing data, you can still use Power Query, but you have to follow a slightly different approach.

For example, let’s say we want to create a list of blog posts while the RSS feed only delivers the most recent posts. In this case we would need to add the query results to the existing data.

Let’s start with my OLAP blog. The RSS feed delivers only the last 25 entries of my blog. In order to load this with Power Query, I’m using the “From Web” function:

[Screenshot: the “From Web” option in the Power Query ribbon]

The dialog asks for the URL (I’m using my blog’s mirror here at http://ms-olap.blogspot.com/feeds/posts/default)

[Screenshot: the “From Web” dialog with the feed URL]

If Power Query does not automatically detect this as an XML table, you can insert the Xml.Tables(…) function as shown below:

[Screenshot: the query editor with the Xml.Tables source step]
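
The source step then looks like this (it is the same expression you’ll see in the full script further below):

= Xml.Tables(Web.Contents("http://ms-olap.blogspot.com/feeds/posts/default"))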

The items are stored in the last column (entry table), which can be expanded by clicking the expand button next to “entry”. For my example I’m only interested in the publication date and the title.

[Screenshot: expanding the entry table to the published and title columns]

By expanding this table we get one row per blog post. In order to get the title, I’m expanding the last column (in my case title.1, since there is already a column title for the title of the blog) to its text value:

[Screenshot: expanding title.1 to its Element:Text value]

Finally, I changed the type of the first column to DateTime and I renamed the columns as shown below:

[Screenshot: the result with the renamed Date and Title columns]

The full script generated by Power Query so far looks like this:

let
    Source = Xml.Tables(Web.Contents("http://ms-olap.blogspot.com/feeds/posts/default")),
    #"Expand entry" = Table.ExpandTableColumn(Source, "entry", {"published", "title"}, {"published", "title.1"}),
    #"Expand title.1" = Table.ExpandTableColumn(#"Expand entry", "title.1", {"Element:Text"}, {"Element:Text"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expand title.1",{"id", "updated", "category", "title", "subtitle", "link", "author", "generator", "
http://a9.com/-/spec/opensearchrss/1.0/"}),
    #"Split Column by Position" = Table.SplitColumn(#"Removed Columns","published",Splitter.SplitTextByPositions({0, 19}, false),{"published.1", "published.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Position",{{"published.1", type datetime}, {"published.2", type text}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"published.2"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"published.1", "Date"}, {"Element:Text", "Title"}})
in
    #"Renamed Columns"

We can now load this result to the workbook by clicking the “Close & Load” button. Now, here comes the interesting part. The RSS feed only delivers the last 25 posts, so whenever there is a new post, the oldest one would drop out of the table. Let’s say we’d like to add new posts to the table instead.

To do so, I’m creating a new query named “Existing data” from the data we just loaded into the workbook:

[Screenshot: creating the “Existing data” query from the loaded workbook table]
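
Behind the scenes, this query simply reads the table that was just loaded into the workbook. In M it looks roughly like this (assuming the loaded table is named “BlogPosts”):

let
    // read the table that the BlogPosts query loaded into the workbook
    Source = Excel.CurrentWorkbook(){[Name="BlogPosts"]}[Content]
in
    Source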

Using the “Close & Load To …” function in the ribbon, we can choose to only create a connection (no data is actually moved here):

[Screenshot: the load options with “Only Create Connection” selected]

The workbook queries pane now looks like this:

[Screenshot: the Workbook Queries pane with both queries]

We can now edit the first query (“BlogPosts”) again. In the home tab of the Power Query ribbon you can find the options for combining queries.

[Screenshot: the combine options in the Power Query ribbon]

In my case, I simply append the new data to the existing data. If you have a key to match rows, you could instead do a merge based on the key column (for example, if there are updated rows).

[Screenshot: the Append dialog]

The result is that we now have 25 rows from the existing table plus the 25 rows recently read from the blog. At this point the two sets are identical, so we have 25 duplicates. I remove them using the “Remove Duplicates” function.

[Screenshot: the “Remove Duplicates” function]
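
In the script, the append and the duplicate removal correspond to two additional steps at the end of the query, roughly like this (the step names are generated by Power Query and may differ in your workbook):

    #"Appended Query" = Table.Combine({#"Renamed Columns", #"Existing data"}),
    #"Removed Duplicates" = Table.Distinct(#"Appended Query")
in
    #"Removed Duplicates"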

We can now close & load the query again. In order to demonstrate the effect, I’m pasting the remaining blog posts (which cannot be pulled via the RSS feed) into the Excel table:

[Screenshot: the Excel table with the older posts pasted in]

Summary
While full loads are usually the most useful approach for loading data with Power Query, you can also build incremental load procedures by appending or merging the new data with the existing data.