Tuesday, February 17, 2015

Querying Hadoop from SQL Server

SQL Server 2012 | SQL Server 2014

Microsoft's Analytics Platform System (APS) offers built-in, transparent access to Hadoop data sources through the Polybase technology. This includes bidirectional access not only to Hadoop but also to cloud services. The SMP SQL Server currently doesn't include Polybase, so access to Hadoop needs to be handled differently. Will Polybase be available in an upcoming SMP SQL Server? In the past we have seen technology making its way from the PDW to the SMP SQL Server, for example the clustered columnstore index, the new cardinality estimation or the batch mode table operations. So let's hope that Polybase makes it into the SMP SQL Server soon. Until then, one option is to use the Hortonworks ODBC driver and a linked server. To be clear, Polybase is a much more powerful technology since it uses cost-based cross-platform query optimization, which includes the ability to push down tasks to the Hadoop cluster when it makes sense. Also, Polybase doesn't rely on Hive but accesses the files directly in parallel, which gives great performance. Linked servers are less powerful but may still be useful in some cases.

So, here we go. First, you need to download the ODBC driver from the Hortonworks add-ons page: http://hortonworks.com/hdp/addons/.

Make sure you pick the right version (32 bit/64 bit) for your operating system. After the installation completes, we need to set up an ODBC connection. To do so, start the ODBC Data Source Administrator (Windows+S, then type 'ODBC'). Again, make sure to start the correct version (32 bit/64 bit). The installer has already created a connection, but you still need to supply the connection properties. I created a new connection instead:

p1

I'm connecting to the Hortonworks Sandbox here (HDP 2.1; I had problems connecting to HDP 2.2 with the current version of the ODBC driver). Instead of the host name you can also enter the IP address (usually 127.0.0.1 for the sandbox), but in order to get other tools running as well (like the Redgate HDFS Explorer) I configured the sandbox virtual machine to use a bridged network and put the sandbox's IP address on that network (console command "ip addr") into my local hosts file.

You should now click on Test to verify that the connection actually works:

p2

In SQL Server Management Studio we can now create a linked server connection to the Hadoop system using the following command:

EXEC master.dbo.sp_addlinkedserver
@server = N'Hadoop',
@srvproduct=N'HIVE',
@provider=N'MSDASQL',
@datasrc=N'HDP',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hue;'

Depending on your Hadoop security settings, you might need to provide a password in the provider string as well. The @server name is used to refer to the linked server later, while @datasrc names the ODBC connection (see "Data Source Name" in the configuration dialog of the connection above).
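If you prefer not to embed credentials in the provider string, you could also map them with sp_addlinkedsrvlogin. This is just a sketch: the password is a placeholder, and whether the mapped credentials are actually picked up depends on the ODBC driver and your cluster's security configuration.

EXEC master.dbo.sp_addlinkedsrvlogin
@rmtsrvname = N'Hadoop',
@useself = N'False',
@locallogin = NULL,
@rmtuser = N'hue',
@rmtpassword = N'********'  -- placeholder, replace with the Hive user's password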

With the new linked server, we can now explore the Hive database in Management Studio:

p3

In order to run a query on, for example, table "sample_07", you can use one of the following commands:

select * from openquery (Hadoop, 'select * from Sample_07')

or

select * from [Hadoop].[HIVE].[default].[sample_07]

For both queries, “Hadoop” refers to the name of the linked server (@server parameter in the SQL statement from above).
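One practical difference: with openquery the inner statement is executed by Hive itself, so filters and aggregations can be evaluated on the cluster before the rows are transferred to SQL Server. A small sketch (the salary threshold is arbitrary):

-- the filter runs in Hive, only matching rows are sent to SQL Server
select *
from openquery(Hadoop, 'select code, description, salary from sample_07 where salary > 50000')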

If you get the following error message, this means that you are not allowed to query the table:

OLE DB provider "MSDASQL" for linked server "Hadoop" returned message "[Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: HiveAccessControlException Permission denied. Principal [name=hue, type=USER] does not have following privileges on Object [type=TABLE_OR_VIEW, name=default.sample_07] : [SELECT]'.".
Msg 7306, Level 16, State 2, Line 1
Cannot open the table ""HIVE"."default"."sample_08"" from OLE DB provider "MSDASQL" for linked server "Hadoop".

In this case, you should simply grant the user from your ODBC connection the SELECT privilege. To do so, run the following query in Hive:

grant select on sample_07 to user hue;

That’s it. You should now get the contents of the table in SQL Server:

p4

You might want to set the length of string columns manually because Hive does not return the size of a string column (in Hive, the column type is simply "string"). The size returned by the query results from the advanced ODBC settings of our connection. I left everything on default here, so this is how it looks:

image

So, the default string column length is 255 here. Let's check and copy the data over to SQL Server:

select * into sample_07 from [Hadoop].[HIVE].[default].[sample_07]

The resulting table looks like this:

image

To have more precise control over the column lengths, you should use the convert function here, for example:

select
convert(nvarchar(50),[code]) [code],
convert(nvarchar(80),[description]) [description],
total_emp,
salary
from [Hadoop].[HIVE].[default].[sample_07]

Be careful with the remaining settings in the advanced options dialog. For example, checking "Use native query" means that the query (openquery syntax) is passed as-is to Hive. This can be intended, to fully leverage specific features of Hive, but it can also lead to errors if you're not familiar with the HiveQL query syntax. Also, to get better performance with larger tables you might want to adjust the "Rows fetched per block" option to a larger value.
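For illustration, with native queries you can pass HiveQL-specific constructs straight through, for example a limit clause. This is only a sketch based on the sample table from above:

-- the inner statement is HiveQL, not T-SQL
select *
from openquery(Hadoop, 'select code, avg(salary) avg_salary from sample_07 group by code limit 10')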

With HDP 2.2 you should also be able to write to the table (create a new table, grant all permissions and run an insert into), but I couldn't get this to work on my HDP 2.1 machine.
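I couldn't verify it here, but a write through the linked server could look roughly like the following sketch. It assumes that RPC Out is enabled for the linked server (needed for the EXEC ... AT call), that the hypothetical target table sample_07_copy may be created, and that the Hive user has the corresponding permissions:

-- create the target table directly in Hive (hypothetical table name)
EXEC ('create table sample_07_copy (code string, description string, total_emp int, salary int)') AT Hadoop

-- copy the rows over through the linked server
insert into openquery(Hadoop, 'select code, description, total_emp, salary from sample_07_copy')
select code, description, total_emp, salary
from [Hadoop].[HIVE].[default].[sample_07]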

Summary

Until Polybase makes it into the SMP SQL Server product, Hadoop data may be queried from SQL Server using the ODBC driver and a linked server object. This could also be an option for Analysis Services to connect to Hadoop by using SQL Server views over the linked server, since Analysis Services doesn't support ODBC in multidimensional mode. However, Polybase on the APS gives much better performance because of its intelligent cross-platform query optimizer, and Polybase can also be used to write data to Hadoop, so I hope we'll find this technology in the SMP SQL Server soon.
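Such a view could be as simple as the following sketch (the view name is made up; it reuses the convert pattern from above so that the string columns get a fixed length):

create view dbo.v_sample_07
as
select
convert(nvarchar(50),[code]) [code],
convert(nvarchar(80),[description]) [description],
total_emp,
salary
from [Hadoop].[HIVE].[default].[sample_07]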

Tuesday, February 10, 2015

Hive file format comparison

Apache Hive 0.11+

In this post I'd like to compare the different file formats for Hive, as well as the execution times for queries depending on the file format, compression and execution engine. As for the data, I'm using the Uber data set that I also used in my last post. I'm aware that the following query stats are not exhaustive and you may get different results in your environment or with other table formats. Also, you could try different SerDes for Hive as well as different compression settings. Still, it gives you some idea that both the file format and the execution engine play an important role for Hive's query performance. However, when choosing a file format, you should also consider data management aspects. For example, if you get your source files in CSV format, then you will likely process the files in this format at least during the first processing step.

As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here is the result for the file formats I tested:

image

Here is some information about the different file formats being used here:

File format and description:

textfile: delimited text file (for example tab-separated fields)

rcfile: internal Hive format (binary)

orc: columnar storage format (highly compressed, binary, introduced with Hive 0.11)
Link: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

parquet: columnar storage format (compressed, binary)
Link: http://parquet.incubator.apache.org/
Parquet is supported via a plugin in Hive since version 0.10 and natively in Hive 0.13 and later.

avro: serialization file format from Apache Avro (contains schema and data, tools available for processing)
Link: http://avro.apache.org/
Avro is supported in Hive since version 0.9.
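To give an idea of the test setup, creating the tables for the comparison could look like the following sketch (the table names are made up; the source data is assumed to be available as the Hive table uber from my last post):

-- choose the execution engine for the test queries (tez or mr)
set hive.execution.engine=tez;

-- CTAS into the different storage formats
create table uber_orc stored as orc as select * from uber;
create table uber_parquet stored as parquet as select * from uber;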

For the table create/write time I measured a "create table as select" (CTAS) into the specific format. As you can see, the resulting size of the table depends a lot on the file format. The columnar ORC file format compresses the data in a very efficient way:

image

Using Tez as the execution engine (set hive.execution.engine=tez) results in much better performance compared to map reduce (set hive.execution.engine=mr). The total time for the two queries is shown in this table:

image

In map reduce mode, query time does not seem to depend too much on the file format being used:

image

However, when running the queries on Tez, you'll see a significant difference between file formats like Parquet and ORC (with ORC being about 30% faster):

image

Tuesday, February 3, 2015

Geo spatial data support for Hive

Hive 0.1x

Apache Hive doesn't have support for geospatial functions out of the box. However, it's easy to extend the functionality by using external libraries that can be called via user-defined functions (UDFs). This post shows how to use the ESRI Hadoop libraries in Hive. I'm using the Hortonworks HDP sandbox here, so it's easy to reproduce the examples below.

 

Preparation

In order to use the libraries, they first need to be installed. I'm using the following script to download, build and deploy the libraries to the Hadoop file system. Please note that you may have to adjust the version number of the jar files. You will also need Maven to build the libraries, so install it prior to running the following commands:

# create working directories
cd /tmp
mkdir gis
cd gis
mkdir framework
mkdir api

# download and build framework
cd framework
wget https://github.com/Esri/spatial-framework-for-hadoop/archive/master.zip
unzip master
cd spatial-framework-for-hadoop-master
mvn clean package
hadoop fs -put hive/target/spatial-sdk-hive-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put json/target/spatial-sdk-json-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put hive/function-ddl.sql /user/esri

# download and build api
cd /tmp/gis/api
wget https://github.com/Esri/geometry-api-java/archive/master.zip
unzip master
cd geometry-api-java-master
mvn clean package
hadoop fs -put target/esri-geometry-api-1.2.jar /user/esri

# clean up
cd /tmp
rm -rf gis

If successful you should see the following files in your Hadoop file system, directory /user/esri:

image

 

Making the ESRI functions available in Hive

After building the ESRI libraries, we still need to declare the new user-defined functions within Hive. First we start with the jar files:

add jar hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/spatial-sdk-json-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/esri-geometry-api-1.2.jar;

Running a "list jars" command in Hive shows that the libraries have been loaded successfully:

image

Please note that you can add the add jar commands to the .hiverc file. You can find more details on that file in this blog post.

Next we have to declare the spatial functions. You will find the full declaration of all functions in the function-ddl.sql file that I also placed in HDFS under /user/esri (as shown in the screenshot above) using the build script. However, for this post, I'm only using a few functions with the following declarations:

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
create temporary function ST_Length as 'com.esri.hadoop.hive.ST_Length';
create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';

Starting with Hive 0.13 you can also declare these functions permanently by omitting the "temporary" keyword from the declarations above. This has the advantage that you do not need to declare the functions for every session. You can also include the jar file in the create function statement, which makes it easier to create a permanent declaration.

For example, for the definition of the ST_Point function you would write the following SQL statement:

create function ST_Point as 'com.esri.hadoop.hive.ST_Point' using jar 'hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar';

Using the ESRI functions in Hive

In order to demonstrate the ESRI geospatial functions, I'm using the Uber data set with 25,000 Uber routes from San Francisco.

The first rows of the data are looking like this:

image

Each trip has an id (1 to 25,000) together with the latitude and longitude at each point in time. For example, using window functions we can easily calculate the total duration of each trip (by using first_value to get the starting time and restricting the result set to the last entry per trip):

-- get trip duration
select id, unix_timestamp(dt)-unix_timestamp(dt_first) trip_duration
from (
select
id,dt,
first_value(dt) over (partition by id order by dt) dt_first,
row_number() over (partition by id order by dt desc) lnr
from uber
) Sub
Where lnr=1;

image

 

Ok, so far we haven't used any of the geospatial functions we declared above. Let's say we'd like to calculate the distance and average speed for each trip. The idea is to calculate the distance between two consecutive GPS points along the trip. To do so, I'm using window functions again to retrieve the preceding GPS coordinates along the way, as shown below:

select
id,
dt,
latitude,
longitude,
lag(latitude,1) over (partition by id order by dt) prev_latitude,
lag(longitude,1) over (partition by id order by dt) prev_longitude
from uber;

Looking at a very short trip shows how we now have the preceding coordinates in the same row.

image

In order to calculate the distance between the coordinates, we first create a line between the two coordinates using the following query (the blue part is exactly the query from above):

Select
id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from
(select id, dt, latitude, longitude,
lag(latitude,1) over (partition by id order by dt) prev_latitude,
lag(longitude,1) over (partition by id order by dt) prev_longitude
from uber
) Sub;

image

The line is not visualized in the query editor, so we only see its binary representation. The next step is a little bit tricky because we need to calculate the length of the line. Here, the function ST_GeodesicLengthWGS84 can be used as shown below (again, the blue part is the previous query). This function returns the distance in meters on a spheroid; however, the correct SRID has to be set first:

select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
from
(
Select id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
) Sub1

image

The only thing left is to aggregate the distance and the time difference. Since the time difference is in seconds and the distance is in meters, we can calculate the speed in km/h as distance/1000 * (3600 / time_difference). This results in the following query (the result is again only shown for trip no. 44):

select id, min(dt) start_time, max(dt) end_time, Sum(Distance) distance, Sum(Distance)/1000 * (3600.0/Sum(time_passed)) avg_speed
from
(select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance from
(Select id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
) Sub1
) Sub2
group by id;

image

 

Ok, this looks like a rather short trip, but it still illustrates the idea. Here are the waypoints of this short trip (PowerView):

image

 

Now let's assume we're interested in all trips that go to the airport. In order to run this query, we need to define the area of the airport. I'm using Google Earth for this purpose here and simply draw a polygon around my target area:

image

You can then export the polygon to a .kml file. Since I have a very simple polygon here, I transferred the coordinates from the .kml file manually into the geospatial polygon type:

kml file:

<LinearRing>
<coordinates>
-122.3939778660149,37.61286698251174,0
-122.387217599775,37.61405599476924,0
-122.3886445577052,37.61737414129696,0
-122.3948268153237,37.61592559803017,0
-122.3939778660149,37.61286698251174,0
</coordinates>
</LinearRing>

ESRI polygon:

ST_Polygon(
-122.3939778660149,37.61286698251174,
-122.387217599775,37.61405599476924,
-122.3886445577052,37.61737414129696,
-122.3948268153237,37.61592559803017,
-122.3939778660149,37.61286698251174
)

In order to find all trips that end at the airport, we just need to get the end point of each trip and use ST_Intersects to check whether this end point lies within the polygon. Here is the resulting query:

select count(*) CntTripsToAirport
from (
select id, start_coords, ST_Point(longitude, latitude) end_coords
from (
   select
     id,dt,longitude, latitude,
     first_value(ST_Point(longitude, latitude)) over (partition by id order by dt) start_coords,
     row_number() over (partition by id order by dt desc) lnr 
  from uber 
  ) Sub
Where lnr=1
) trips
where
ST_Intersects(
  ST_Polygon(
   -122.392291,37.616543,
   -122.389115,37.616458,
   -122.389115,37.613552,
   -122.392119,37.613297,
   -122.392291,37.616543
)
, trips.end_coords);

This results in 26 trips:

image

Of course you can do a lot more with this data set and the geospatial functions in Hive, so this post should just give you an idea. Also, you can use visualization tools like PowerView. For example, the following map shows the locations where most of the trips started (red means a lot of trips starting there):

image

Monday, February 2, 2015

PowerBI Dashboard Public Preview

PowerBI

PowerBI Dashboard is a new interactive dashboard and visualization platform that enables end users to select and visualize the information they are most interested in. No support from IT is needed to build the dashboards and, for me, this is the most exciting feature of the new dashboard concept. As I also stated in an older post, dashboards need to be easy to create because the performance indicators you really want to see on a dashboard change over time. You also want the same dashboard to be available on mobile devices as well as on your local workspace. What I like is the separation between visualizations and the dashboard. In fact, you can choose any element of your PowerView sheet, for example, to be displayed as a tile on the dashboard ("pin to dashboard"), showing important information and acting as a link to the full analytical sheet.

I believe that Microsoft follows this approach very consistently with the new PowerBI Dashboard, so it's really worth spending some time on it. Currently it is available as a public preview.

Here is an example of a dashboard that I just created with PowerBI Dashboard using the Tailspin demo data.

image

The complete steps for creating this dashboard are shown in the following video.

PowerBI Dashboard Preview

In the video you’ll see how to

  • upload a PowerView sheet to PowerBI Dashboard
  • pin elements from the PowerView sheet to the dashboard ("pin to dashboard")
  • edit the dashboard on the web or on your mobile device (within PowerBI Dashboard)
  • arrange elements on your dashboard
  • use the natural language question & answer (Q&A) functionality to create new visualizations and new content for your dashboard

Monday, January 26, 2015

T-SQL Last Non Empty

SQL Server 2012 | SQL Server 2014 | PDW/APS 2012

Recently we needed to calculate something like a ‘last non empty’ value in a T-SQL query. This blog post is about the solution we ended up with as an alternative to the classic pattern involving sub-queries.

To illustrate the task let’s first look at some data:

image

The extract shows a contract table with some gaps. The task is to fill the gaps with the last non empty contract of the same id. So, here is the final result:

image

As you can see, apart from the first days for id 2 which don’t have a last value, all gaps have been filled.

In order to fill the gaps using T-SQL window functions, the idea is to calculate, for each null value, the number of rows we need to go back to reach the corresponding last non-empty value. In the following screenshot, I'm showing this value as the last column:

image

For example, for ID 1, date 2014-07-17 we have to go two rows back (to 2014-07-15) to get the last non-empty value. For the first two dates of ID 2 we also have a lag value, but there is no corresponding row. Looking at the lag column, it looks somewhat like a row_number over rows with a contract value of NULL. Actually, looking at ID 2, there may be more than one gap (NULL values), so it's more like a row number over groups of contracts. To determine those groups we need to find the changes of the contract over time. So let's start with this first.

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from  [dbo].[Contracts])

select * from C1 order by ID, Date

image

Using the lag window function, I added a new column 'ContractChange' that is 1 whenever the contract changes and 0 otherwise. The next step is to calculate a running total of this column to build up groups of contracts:

with
C1 as…
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)

select * from C2 order by ID, Date

image

The new column 'ContractGroup' now contains a value that increments whenever the contract changes. We can now calculate a row_number using the ContractGroup column as part of the partition:

with
C1 as…
C2 as…
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select * from C3 order by ID, Date

 

image

And actually, the LastContractLag column is already the offset we need for the lag function to get to the last non-empty value. So here is the final query (including the intermediate calculations from above):

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from  [dbo].[Contracts])
,
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)
,
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select ID, Date, Contract
,iif(Contract is null, lag(Contract,LastContractLag) over (partition by id order by Date),Contract) ContractLastNonEmpty
from C3
order by ID, Date

The output of this query is shown above (final result). And again this is a good example of the power of window functions.

Conclusion

In our situation, this solution performed much better than a sub-query approach, but depending on the table layout and the amount of data, other approaches may still be better, so you may want to try different patterns for solving this problem.
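For comparison, a typical sub-query pattern for this problem looks roughly like the following sketch (against the same Contracts table; not necessarily the exact variant we benchmarked):

select c.ID, c.Date,
  -- fall back to the latest earlier non-null contract of the same id
  isnull(c.Contract,
    (select top 1 c2.Contract
     from [dbo].[Contracts] c2
     where c2.ID = c.ID and c2.Date < c.Date and c2.Contract is not null
     order by c2.Date desc)) ContractLastNonEmpty
from [dbo].[Contracts] c
order by c.ID, c.Date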

Sunday, December 21, 2014

Using OpenStreetMap data in PowerQuery

PowerBI | PowerQuery

This blog post is about using XML data from OpenStreetMap in PowerQuery for example to create a local geocoding database. PowerQuery is just perfect for this purpose as it allows us to interactively shape the input data to the desired format.

First, we need to download some data from OpenStreetMap. You may go to the OpenStreetMap webpage, zoom to the desired area and then hit the export button. Alternatively, download links for full country or region files are provided after hitting 'Export'.

image

The downloaded OSM file is in XML format and starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.3.3 (733 thorn-01.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="
http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">

Now, open Excel and create a new PowerQuery from an XML file:

image

Next, select the downloaded file and hit Enter. PowerQuery loads with the file contents:

image

The data we're interested in is stored in the node table. So we can now click on the 'Table' link in the node column:

image

We now have to expand the ‘tag’ information:

image

Since we’re not interested in all the data from the file, we limit the information to country, city, postcode, street and housenumber by using the filter function:

image

We can now delete all columns but the following:

  • tag.Attribute:k
  • tag.Attribute:v
  • Attribute:id
  • Attribute:lat
  • Attribute:lon

image

image

This is pretty much the data we'd like to obtain from the OSM file. However, we still need to pivot the information from the rows into columns to see city, postcode etc. side by side with the geo coordinates. In order to do so, we select the column tag.Attribute:k and use the Pivot function from the Transform ribbon as shown below:

image

Finally, we can rename the columns and filter for rows where country is not null:

image

The list now contains addresses (city, postcode, street, house number) together with the corresponding geographical coordinates (latitude, longitude).

For your reference, this is the script we created so far:

let
    Source = Xml.Tables(File.Contents("C:\Temp\download.osm")),
    node = Source{0}[node],
    #"Changed Type" = Table.TransformColumnTypes(node,{{"Attribute:id", Int64.Type}, {"Attribute:visible", type logical}, {"Attribute:version", Int64.Type}, {"Attribute:changeset", Int64.Type}, {"Attribute:timestamp", type datetime}, {"Attribute:user", type text}, {"Attribute:uid", Int64.Type}, {"Attribute:lat", type number}, {"Attribute:lon", type number}}),
    #"Expand tag" = Table.ExpandTableColumn(#"Changed Type", "tag", {"Attribute:k", "Attribute:v"}, {"tag.Attribute:k", "tag.Attribute:v"}),
    #"Filtered Rows" = Table.SelectRows(#"Expand tag", each ([#"tag.Attribute:k"] = "addr:city" or [#"tag.Attribute:k"] = "addr:country" or [#"tag.Attribute:k"] = "addr:housenumber" or [#"tag.Attribute:k"] = "addr:postcode" or [#"tag.Attribute:k"] = "addr:street")),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Attribute:visible", "Attribute:version", "Attribute:changeset", "Attribute:timestamp", "Attribute:user", "Attribute:uid"}),
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[#"tag.Attribute:k"]), "tag.Attribute:k", "tag.Attribute:v"),
    #"Renamed Columns" = Table.RenameColumns(#"Pivoted Column",{{"Attribute:id", "id"}, {"Attribute:lat", "latitude"}, {"Attribute:lon", "longitude"}, {"addr:city", "city"}, {"addr:postcode", "postcode"}, {"addr:street", "street"}, {"addr:country", "country"}, {"addr:housenumber", "housenumber"}}),
    #"Filtered Rows1" = Table.SelectRows(#"Renamed Columns", each ([country] = "DE"))
in
    #"Filtered Rows1"

The result can then be loaded to a PowerPivot data model (click on File, then Load to…):

image

After loading the data into the model, you can for example use PowerMaps to visualize the addresses on a map:

image

Of course, you could have achieved a similar result using the geocoding functionality of PowerMaps. To do so, you simply need to add another column for the full address like this:

image

The result is pretty much the same, but it takes a significant amount of time to geocode many addresses at this level of detail compared to the direct approach from above, which doesn't need to call a geocoding web service because we already provided latitude and longitude.

image

So, if you have to geocode a lot of addresses, the OpenStreetMap XML format may be a lot of help. Using PowerQuery makes it easy to load this data into a PowerPivot data model. For larger regions make sure you have enough memory available on the local machine (and use the 64-bit version of Microsoft Excel).

Sunday, November 30, 2014

SQL Server Data Files on Azure

Azure | SQL Server 2014

With SQL Server 2014 it's easy to move database files to Azure Blob storage even if the SQL Server runs on premise. Azure Blob storage offers reliable, cheap and highly available storage, which could be useful for "cold" data, for example.

However, configuration is a little bit tricky, so I’m going to walk through this process step by step.

1. Create an Azure Blob store account and a container

Log into Azure and create a new storage account. For my example, I’m using “db4” as the name as shown below:

image

Next, I’m going to create a blob store container, which I name “data” here:

image

In order to access the container, we need the URL of the container (https://db4.blob.core.windows.net/data in my example) and the storage key. The key can be obtained by clicking on "Manage Access Keys" at the bottom of the screen:

image

You can copy the key to the clipboard by clicking on the icon right beside the Primary Access Key box.

image

For the next task I’m using Windows Azure Storage Explorer (download here). Here you can add your storage account by pasting the access key into the storage account key input box:

image

 

2. Create a Shared Access Signature for the container

In Azure Storage explorer, select the container (data) and click on ‘Security’:

image

This brings up the following dialog. Make sure to select the list, delete, read and write permissions. After clicking on 'Generate Signature' a shared access signature is created. Copy this signature to the clipboard.

image

 

3. In SQL Server: Create a credential for the blob container

In SQL Server we're using the create credential statement to create a credential for the blob store. Make sure to replace the secret with the shared access signature generated in the last step (I obfuscated the key by overwriting parts of it with 'x'):

CREATE CREDENTIAL [https://db4.blob.core.windows.net/data]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2014-02-14&sr=c&sig=c%2Fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3%3A00%3A00Z&se=2014-12-31T23%3A00%3A00Z&sp=rwdl'

If you like, you can check the credentials with “select * from sys.credentials”:

image

 

4. In SQL Server: Create a database that uses the blob container

The next step is simple. We just need to create a database using the container as its storage:

CREATE DATABASE testdb
ON
( NAME = testdb_dat,
FILENAME = 'https://db4.blob.core.windows.net/data/TestData.mdf' )
LOG ON
( NAME = testdb_log,
FILENAME = 'https://db4.blob.core.windows.net/data/TestLog.ldf')

You can create tables and load data in just the same way as you would do with a local database file. Azure Storage Explorer lists the database files that are created:

image
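As a quick smoke test, you could create a table in the new database and insert a few rows; the table below is just a made-up example:

use testdb

-- hypothetical table, stored in the .mdf file that lives in the blob container
create table dbo.SensorReadings (
ReadingId int identity(1,1) primary key,
ReadingTime datetime2 not null,
ReadingValue decimal(10,2) not null
)

insert into dbo.SensorReadings (ReadingTime, ReadingValue)
values (sysutcdatetime(), 42.0)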

 

5. Optional: Register the blob store in SQL Server Management Studio

You can register the blob store in SQL Server Management Studio by creating a connection to Azure:

image

The “Access Key” is the key we created in the first step and can simply be copied into the account key field:

image

After connecting to the Azure blob store, Management Studio shows our container together with the database files:

image

Of course, when placing database files on Azure, a connection to the blob store is needed. If you don't have this connection, you will not be able to access the database:

image

 

Summary

With SQL Server 2014 it is easy to put data files on an Azure storage account even for an on premise SQL Server. Use cases include

  • store data that is not heavily queried
  • store data that you want to secure in a geo-redundant way
  • enhance the local storage of a SQL Server
  • perform a single table backup to the cloud
  • … and many more