Search This Blog

Tuesday, February 17, 2015

Querying Hadoop from SQL Server

SQL Server 2012 | SQL Server 2014

Microsoft’s Analytics Platform System (APS) offers built-in transparent access to Hadoop data sources through the Polybase technology. This includes bidirectional access not only to Hadoop but also to cloud services. The SMP SQL Server currently doesn’t include Polybase, so access to Hadoop needs to be handled differently. Will Polybase be available in an upcoming SMP SQL Server? In the past we have seen technology make its way from PDW to the SMP SQL Server, for example the clustered columnstore index, the new cardinality estimation or the batch mode table operations. So let’s hope that Polybase makes it into the SMP SQL Server soon. Until then, one option is to use the Hortonworks ODBC driver and a linked server. To be honest, Polybase is a much more powerful technology, since it uses cost-based cross-platform query optimization, which includes the ability to push down tasks to the Hadoop cluster when it makes sense. Also, Polybase doesn’t rely on Hive but accesses the files directly in parallel, thus giving great performance. Linked servers are less powerful but may still be useful in some cases.

So, here we go. First, you need to download the ODBC driver from the Hortonworks add-ons page:

Make sure you pick the right version (32-bit/64-bit) for your operating system. After the installation completes, we need to set up an ODBC connection. To do so, start the ODBC Data Source Administrator (Windows+S, then type ‘ODBC’). Again, make sure to start the correct version (32-bit/64-bit). The installer has already created a connection, but you still need to supply the connection properties. I created a new connection instead:


I’m connecting to the Hortonworks Sandbox here (HDP 2.1; I had problems connecting to HDP 2.2 with the current version of the ODBC driver). Instead of the host name you can also enter the IP address. However, in order to get other tools running as well (like the Redgate HDFS Explorer), I configured the sandbox virtual machine to run on a bridged network and put the bridged network IP address of the sandbox (console command “ip addr”) into my local hosts file.

You should now click on Test to verify that the connection actually works:


In SQL Server Management Studio we can now create a linked server connection to the Hadoop system using the following command:

EXEC master.dbo.sp_addlinkedserver
    @server = N'Hadoop',
    @srvproduct = N'HIVE',
    @provider = N'MSDASQL',
    @datasrc = N'Hadoop',  -- the Data Source Name of the ODBC connection created above
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hue;'

Depending on your Hadoop security settings, you might need to provide a password in the provider string as well. The @server name is used to refer to the linked server later, while @datasrc names the ODBC connection (see “Data Source Name” in the configuration dialog of the connection above).
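If you need to change these settings later, you can remove the linked server and recreate it; sp_dropserver with the droplogins option is the standard way (using the same server name as above):

```sql
-- remove the linked server (and any login mappings) before recreating it
EXEC master.dbo.sp_dropserver @server = N'Hadoop', @droplogins = 'droplogins';
```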

With the new linked server, we can now explore the Hive database in Management Studio:


In order to run a query on, for example, table “sample_07”, you can use one of the following commands:

select * from openquery (Hadoop, 'select * from Sample_07')


select * from [Hadoop].[HIVE].[default].[sample_07]

For both queries, “Hadoop” refers to the name of the linked server (@server parameter in the SQL statement from above).
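Note that with openquery the inner statement is executed on the Hive side, so a filter placed there can reduce the amount of data transferred to SQL Server. A sketch, assuming sample_07 has its usual salary column:

```sql
-- the filter is evaluated by Hive; only matching rows travel to SQL Server
select * from openquery(Hadoop, 'select code, description, salary from sample_07 where salary > 50000')
```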

If you get the following error message, this means that you are not allowed to query the table:

OLE DB provider "MSDASQL" for linked server "Hadoop" returned message "[Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: HiveAccessControlException Permission denied. Principal [name=hue, type=USER] does not have following privileges on Object [type=TABLE_OR_VIEW, name=default.sample_07] : [SELECT]'.".
Msg 7306, Level 16, State 2, Line 1
Cannot open the table ""HIVE"."default"."sample_08"" from OLE DB provider "MSDASQL" for linked server "Hadoop".

In this case, you should simply give the user from your ODBC connection the SELECT privilege. To do so, run the following query in Hive:

grant select on sample_07 to user hue;
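To verify the privilege afterwards, Hive’s SQL standard-based authorization offers a SHOW GRANT statement (a sketch; the exact syntax depends on the authorization mode configured on your cluster):

```sql
-- list the privileges user hue holds on the table
show grant user hue on table default.sample_07;
```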

That’s it. You should now get the contents of the table in SQL Server:


You might want to set the length of string columns manually because Hive does not return the size of a string column (in Hive, the column type is simply “string”). The size you see in the query results comes from the advanced ODBC settings of our connection. I left everything at its default here, so this is how it looks:


So, the default string column length is 255 here. Let’s check and copy the data over to SQL Server:

select * into sample_07 from [Hadoop].[HIVE].[default].[sample_07]

The resulting table looks like this:


To have more precise control over the column lengths, you should use the convert function here, for example:

select
    convert(nvarchar(50), [code]) [code],
    convert(nvarchar(80), [description]) [description]
from [Hadoop].[HIVE].[default].[sample_07]

Be careful with the remaining settings in the advanced options dialog. For example, checking “Use native query” means that you pass the query (openquery syntax) to Hive as it is. This can be intended, to fully leverage specific features of Hive, but it can also lead to errors if you’re not familiar with the HiveQL query syntax. Also, to get better performance with larger tables, you might want to adjust the “Rows fetched per block” option to a larger value.
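For example, with “Use native query” enabled, HiveQL-specific constructs can be passed through unchanged (a sketch; limit is HiveQL, not SQL-92):

```sql
-- the HiveQL limit clause is passed through to Hive unmodified
select * from openquery(Hadoop, 'select code, description from sample_07 limit 100')
```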

With HDP 2.2 you should also be able to write to a table (create a new table, grant all permissions and run an insert into), but I couldn’t get this to work on my HDP 2.1 machine.
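Untested on my machine for the reason above, but the write path would look roughly like this (the table name sample_copy is made up for illustration):

```sql
-- in Hive: create the target table and grant permissions first
--   create table sample_copy (code string, description string);
--   grant all on table sample_copy to user hue;

-- then, in SQL Server: insert through the linked server
insert into [Hadoop].[HIVE].[default].[sample_copy]
select [code], [description] from [Hadoop].[HIVE].[default].[sample_07];
```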


Until Polybase makes it into the SMP SQL Server product, Hadoop data may be queried from SQL Server using the ODBC driver and a linked server object. This could also be an option for Analysis Services to connect to Hadoop by using SQL Server views via the linked server, since Analysis Services doesn’t support ODBC in multidimensional mode. However, Polybase on the APS gives much better performance because of its intelligent cross-platform query optimizer, and Polybase can also be used to write data to Hadoop, so I hope we’ll see this technology in the SMP SQL Server soon.

Tuesday, February 10, 2015

Hive file format comparison

Apache Hive 0.11+

In this post I’d like to compare the different file formats for Hive, as well as the execution times for queries depending on the file format, compression and execution engine. As for the data, I’m using the Uber data set that I also used in my last post. I’m aware that the following query stats are not exhaustive and you may get different results in your environment or with other table formats. Also, you could try different serdes for Hive as well as other compression settings. Still, it gives you an idea that both the file format and the execution engine play an important role in Hive’s query performance. However, when choosing a file format, you should also consider data management aspects. For example, if you receive your source files in CSV format, then you will likely process the files in this format at least during the first processing step.

As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here is the result for the file formats I tested:


Here is some information about the different file formats being used here:

File format | Description
textfile    | separated text file (for example tab-separated fields)
rcfile      | internal Hive format (binary)
orc         | columnar storage format (highly compressed, binary; introduced with Hive 0.11)
parquet     | columnar storage format (compressed, binary); supported by a plugin since Hive 0.10 and natively in Hive 0.13 and later
avro        | serialization file format from Apache Avro (contains schema and data, tools available for processing); supported in Hive since version 0.9

For Table create/Write Time I measured a “create table as select” (CTAS) into the specific format. As you can see, the resulting size of the table depends a lot on the file format. The columnar Orc file format compresses the data in a very efficient way:

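The CTAS statements used for these measurements follow this pattern (the table and source names are just placeholders from my setup; the other formats work analogously with stored as textfile / rcfile / parquet / avro):

```sql
-- create a copy of the source table in ORC format
create table uber_orc stored as orc as select * from uber;
```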

Using Tez as the execution engine (set hive.execution.engine=tez;) results in much better performance compared to map reduce (set hive.execution.engine=mr;). The total time for the two queries is shown in this table:


In map reduce mode, query time does not seem to depend too much on the file format being used:


However, when running the queries on Tez, you’ll see a significant difference between file formats like Parquet and Orc (with Orc being about 30% faster).


Tuesday, February 3, 2015

Geo spatial data support for Hive

Hive 0.1x

Apache Hive doesn’t have support for geo spatial functions out of the box. However, it’s easy to extend the functionality with external libraries that can be called via user-defined functions (UDFs). This post shows how to use the ESRI Hadoop libraries in Hive. I’m using the Hortonworks HDP sandbox here, so it’s easy to reproduce the examples below.



In order to use the libraries, they first need to be installed. I’m using the following script to download, build and deploy the libraries to the Hadoop file system. Please note that you may have to adjust the build number of the jar files. You will also need Maven to build the libraries, so install it prior to running the following commands:

# create working directories
cd /tmp
mkdir gis
cd gis
mkdir framework
mkdir api

# download and build framework
# (download URL assumed: the GitHub master archive of the ESRI repository)
cd framework
wget https://github.com/Esri/spatial-framework-for-hadoop/archive/master.zip -O master
unzip master
cd spatial-framework-for-hadoop-master
mvn clean package
hadoop fs -mkdir -p /user/esri
hadoop fs -put hive/target/spatial-sdk-hive-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put json/target/spatial-sdk-json-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put hive/function-ddl.sql /user/esri

# download and build api
cd /tmp/gis/api
wget https://github.com/Esri/geometry-api-java/archive/master.zip -O master
unzip master
cd geometry-api-java-master
mvn clean package
hadoop fs -put target/esri-geometry-api-1.2.jar /user/esri

# clean up
cd /tmp
rm -rf gis

If successful you should see the following files in your Hadoop file system, directory /user/esri:



Making the ESRI functions available in Hive

After building the ESRI libraries, we still need to declare the new user-defined functions within Hive. First we start with the jar files:

add jar hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/spatial-sdk-json-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/esri-geometry-api-1.2.jar;

Running a “list jars” in Hive shows the libraries being loaded successfully:


Please note that you can add the add jar commands to the .hiverc file. You can find more details on that file in this blog post.

Next we do have to declare the spatial functions. You will find the full declaration for all functions in the function-ddl.sql file that I also placed in hdfs under /user/esri (as shown in the screenshot from above) using the build script. However, for this post, I’m only using a few functions with the following declarations:

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
create temporary function ST_Length as 'com.esri.hadoop.hive.ST_Length';
create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';


Starting with Hive 0.13 you can also declare these functions permanently by omitting the “temporary” keyword from the declarations above. This has the advantage that you do not need to re-declare the functions for every session. You can also include the jar file in the create function statement, which makes it easier to create a permanent declaration.

For example, for the definition of the ST_Point function you would write the following SQL statement:

create function ST_Point as 'com.esri.hadoop.hive.ST_Point' using jar 'hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar';
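To check which of the declarations are in place, Hive can list functions by pattern (a sketch; the pattern is a Java regular expression):

```sql
-- list all registered functions starting with st_
show functions "st_.*";
```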

Using the ESRI functions in Hive

In order to demonstrate the ESRI geo spatial functions I’m using the Uber data set with 25,000 Uber routes from San Francisco.

The first rows of the data are looking like this:


Each trip has an id (1 to 25,000) together with a timestamp and the latitude/longitude of each waypoint. For example, using window functions we can easily calculate the total time for each trip (by using first_value to get the starting time and restricting the result set to the last entry per trip):

-- get trip duration
select id, unix_timestamp(dt)-unix_timestamp(dt_first) trip_duration
from (
  select id, dt,
    first_value(dt) over (partition by id order by dt) dt_first,
    row_number() over (partition by id order by dt desc) lnr
  from uber
) Sub
where lnr=1;
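Since only the first and last timestamp per trip are needed here, a plain aggregation works as well and avoids the window functions entirely (an alternative sketch on the same uber table):

```sql
-- trip duration via simple min/max aggregation
select id, unix_timestamp(max(dt)) - unix_timestamp(min(dt)) trip_duration
from uber
group by id;
```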



Ok, so far we did not use any of the geo spatial functions we declared above. Let’s say we’d like to calculate the distance and average speed for each trip. The idea is to calculate the distance between consecutive GPS points along the trip. Therefore I’m using window functions again to retrieve the preceding GPS coordinates along the way, as shown below:

select id, dt, latitude, longitude,
  lag(latitude,1) over (partition by id order by dt) prev_latitude,
  lag(longitude,1) over (partition by id order by dt) prev_longitude
from uber;

Looking at a very short trip shows how we now have the preceding coordinates in the same row.


In order to calculate the distance between the coordinates, we first create a line between each pair of consecutive coordinates using the following query (the inner query is exactly the query from above):

select id, dt,
  unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
  latitude, longitude,
  ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from (
  select id, dt, latitude, longitude,
    lag(latitude,1) over (partition by id order by dt) prev_latitude,
    lag(longitude,1) over (partition by id order by dt) prev_longitude
  from uber
) Sub;


The line is not visualized in the query editor, so we only see its binary representation. The next step is a little bit tricky because we need to calculate the length of the line. Here, the function ST_GeodesicLengthWGS84 can be used as shown below (again, the inner query is the previous query). This function returns the distance in meters on a spheroid; however, the correct SRID has to be set beforehand:

select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
from (
  select id, dt,
    unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
    latitude, longitude,
    ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
  from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
) Sub1;


The only thing left is to aggregate the distance and the time difference. Since the time difference is in seconds and the distance is in meters, we can calculate the speed in km/h as distance/1000 * (3600/time_difference). This results in the following query (the result again only shown for trip no 44):

select id, min(dt) start_time, max(dt) end_time, sum(Distance) distance,
  sum(Distance)/1000 * (3600.0/sum(time_passed)) avg_speed
from (
  select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
  from (
    select id, dt,
      unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
      latitude, longitude,
      ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
    from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
  ) Sub1
) Sub2
group by id;
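As a quick sanity check of the formula with made-up numbers: a leg of 1000 m covered in 120 s gives 1000/1000 * (3600/120) = 30 km/h:

```sql
-- expected result: 30.0 km/h
select 1000/1000.0 * (3600.0/120) avg_speed_kmh;
```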



Ok, this looks like a rather short trip, but it still illustrates the idea. Here are the waypoints of this short trip (PowerView):



Now let’s assume we’re interested in all the trips that go to the airport. In order to run this query, we need to define the area of the airport. I’m using Google Earth for this purpose and simply draw a polygon around my target area:


You can then export the polygon to a .kml file. Since I have a very simple polygon here, I transferred the coordinates from the .kml-file manually into my geo spatial polygon type:

[Screenshots: the coordinates in the .kml file and the corresponding ESRI polygon expression]




In order to find all trips that end at the airport, we just need to get the end point of each trip and use ST_Intersects to check whether this end point lies within the polygon. Here is the resulting query:

select count(*) CntTripsToAirport
from (
  select id, start_coords, ST_Point(longitude, latitude) end_coords
  from (
    select id, dt, longitude, latitude,
      first_value(ST_Point(longitude, latitude)) over (partition by id order by dt) start_coords,
      row_number() over (partition by id order by dt desc) lnr
    from uber
  ) Sub
  where lnr=1
) trips
-- the ... stands for the polygon coordinates taken from the .kml file above
where ST_Intersects(ST_Polygon(...), trips.end_coords);

This results in 26 trips:


Of course you can do a lot more with this data set and the geo spatial functions in Hive, so this post should just give you an idea. You can also use visualization tools like PowerView. For example, the following map shows the locations where most of the trips started (red means many trips start here):


Monday, February 2, 2015

PowerBI Dashboard Public Preview


PowerBI Dashboard is a new interactive dashboard and visualization platform that enables end users to select and visualize the information they are most interested in. No support from IT is needed to build the dashboard, and for me this is the most exciting feature of the new dashboard concept. As I also stated in an older post, dashboards need to be easy to create, as the performance indicators that you really want to see on a dashboard change over time. You also want the same dashboard to be available on mobile devices as well as on your local workspace. What I like is the separation between visualizations and the dashboard: in fact, you can choose any element of, for example, your PowerView sheet to be displayed as a tile on the dashboard (“pin to dashboard”), showing important information and acting as a link to the full analytical sheet.

I believe that Microsoft follows this approach very consistently with the new PowerBI Dashboard, so it’s really worth spending some time on it. Currently it is available as a public beta.

Here is an example of a dashboard that I just created with PowerBI Dashboard using the Tailspin demo data.


The complete steps for creating this dashboard are shown in the following video.

PowerBI Dashboard Preview

In the video you’ll see how to

  • upload a PowerView sheet to PowerBI Dashboard
  • put elements from the PowerView sheet to the dashboard (“pin to dashboard”)
  • edit the dashboard in the web or on your mobile device (within PowerBI Dashboard)
  • arrange elements on your dashboard
  • use natural question & answers functionality to create new visualizations and new content for your dashboard