
Sunday, September 29, 2013

PDW, Integration Services and Resource Classes

PDW v1 | PDW 2012 | SQL Server 2008-2012

When using the Parallel Data Warehouse (PDW) as a data source for SSIS (Integration Services) packages, some pitfalls have to be avoided. This post is about one of them that I recently saw.

But first some background information about query slots on the PDW: PDW v1 used 32 query slots for queries targeting user tables (more for system tables), and each running query takes exactly one of those slots. This guarantees that no single query can consume too many resources of the appliance, which means that even under heavy query load, the appliance keeps responding on the other query slots as usual. This is one of the reasons why the PDW consistently answers queries with high performance, even when the appliance is under heavy workload. If all 32 query slots are in use, newly arriving queries are suspended, up to a certain limit that depends on the version of the PDW.

However, in some cases we want to assign more resources to a query, for example for large data processing operations. For this reason, PDW 2012 introduced the concept of resource classes. Here are the four resource classes that exist on PDW 2012:

Name                Query slots   Max. memory
smallrc (default)   1             0.4 GB
mediumrc            3             1.2 GB
largerc             7             2.8 GB
xlargerc            22            8.4 GB

The total number of 32 query slots is still valid for PDW 2012, so to give some examples, PDW 2012 can simultaneously execute

  • 32 queries in resource class smallrc
    or
  • 3 queries in resource class largerc (taking 21 slots) and 11 queries in resource class smallrc
    or
  • 1 query in resource class xlargerc, one in resource class largerc and one in resource class mediumrc
    and so on

 

The resource classes are assigned to the login. By default, newly created logins are in the smallrc resource class.
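
For example, creating a dedicated ETL login and assigning it a higher resource class could look like the following sketch. The login name and password are hypothetical and I assume SQL authentication here; the ALTER SERVER ROLE syntax is the same one used later in this post.

-- create a dedicated login for the ETL workload (login name and password are placeholders)
CREATE LOGIN etl_loader WITH PASSWORD = 'Str0ng!Passw0rd';

-- newly created logins start in smallrc; move the login to mediumrc
ALTER SERVER ROLE mediumrc ADD MEMBER etl_loader;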

The approach for handling queries on the PDW works perfectly for the usual workload. Even if queries are suspended because all 32 query slots are currently in use, they will run as soon as a query slot becomes available again. However, if the client application opens more than one connection at a time and keeps them open, you have to be careful with the resource classes.

Here is a simple example using SQL Server Integration Services. My data flow just consists of two independent data sources with no further processing. I used the FactInternetSales table from AdventureWorksPDW2012, so the number of rows is quite small. As expected, the package runs perfectly well:

[Screenshot: data flow with two independent OLEDB sources executing successfully]

Now, let’s assume we want the package to execute faster and therefore we assign a higher resource class. Let’s try the largerc class here:


alter server role largerc add member testuser;

The members of the resource classes can be listed using this query taken from the documentation (chm-file, search for meta data queries):

SELECT l.name AS [member], r.name AS [server role]
FROM sys.server_role_members AS rm
JOIN sys.server_principals AS l
  ON l.principal_id = rm.member_principal_id
JOIN sys.server_principals AS r
  ON r.principal_id = rm.role_principal_id
WHERE l.[type] = 'S'
  AND r.[type] = 'R'
  AND r.[name] IN ('mediumrc', 'largerc', 'xlargerc');

Here’s the result:

[Screenshot: query result showing testuser as a member of the largerc server role]

The package still runs perfectly fine. Note that the two queries in resource class largerc take 2 x 7 = 14 query slots.

Now let’s try the xlargerc resource class by adding the user to this role:

alter server role largerc drop member testuser;
alter server role xlargerc add member testuser;

The query from above shows the user now being member of the xlargerc server role:

[Screenshot: query result showing testuser as a member of the xlargerc server role]

Now, when executing the package in SQL Server Data Tools, you will notice that nothing happens. The package passes validation, but no data is transferred:

[Screenshot: the data flow in SQL Server Data Tools with no rows being transferred]

Checking the SSIS progress reveals that the package stalls in the pre-execute phase of the data flow.

[Screenshot: SSIS progress log showing the data flow stuck in the pre-execute phase]

SSIS tries to open two queries with resource class xlargerc while the PDW allows only one of them at a time (as 2 x 22 is greater than 32). Therefore the second query is suspended, waiting for the first one to finish. But this will never happen, because SSIS keeps both connections open and waits for both queries during the pre-execute phase. You can clearly see this using the following meta data query:

SELECT request_id, concurrency_slots_used AS [Slots Needed], resource_class AS [Resource Class]
FROM sys.dm_pdw_resource_waits;

[Screenshot: result of the sys.dm_pdw_resource_waits query showing the waiting requests]

Here, query QID15444 is the meta data query itself. So there are two queries that need resource class xlargerc: QID15433 and QID15442. You can also see them in the admin console:

[Screenshot: admin console showing the two suspended xlargerc queries]

Looking at the query details of one of the two queries shows that it runs in resource class xlargerc and consumes 22 slots:

[Screenshot: query details in the admin console showing resource class xlargerc and 22 concurrency slots]

Again, moving the user to a lower resource class (small/medium/large) would solve the issue here.

In general, all connections within a single data flow of SSIS are opened together during the pre-execute phase of the data flow. For example, the same behavior would occur with an OLEDB source and a lookup:

[Screenshot: data flow with an OLEDB source and a lookup transformation]

Again, the package stalls during the pre-execute phase:

[Screenshot: SSIS progress log showing the package stalled in the pre-execute phase]

So, be careful when assigning higher resource classes to logins used for client tools that need multiple queries returning results simultaneously. To make it clear: it’s not a problem if multiple queries are executed in parallel, as long as the client tool does not stop reading from one query when the other doesn’t return data (yet). In my example the reason for the ETL package to stall was that SSIS expects both queries to return their result sets at the same time. If one of the queries is suspended, the other remains open and SSIS keeps waiting. So, for client tools like SSIS that work with the data in a data flow pipeline, such dependencies can lead to deadlock situations if no more query slots are free.
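
If you suspect such a deadlock, a quick check is to combine the resource waits shown above with the list of running requests, so you can see which statements are waiting for slots. This is just a sketch based on the sys.dm_pdw_resource_waits query used above and the sys.dm_pdw_exec_requests DMV; verify the exact column names against the documentation of your appliance version.

-- show waiting requests together with their statement text and current status
SELECT w.request_id,
       w.concurrency_slots_used AS [Slots Needed],
       w.resource_class         AS [Resource Class],
       r.[status],
       r.command
FROM sys.dm_pdw_resource_waits AS w
JOIN sys.dm_pdw_exec_requests AS r
  ON r.request_id = w.request_id;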

As long as each data flow of the package opens no more connections than the specific resource class allows, the data flow executes fine. For example, here are 16 data flows, each using a single OLEDB connection:

[Screenshot: package with 16 data flows, each using a single OLEDB connection]

The reason why this works is that each data flow performs the pre-execute phase independently. When setting the user to resource class xlargerc (consuming 22 slots per query), you can see how the data flows run sequentially (taking about 12 seconds in my example). With the default resource class smallrc, the queries run in parallel (taking about 2.5 seconds in my example). The following video shows the execution for both resource classes:

[Video: Parallel data flows compared in resource class xlargerc and smallrc]

 

Be aware, though, that this situation can also happen with resource class smallrc and 32 query slots, as ETL packages that perform complex operations with many merge joins and lookups can also reach the limit of the query slots and stall because of a deadlock within the ETL package. Usually, these packages would run perfectly fine and with good performance on an SMP SQL Server machine but are throttled once the data source is moved to a PDW. So you might also see this situation on a PDW v1 if you’re working with more complex ETL packages. The only solution is to re-engineer the package, taking some of the following ideas into account:

  1. Avoid operations where many data connections are used in a single data flow, for example by combining them into a single query that performs the join/processing on the PDW (see the sketch after this list).
  2. Try splitting single data flows with many connections into separate data flows with fewer connections.
  3. In general, reduce the number of open queries that need to be read in parallel if the output streams depend on each other, for example by using intermediate caching of the data (for example, the cache destination).
  4. Use the smallrc resource class on PDW 2012 for the ETL accounts if you’re experiencing SSIS deadlocks.
  5. In general, re-design the data integration processes to move more workload from ETL to the data warehouse appliance (consider using ELT instead of ETL).
  6. Also be aware of packages that are executed in parallel. Having multiple packages running simultaneously may reduce the available query slots for each package significantly.
  7. And finally: create a style guide with rules for how many connections each package is allowed to open in parallel, and document the number of connections required per package.
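
As an example for the first idea, instead of joining two OLEDB sources with a merge join inside SSIS, a single source query can push the join down to the appliance. The following is only a sketch against AdventureWorksPDW2012; the column names are assumptions based on the usual AdventureWorks DW schema.

-- one OLEDB source with a single query instead of two sources plus a merge join in SSIS;
-- the join and the filter run on the PDW and only one query slot is used per data flow
SELECT s.SalesOrderNumber,
       s.OrderDateKey,
       s.SalesAmount,
       d.CalendarYear
FROM FactInternetSales AS s
JOIN DimDate AS d
  ON d.DateKey = s.OrderDateKey
WHERE d.CalendarYear = 2013;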

 

When reading this list of options it becomes clear that migrating an existing data warehouse project, with maybe hundreds of ETL packages, from an SMP environment to an MPP appliance is a challenge, as most of these ideas contradict the design principles of SMP warehouses. For example, in an SMP environment, putting workload into the ETL pipeline is often more efficient than doing the job in the database. And ETL packages are often easier to maintain than SQL scripts.

Sunday, September 22, 2013

Power Query data loading example

Power Query Preview | Hadoop

In order to make the common task of data acquisition and preparation available to end users, Microsoft developed a very powerful enhancement for Excel: Power Query. Power Query is part of the new Power BI family of tools, and there is a lot of information out on the web showing what you can do with Power Query and the other Power BI tools. Currently, a preview release of Power Query is available as an Excel add-in for testing.

Some cool things you can do with Power Query

  • Load data from different sources including relational databases, csv files, Excel files, web pages, OData feeds, Hadoop, even from Active Directory, SharePoint lists or from Facebook.
    For example, you can load all files in a given directory into a single Power Pivot or Excel table
  • Perform transformation rules on the columns, for example split, format, replace, group by, unpivot – using a comprehensive GUI (no coding or writing of formulas needed)
  • Write complex formulas, for example you can use the Table.FillDown function to fill null values in the table with values from the last preceding row that is not null. There is a large number of powerful functions for more advanced users.
  • Load the resulting data set to Excel and/or to a Power Pivot data model for further analysis.

 

A good starting point is the series of blog posts by Chris Webb:
http://cwebbbi.wordpress.com/category/power-query/

And of course you should check the product web site which you can find here:
http://office.microsoft.com/en-us/excel-help/microsoft-power-query-for-excel-help-HA104003813.aspx

You will find useful specification documents (like the full formula language specification) here:
http://office.microsoft.com/en-us/excel-help/learn-about-power-query-formulas-HA104003958.aspx

There are also a lot of cool videos on YouTube showing examples of Power Query, and you can find many interesting blogs about the product and its future in Office 365, as well as tutorials and use cases.

For today’s post, let me show an example of loading data with Power Query. For this example, I’m using a text file with weblog data (W3C format) stored on Hadoop. First, let’s check the Hadoop file system:

[Screenshot: Hadoop file system listing showing the weblog file]

The file is about 500 MB in size. Let’s load it into a Power Pivot data model using Power Query. The first step is to open Excel and, from the Power Query ribbon, choose “From Other Sources” / “From Hadoop File (HDFS)”. You only have to make sure that your Hadoop cluster has WebHDFS enabled.

[Screenshot: Power Query ribbon, “From Other Sources” menu with the Hadoop File (HDFS) option]

In the following dialog I enter the address of my WebHDFS folder:

[Screenshot: dialog for entering the WebHDFS folder address]

Power Query fetches a list of all the files in my HDFS that are located under this directory (recursively):

[Screenshot: list of files Power Query found in the HDFS directory]

If you like, you can click the arrow symbol in the header of the Attribute column to include more information (for example, the file size). But since we already know which file we want to pick, we can simply click the “Binary” link in the first row of the table.

Power Query comes back with a sample of the first 100 rows:

[Screenshot: preview of the first 100 rows of the weblog file]

We will use this sample to build our transformations. First, the leading rows starting with a hash sign need to be removed. The easiest way to filter out rows in Power Query is to use the column filter, so we click on the column filter and select the text filters:

[Screenshot: column filter menu with the available text filters]

Unfortunately, there is a filter for “begins with…” but not for “does not begin with…”. Anyway, we use the “begins with…” filter here, enter the hash sign and click OK. Since there are only a few rows starting with a hash sign and Power Query shows the first 100 rows resulting from the query, this query would take some time to finish (the complete file is scanned). We don’t have that much time, so we interrupt the query by clicking

[Screenshot: cancelling the running query]

We can now edit the formula that was generated by Power Query and negate the boolean operation as shown below:

Before: = Table.SelectRows(ImportedText, each Text.StartsWith([Column1], "#"))
After:  = Table.SelectRows(ImportedText, each not Text.StartsWith([Column1], "#"))

This results in the rows that we need for our data load:

[Screenshot: preview showing only the data rows, without the header lines]

The web log file is a delimited file using space as the delimiter, so we right-click the column and choose “Split Column…”, then “By Delimiter”. Doing a quick count of the spaces shows that we can go for 15 spaces:

[Screenshot: “Split Column by Delimiter” dialog]

Column 14 is the user agent. In order to get this column properly, we split it at the + sign and then remove the last two columns of the dataset.

[Screenshot: splitting the user agent column at the “+” sign and removing the last two columns]

The next step is to rename all the columns based on the W3C specification for the web log. We can also rename the query to “Weblog”.

After clicking OK, the query definition window closes and Excel starts to load the data:

[Screenshot: Excel loading the query results]

Since this is a large data set, we don’t want to load it into an Excel table (which is enabled by default). So, without waiting for the data set to be fully loaded, we set the “Load to worksheet” switch to Off, causing the resulting data to disappear from Excel:

[Screenshot: the “Load to worksheet” switch set to Off]

Excel continues the load but the data goes nowhere. You still need to wait until the load is complete (data fully checked).

[Screenshot: load progress while the data is being checked]

After the load is complete, the link “Load to data model” becomes available. By clicking this link, the data is transferred to a Power Pivot data model:

[Screenshot: the “Load to data model” link transferring the data to Power Pivot]

As you can see, the full data set of 2.9 million rows is now loaded into Power Pivot. From here, we can create a pivot table to analyze the data, for example by status code:

[Screenshot: pivot table analyzing the requests by status code]

So, with just a few clicks, we were able to load and shape the weblog file from Hadoop into a Power Pivot data model for analysis. Saving the Excel file results in a file of about 34 MB, a compression ratio of approximately 1:14 compared with the original file stored on Hadoop.

Sunday, September 15, 2013

Distributed Partitioned Views: MPP light?

SQL Server 2008-2012

In the MPP approach we distribute data over multiple servers, so that a complex query can be divided into several queries that each handle less data and run in parallel. This is what gives the MPP solution its speed. Each server processes a certain part of the query independently from all the other servers. We call this a shared-nothing infrastructure (see my former post for details).

A similar concept is the distributed partitioned view in the SMP SQL Server. A partitioned view is simply a union over tables of the same layout, usually created as in the following example:

CREATE VIEW Sales AS
SELECT * FROM Sales2013
UNION ALL SELECT * FROM Sales2012
UNION ALL SELECT * FROM Sales2011
UNION ALL SELECT * FROM Sales2010

In this example we distributed the sales data across four tables (sales by year).

A partitioned view is updatable if

  • the partitioning column is part of the primary key of the underlying tables, and
  • check constraints on the partitioning column make the tables’ ranges disjoint, so every row belongs to exactly one of the underlying tables.

If these two conditions are not met, you could still create an instead-of trigger to perform the update, but since we want to use this method to improve performance, the instead-of trigger might not be a good idea.
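
For completeness, here is a minimal sketch of what such an instead-of trigger could look like for inserts on the view above, routing each row to its yearly table. The trigger name is hypothetical, it handles inserts only, and updates/deletes would need similar logic.

CREATE TRIGGER trg_Sales_Insert ON Sales
INSTEAD OF INSERT
AS
BEGIN
  -- route each inserted row to the underlying table that matches its order year
  INSERT INTO Sales2013 SELECT * FROM inserted WHERE OrderYear = 2013;
  INSERT INTO Sales2012 SELECT * FROM inserted WHERE OrderYear = 2012;
  INSERT INTO Sales2011 SELECT * FROM inserted WHERE OrderYear = 2011;
  INSERT INTO Sales2010 SELECT * FROM inserted WHERE OrderYear = 2010;
END;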

The check-constraint on the underlying tables could be something like this:

CREATE TABLE Sales2013 (
OrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NULL CHECK (DATEPART(yy, OrderDate) = 2013),
OrderYear INT CHECK (OrderYear = 2013),

CONSTRAINT OrderIDYear PRIMARY KEY(OrderID, OrderYear)
)

Up to now, this has nothing to do with an MPP approach. But now we will look at a distributed partitioned view. A partitioned view is called distributed if the underlying tables reside on different servers, which can be achieved by using linked servers. So, imagine we have the four tables above on four servers (Server1 to Server4). Then the partitioned view may look like this:

CREATE VIEW Sales AS
SELECT * FROM Server1.MyDatabase.dbo.Sales2013
UNION ALL SELECT * FROM Server2.MyDatabase.dbo.Sales2012
UNION ALL SELECT * FROM Server3.MyDatabase.dbo.Sales2011
UNION ALL SELECT * FROM Server4.MyDatabase.dbo.Sales2010

In this situation, queries to the Sales view can be distributed to the four servers with each server solving a part of the query.
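
As a side note, each remote server referenced in the view has to be registered as a linked server first. A minimal sketch for one of them could look like this; the provider, login mapping and password are placeholders and depend on your environment.

-- register Server1 as a linked server (repeat accordingly for Server2 to Server4)
EXEC sp_addlinkedserver
     @server     = N'Server1',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = N'Server1';

-- map the local login to a remote login on the linked server (credentials are placeholders)
EXEC sp_addlinkedsrvlogin
     @rmtsrvname  = N'Server1',
     @useself     = N'FALSE',
     @rmtuser     = N'dpv_user',
     @rmtpassword = N'...';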

Having a distribution by year is actually a bad idea, unless the typical query load is to compare full years. Usually we want to choose a partition key that guarantees a more even distribution over the servers. One option could be the OrderID, for example OrderID % 4 = 0 for server 1, OrderID % 4 = 1 for server 2, and so on.
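
As a sketch, the table hosted on server 1 under this modulo scheme could look like the following (column list shortened, names hypothetical). Whether the optimizer can actually use such a modulo check constraint for partition elimination, and whether the view remains updatable, depends on the exact constraint form, so treat this as an illustration rather than a recipe.

-- distribution table on server 1: only rows with OrderID % 4 = 0 are allowed here
CREATE TABLE Sales_D0 (
  OrderID    INT NOT NULL CHECK (OrderID % 4 = 0),
  CustomerID INT NOT NULL,
  OrderDate  DATETIME NULL,

  CONSTRAINT PK_Sales_D0 PRIMARY KEY (OrderID)
);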

Now, with this we’re getting pretty close to what the PDW does, aren’t we? Well, let’s look at the differences in detail:

 

Maintenance

  • Distributed partitioned view: Rather difficult. Linked servers must be created, the constraints have to be defined manually, and each distribution table has to be created separately. Changes have to be kept consistent over all distributions manually.
  • PDW: Easy. Only one create/alter-table statement with a distribution option; the PDW takes care of all the remaining tasks.

Scalability

  • Distributed partitioned view: Adding another server makes it necessary to manually redistribute the data and to adjust the distribution key (table definitions). Constraints must be dropped and recreated in order to do this.
  • PDW: Easy. After setting up a new scale unit, the PDW can automatically redistribute the table contents to the available nodes.

High availability

  • Distributed partitioned view: If one of the linked servers fails, queries that touch that distribution will fail. It is difficult to implement high availability in this scenario. One option could be to double the number of servers, so that each server has its own standby server which replicates the data; in case of an error, the distributed partitioned view definition has to be modified in order to switch. But this would be a very inefficient way to implement HA (twice the number of servers). Operating system clustering could be an option, but all of this has to be configured manually.
  • PDW: Out of the box. The PDW automatically switches to another compute node if one fails and can recreate the failing node using a Hyper-V disk image. No single point of failure.

Hash key

  • Distributed partitioned view: The distribution has to be defined in the table definitions. All underlying tables have to be modified if the distribution key changes, and the data has to be re-shuffled manually.
  • PDW: The distribution key only needs to be managed on the control node. Redistribution (based on another key) is easy using a CTAS statement (see the sketch after this comparison).

Statistics

  • Distributed partitioned view: Statistics are very important for query performance. For distributed partitioned views each server maintains the statistics of its own distribution, but there are no complete statistics for the entire distributed view. Therefore the server executing the query does not know the overall statistics and cannot optimize the query.
  • PDW: Statistics are automatically created on the compute nodes and transferred to the control node when statistics are created or updated. Since the query plan is created on the control node, the query optimizer can access statistics for the whole table to build an optimal query plan.

Partitioning

  • Distributed partitioned view: If the underlying tables need to be partitioned using table partitioning, the partitioning (including switch operations, partition merges etc.) has to be done per server.
  • PDW: Partitions of all distributions are managed centrally on the control node. There is no need to execute statements on the compute nodes.

Queries with star joins

  • Distributed partitioned view: Distributed partitioned views can only leverage multiple servers if the distributed query can be executed on each server separately. A simple star join breaks this rule unless you rewrite it from

    inner join DimDate on FactSales.DateKey = DimDate.DateKey
    where DimDate.[Year] = 2013

    to

    where FactSales.DateKey in
      (select DateKey from DimDate where DimDate.[Year] = 2013)

    But this means you have to modify almost every query to leverage the distributed partitioned view.
  • PDW: With the concept of replicated tables, dimensions are usually available on each compute node. The query optimizer fully leverages this, so the query can be distributed with no need to rewrite it. In fact, most queries can be parallelized without any need to modify the query.

Data load

  • Distributed partitioned view: Inserts to the view are slow. Loading a massive amount of data will be fastest with bulk inserts on each server. However, this makes it necessary to split the upload files along the distribution key before doing the bulk upload and to know which servers host which distributions. In case of modifications (for example, if another server is added), the loading routines have to be adjusted.
  • PDW: Bulk loads are automatically distributed in parallel to all the compute nodes, without the need to split the upload files beforehand and without the need to know about the distribution key at all.
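
The CTAS-based redistribution mentioned under “Hash key” could be sketched on the PDW like this. Table and column names are hypothetical, and the statement follows the PDW CREATE TABLE AS SELECT syntax; check the appliance documentation for the exact options available on your version.

-- create a copy of the table distributed on a different hash key
CREATE TABLE FactSales_new
WITH (DISTRIBUTION = HASH(CustomerKey))
AS
SELECT * FROM FactSales;

-- afterwards drop the old table and rename the new one
DROP TABLE FactSales;
RENAME OBJECT FactSales_new TO FactSales;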

So, while distributed partitioned views sound promising, comparing them to the PDW infrastructure reveals a lot of disadvantages and additional tasks required to make them work. The PDW handles all of this in a fully automatic way, thus providing high performance, scalability and high availability in an easy-to-maintain environment. If you try to mimic this functionality with distributed partitioned views, you will most likely end up with a lot of work and a system that is very hard to maintain.

So, to get back to the title of this blog post, we’ll have to say

Distributed Partitioned Views: MPP light? No, not at all. The PDW offers so much more in terms of maintenance and usability that distributed partitioned views should not be considered a replacement for an MPP infrastructure.