

Sunday, April 20, 2014

Parallel Data Warehouse (PDW) and ROLAP – Part 2

PDW 2012

As promised in my previous post about ROLAP on PDW, here are some more tips, tricks and recommendations for using SSAS in ROLAP mode on a PDW source. Please read part 1 first, as most of the tips from that post are not repeated here.

Keep your statistics up to date

Ok, this was also in my last post, but it cannot be said often enough, so here it is again: statistics are crucial for query performance. Make sure you have statistics at least on all columns that participate in join, filter (where clause) or aggregation (group by clause) operations.
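On the PDW, statistics are created with plain DDL. A minimal sketch, with hypothetical table and column names:

```sql
-- create statistics on the columns used in joins, filters and aggregations
CREATE STATISTICS stat_FactSales_DateKey ON dbo.FactSales (DateKey);
CREATE STATISTICS stat_FactSales_StoreKey ON dbo.FactSales (StoreKey);

-- refresh the statistics after larger loads
UPDATE STATISTICS dbo.FactSales;
```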

Clustered Columnstore Index

As I already mentioned, the clustered columnstore index (CCI) is the key to running SSAS in ROLAP mode on the PDW, as it gives very good performance. In fact, all big tables in our data warehouse on PDW are currently in CCI storage. CCI stores the data natively in columnstore compressed format and is therefore able to achieve this outstanding performance. However, it's important to understand how the CCI works. In short, when data is written to the CCI, it is first written internally to a row store table, the so-called delta store, where the data is kept as a row group. From the delta store, the data is compressed by the tuple mover process, either synchronously (when larger batches hit the CCI table) or asynchronously (when smaller inserts cause the current row group to reach its size limit). And since a delta store exists for each distribution and each partition, a lot of rows of a CCI table may actually still be in row store format, either waiting to be compressed (row group in status ‘CLOSED’: waiting for the tuple mover) or waiting for more rows to arrive (row group in status ‘OPEN’). Imagine a two compute node appliance with 100 partitions, giving 2x8=16 distributions and therefore potentially 1600 open delta stores. Each of these may contain uncompressed rows, and there may be many more row groups in status ‘CLOSED’, also containing uncompressed rows.

The more rows are in uncompressed state, the slower queries against the data become. It is therefore very important to carefully monitor your CCI tables for uncompressed rows. You can find a very useful query for this in the metadata section of the help file:

-- show rowgroups and status of a table by partition

SELECT IndexMap.object_id,
object_name(IndexMap.object_id) AS LogicalTableName,
i.name AS LogicalIndexName, IndexMap.index_id, NI.type_desc,
IndexMap.physical_name AS PhyIndexNameFromIMap,
CSRowGroups.*,
100*(total_rows - ISNULL(deleted_rows,0))/total_rows AS PercentFull
FROM sys.objects AS o
JOIN sys.indexes AS i
ON o.object_id = i.object_id
JOIN sys.pdw_index_mappings AS IndexMap
ON i.object_id = IndexMap.object_id
AND i.index_id = IndexMap.index_id
JOIN sys.pdw_nodes_indexes AS NI
ON IndexMap.physical_name = NI.name
AND IndexMap.index_id = NI.index_id
JOIN sys.pdw_nodes_column_store_row_groups AS CSRowGroups
ON CSRowGroups.object_id = NI.object_id
AND CSRowGroups.pdw_node_id = NI.pdw_node_id
AND CSRowGroups.index_id = NI.index_id
WHERE o.name = '<insert table_name here>'
ORDER BY object_name(i.object_id), i.name, IndexMap.physical_name, pdw_node_id;

In general, if a lot of rows are in row groups with status ‘OPEN’ or ‘CLOSED’ you should take action:

Row groups in status ‘CLOSED’: The tuple mover process should compress these row groups (new status ‘COMPRESSED’) in the background. However, if there are a lot of row groups in status ‘CLOSED’, the tuple mover may be behind. To help it along, you can run an alter index … reorganize to compress these row groups. Usually this works pretty fast and the table remains online for queries.
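The reorganize can be sketched like this (the table name is a placeholder):

```sql
-- compress CLOSED row groups; the table remains online for queries
ALTER INDEX ALL ON dbo.FactSales REORGANIZE;
```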

Row groups in status ‘OPEN’: If you find many rows in row groups of status ‘OPEN’, you may consider reducing the number of partitions in your table (if possible) or compressing these row groups manually using an alter index … rebuild. However, the rebuild takes some time and needs an exclusive lock on the table. So if your workload is designed around partition switching, you may want to perform the index rebuild on the source partition before switching it into the final table.
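A sketch of this pattern, with hypothetical staging and target table names and an arbitrary partition number:

```sql
-- compress all row groups of the staging table (exclusive lock, takes a while)
ALTER INDEX ALL ON dbo.FactSales_Load REBUILD;

-- then switch the fully compressed partition into the final table
ALTER TABLE dbo.FactSales_Load SWITCH PARTITION 3 TO dbo.FactSales PARTITION 3;
```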

Either way, you should carefully monitor your CCI tables for the ratio of uncompressed row groups in order to have a good query performance, especially for ROLAP queries.

Dimensions in MOLAP or ROLAP?

Now that we have the fact tables in ROLAP, what about the dimensions? Wouldn't it be nice to have them in ROLAP too? From our practical experience, you should keep all your dimensions in classical MOLAP mode. Only for a very large dimension that is difficult to process would I go for ROLAP. The main reasons are:

First reason: ROLAP dimensions may produce false positives of the duplicate key error at query time. This effect is discussed here and can also be found on TechNet here (search for “false positives”). However, the solution proposed on TechNet is to switch the error configuration to ignore the error:

For a ROLAP dimension, you should disable DuplicateKey errors. This error was enabled by default in AS 2008 to try to help users catch incorrectly defined attribute relationships -- but if you are using ByTable processing or ROLAP storage, then it will only give you false positives.

I don't think this is a safe approach (see my post here on this topic). By having the dimensions in MOLAP mode with duplicate key error detection activated, I can be sure the keys are consistent.


Second reason: From my observation, SSAS generates fewer and better queries if the dimension is in MOLAP mode (the query still goes to the underlying table using SQL). If the dimension is in ROLAP mode, SSAS fires one additional query against the dimension table before going to the fact tables. For MOLAP dimensions we see just one query joining the fact table to the dimension tables.

Which resource class should I use?

Higher resource classes on the PDW (more query slots) usually improve the performance of heavy queries, at the price that fewer queries can run in parallel. But for ROLAP we want many users to execute rather simple queries in parallel. What makes things worse is that some ROLAP operations are resolved using multiple queries for a single pivot table. Our best practical experience was therefore with the smallrc (default) resource class for the SSAS queries.

Take care with non-aggregatable data types

For a MOLAP cube it's fine to have a measure of type datetime (hidden, just for drill through) and set the aggregation function to ‘none’. In ROLAP mode, however, the SSAS engine still tries to fetch a sum for those measures, causing the query to fail because the sum aggregate is not supported for datetime. One solution is to use the aggregation functions min or max in these cases.

Also, take care with measures of type integer. Even if you're only using such a measure as a flag on an intermediate bridge table, the SQL query may fail with an integer overflow error.
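One workaround is to widen such flags before SSAS aggregates them, for example in a named query of the data source view. A sketch with hypothetical table and column names:

```sql
-- named query for the bridge table: sum the flag in 64 bit to avoid the overflow
SELECT CustomerKey, AccountKey,
       CONVERT(bigint, AssignmentFlag) AS AssignmentFlag
FROM dbo.BridgeCustomerAccount;
```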

Use SNAC as database driver

As described before, PDW should be addressed through the simple SNAC driver (SQL Server Native Client) as shown in the screenshot below:



If you get the error “42000, Parse error at line: 1, column: 51: Incorrect syntax near @ResultCode”, you probably have proactive caching set to on, which isn't supported on the PDW yet.


The reason for the error is that, for proactive caching, SSAS tries to execute the stored procedure sp_trace_create on the data source:

DECLARE @ResultCode INT;


EXEC @ResultCode = sp_trace_create @TraceId OUTPUT, 1;

SELECT @ResultCode as ResultCode, @TraceId as TraceId;

Monday, April 7, 2014

Discover missing rows of data

PDW 2012 | SQL Server 2012 | SQL Server 2014

If your source data contains a sequential number without gaps, it's relatively easy to find out if data rows are missing. The approach I'm showing here uses window functions, which are available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.

In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.

SELECT
  SalesOrderNumber,
  convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber
FROM [FactInternetSales]
WHERE [SalesOrderLineNumber]=1
ORDER BY SalesOrderNumber


Usually the order numbers are sequential, but we find some gaps here. For example, the order following order number SO43842 is SO43918, so 43918 – 43842 – 1 = 75 rows are missing.

Using window functions and a sub query, we can add the next number as a column to the query together with the distance:

select *, NextSalesOrderIntNumber-SalesOrderIntNumber-1 MissingRows
from (
  select *,
    lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
  from
  (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
    SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  ) TransactionData
) TransactionDataSequence


As you can see, the 75 missing rows are now being reported correctly by the query. The only task left to do now is to aggregate the amount of missing rows by replacing the outer query like this:

select Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows
from (
  select *,
    lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
  from
  (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
    SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  ) TransactionData
) TransactionDataSequence



As a quality measure, you could show the ratio of missing rows to total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. It's also useful to get the result on a more granular level, for example per month. Here is the full query:


select
  orderdatekey/100 [Month],
  Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows,
  convert(float, Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1)) / Count(*)
    * Sum([ExtendedAmount]) MissingRowsEstimatedValue
from (
  select SalesOrderIntNumber,
    lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,
    [ExtendedAmount], OrderDateKey
  from
  (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
     SalesOrderIntNumber, [ExtendedAmount], OrderDateKey
   FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  ) TransactionData
) TransactionDataSequence
group by orderdatekey/100
order by orderdatekey/100


Plotting the result over time gives a good overview. For my example data, quality has improved a lot since August 2007.


Conclusion: This is another example of how window functions provide an elegant solution for analytical data tasks. And since this works perfectly on a PDW, the approach scales even to billions of rows of data.

Sunday, March 23, 2014

Reading recommendation: “3 Ways to Test the Accuracy of Your Predictive Models”

Data Mining

Imagine being in a data mining project where everyone is enthusiastic about finding hidden patterns and doing some data magic. Now you set up and train your model, and the model gives some results (for example scores for the test dataset). You might feel a little bit lost at that point. Are the results sensible and reliable? Did I really choose a good approach and the right model and algorithm for the problem? The fact is that most mining tools will give you “results” even if your modeling approach has severe flaws; the results simply have no meaning. So, how can you be sure that your model works fine?

The situation reminds me a little bit of Douglas Adams' Hitchhiker's Guide to the Galaxy: you finally got an answer, but you are no longer sure what it means. What options do you have to assure yourself (and your stakeholders) that business decisions can be based upon the results of that mining model?

Victoria Garment from Software Advice, a website that researches business intelligence systems, gathered methods used by data mining professionals Karl Rexer, Dean Abbott and John Elder to test and validate the accuracy of a mining model. You can find the full report here:

What makes this article a must-read for me is that it doesn't only cover methods for accuracy testing (lift charts, decile tables, target shuffling, bootstrap sampling, cross validation) but also contains many practical examples and touches various side aspects that are important for data mining: for example, when John Elder talks about recognizing false patterns like the Redskins Rule, or when Dean Abbott shows how easily models can be overfit and what methods you have to correct them. I especially like one quotation by John Elder: “Statistics is not persuasive to most people—it's just too complex”. And this is true: from my own experience it is very important not only to design a good prediction model but also to assure the business decision makers that they can trust the results and base their decisions on them. Target shuffling, as described in the article, may be one promising approach to make business people trust the results of a predictive model without being a master of science in statistics.

Again, the full article can be found here:

Sunday, March 16, 2014

Converting events to hourly based aggregations

PDW 2012 | SQL Server 2012

For today’s post I didn’t find a really good title. Here’s what this post is about: Sometimes you’ll find event based data in your source system (something happens at a specific point in time) but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions.

There are some approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let’s assume you’re working for a retailer who is interested in the number of cash desks being open at a time. A subset of the event data (single day, single store) may look like this:


In this case cash desk 2 opened at 06:35, then desk 1 opens at 8:27, then desk 2 closes at 11:58 and so on. The question is: how many cash desks are open, for example, from 08:00 to 09:00? If a cash desk is only open for half an hour in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks were open (desk 2 for the full hour and desk 1 for half an hour).

In order to get the number of cash desk being open, we first convert the transaction type to a delta: +1 means a desk opens, –1 means a desk closes. Here is the query together with the result:

select
    TransactionDate
  , StoreID
  , TransactionTime
  , CashdeskID
  , case TransactionType
      when 'signon' then 1
      when 'signoff' then -1
      else 0
    end CashdeskDelta
from (
  select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType
  from POSData where TransactionType in ('signon','signoff')) P
order by 1,2,3

The result (again a subset) may look like this:


After transforming the transaction type to a numeric value, we can aggregate it using a window function. Therefore I’m using the query from above as a sub query:

select *,
  sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,
  datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta
from (
  -- query from above --
) CheckoutDetails
order by 1,2,3

Again, this shows the power of the window functions. The query gives us the number of open cash desks together with the number of seconds to the next event.


For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution but we still have no hour based time frame in the data. However, this can easily be created by cross joining the dimensions for store and time. For my example, I have no store or time dimension (as you should usually have), so I’m using the table sys.all_objects here to generate a sufficient number of data rows:

with Timeslot AS (
  select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
  (select top 24 convert(time,dateadd(hour,row_number() over (order by [name])-1,'00:00:00')) TimeGrid
   from sys.all_objects) T1
  cross join
  (select distinct TransactionDate, StoreID from POSData) T2
)


The query creates one row per hour for each store and each date. Again, usually you would use your existing dimension tables instead of the sys.all_objects table here.

Now, let’s bring both queries together:


with Timeslot AS
(
  .. -- timeslot query from above
),
CashDeskTimesDetails as
(
select *,
  sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) CashdeskOpen
  ,sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])
   * TimeDeltaSeconds CashdeskOpenSeconds
  ,convert(time, dateadd(hour, datediff(hour, 0, TransactionTime), 0)) TransactionHour
from (
  select
      TransactionDate
    , StoreID
    , TransactionTime
    , coalesce(
        datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by
          TransactionDate, StoreID order by [TransactionTime])),
        datediff(s,[TransactionTime],dateadd(day,1,0)) -- fill seconds to end of day
      ) TimeDeltaSeconds
    , CashdeskID
    , case TransactionType
        when 'signon' then 1
        when 'signoff' then -1
        else 0
      end CashdeskDelta
  from (
    select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType from
    POSData where TransactionType in ('signon','signoff')
    union all
    select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot
  ) P
) CheckoutDetails
)
select * from CashDeskTimesDetails
order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).


Some things to pay special attention to:

  • The inserted timeslots are created with a cash desk delta value of 0, so they do not modify the number of open/closed desks (column CashdeskOpen)
  • In blue: the last time slot has no subsequent timeslot, so the window function returns null. We override this with the number of seconds until day's end.
  • In red: We add the base hour to each row. It will be used for a group-by operation in the following step.

Finally, we simply need to aggregate the last query result:

select
  TransactionDate, TransactionHour, StoreID,
  Sum(convert(float,CashdeskOpenSeconds)) / Sum(convert(float,TimeDeltaSeconds)) CashdeskCount
from CashDeskTimesDetails
where TimeDeltaSeconds<>0
group by TransactionDate, TransactionHour, StoreID
order by 1,2,3


Here is the final result for the sample data subset from above:


Removing all filters (all dates, all stores) may result in a graph like this:


So this post showed how to transform event based data to a fixed time scale (hours in this case) so that it matches a star schema join to the time dimension. Since we only used SQL, this process can easily be used in an ELT loading scenario.

Sunday, March 2, 2014

ETL or ELT… or both??

PDW 2012 | SQL Server 2005-2014

With database servers getting more and more powerful, some traditional concepts of Business Intelligence solutions may be reconsidered. One of those concepts, for which you can find a lot of lively debate recently, is the question of whether to use ETL or ELT.

Here are just a few of the blog posts and discussions you can find on this topic:

In short, the main differences are summarized below:

ETL (Extract-Transform-Load)
  • Transformation is done in the ETL tool (data flow pipeline); only the finally prepared data is loaded into the data warehouse database
  • Well established tools are available, with debugging, built-in logging, configuration, error handling, process reporting and statistics
  • A comprehensive, easy to maintain data flow makes it easy to merge data from different sources, use data quality tools and deal with error rows individually
  • Pipeline tools support multiple cores, but parallel IO has to be solved programmatically: you have to do something (for example use the Balanced Data Distributor SSIS component or merge multiple inputs from the same source table)
  • ETL tools are built for row based operations; operations that need to be performed on a set of rows (like sort, aggregate or calculations covering multiple rows) are harder to solve. I wrote some posts recently about ELT calculations that are relatively difficult to solve in ETL.

ELT (Extract-Load-Transform)
  • The original data is loaded into the database first; SQL is then used to transform it into the dimensional model
  • ELT fully benefits from the database power, the query optimizer and so on
  • Especially for MPP environments (like the PDW): scaling the database scales ELT process performance as well
  • SQL code is easier to maintain in source control systems (like TFS) than ETL packages (complex XML)
  • SQL is harder to read, structure and document compared to ETL packages
  • You need discipline, as minor mistakes can lead to errors that are hard to track down (e.g. too many resulting rows from a join operation if a key is missing in the join)

This comparison is far from complete, and if you read the links above (and many others on this topic) you'll find a lot more pros, cons and opinions. In fact, I don't want to say one is better than the other. But here is what we recently found to work well in a large project using the Parallel Data Warehouse (PDW) for an initially 30 TB (and growing) database. The following illustration, which I recently used at SQL Conference 2014 in Germany, shows the key concepts:



We’re using ETL (Microsoft SQL Server Integration Services) to

  • Orchestrate the load process
    • workflow management (make sure the right things happen in the right order)
    • dealing with technical challenges (e.g. temporary tables, partition switching on PDW)
    • implement configuration management (for example server and database names)
    • logging and process monitoring (reports)
  • Load dimensions (small amount of data)
    • Collecting master data from source systems, merge and prepare this data
    • Generation of surrogate keys for the dimensions
    • Keeping track of historical changes (modeled as SCD2 or intermediate bridge tables)
    • Building up all dimensions and transferring the data to the PDW (using a reload operation)
    • Early arriving facts (create missing dimension rows, distinct counts run on PDW)


  • Integration Services (SSIS) well suited for these tasks
  • SMP SQL Server offers good support for dimension processing tasks (identity column, T-SQL merge statement etc.)
  • Additional services like SQL Server Data Quality Service (DQS) and SQL Server Master Data Services (MDS) are currently not supported to run on the PDW
    This is also true for more sophisticated tasks and the use of web services, for example to find duplicate customers, to correct misspelled street names, or to guess the gender from the first name. Also, if you need to use custom assemblies, for example to access special source systems or to include specific calculations, ETL tools are the better choice.


Then, we’re using ELT (distributed SQL, DSQL) on the Parallel Data Warehouse to

  • process fact data (large amount of data) after it is bulk loaded with no modifications into a staging database on the PDW
    • Data preparation (for example removing duplicate rows)
    • Linking fact table data to dimensions
    • Performing calculations (using SQL window functions intensively)
  • Merge new data to archive
    • store the data in the persisted stage area (without creating duplicates if the data was there already)


  • Much better performance observed compared to SMP SQL Server/SSIS
    • in our case, usually about 10-20 times faster, depending on the source data and the transformations
    • In some cases (for example removing duplicate rows in the source data) even 100 times faster
  • Faster loads allow us to fully reload many TB in case this is needed (this gives more options for backup strategies and for the dimensional modeling)
  • The solution will directly benefit from future MPP scale-out without any coding effort



ETL and ELT may work well together. In this scenario we did the dimension processing as well as the full workflow management using ETL tools (SSIS on an SMP SQL Server) and the processing of the large transactional tables using ELT (distributed SQL on PDW).

Sunday, February 16, 2014

Parallel Data Warehouse (PDW) and ROLAP

PDW 2012 | SQL Server 2012 | SQL Server 2014

This post is about using the Parallel Data Warehouse as a ROLAP source for SSAS. For PDW v1 this wasn't recommended, but the quintessence of this post is that this really works well with PDW 2012. In fact, this is the first time I saw MOLAP performance on large ROLAP tables (over a billion rows); again, another big plus for the PDW with the columnstore index. In fact, I'm really excited about this (and I'll tell you why in a minute), but maybe I wasn't loud enough. So here again:

“Using SSAS ROLAP with PDW 2012 is working really well!”

But, and I have to lower my voice again, I have to agree with Chris Webb that there is almost no information about it out there. So enough reason to write about this truly amazing story.

Before I’m going into some relevant topics, let me briefly recap the benefits of ROLAP against MOLAP:

(+) LOW LATENCY: No need to process MOLAP partitions; data in the relational data warehouse tables is immediately available to the end users.

(+) NO/LESS STORAGE REQUIRED: The ROLAP cube only contains the model, not the data, so almost no disk space is required for storing the cube; it's just the presentation of the model. The MOLAP/ROLAP choice is a technical implementation detail that is not visible to the end user: in both cases, end users get an easy to use, highly interactive, quickly responding data model that can be used from many tools, including Excel pivot tables, Reporting Services, Tableau and other advanced analytical frontend tools. Also, there is no need to design and maintain partitions in the cube (see remarks regarding partitioning below): less development and maintenance effort (for example for daily delta updates).

(+) MORE FLEXIBLE: In MOLAP, many changes to a dimension require full processing of the dimension, which switches all attached measure group partitions to the ‘unprocessed’ state, so they need to be processed again. If you have a large cube, this process can take many hours. In ROLAP, none of this is necessary; changes to cube dimensions are online immediately.

(+) EASY DEPLOYMENT: Development, testing and deployment to production are much easier since the data is immediately available to end users.

(+) SUPPORTS LARGE DIMENSIONS: Large dimensions (with many million rows) are difficult to handle for MOLAP SSAS; processing takes a long time and query performance may go down. ROLAP works well with large dimensions.

(+) SUPPORTS VERY LARGE FACT TABLES: MOLAP cube sizes of 4 or 5 TB are possible, and due to the compression in the cube storage this corresponds to fact table sizes of 50 TB and more. Beyond that, however, there is a point where only ROLAP cubes can handle the amount of data.

So there are many advantages to using ROLAP partitions in SSAS. However, there always was one big disadvantage:

(-) BAD PERFORMANCE: Poor query performance for ROLAP partitions compared to MOLAP partitions.

Now, with the memory optimized columnstore index, and especially with the parallel query engine of the PDW, you can get incredibly good query performance from ROLAP partitions. Therefore, we can cross out this disadvantage:

Poor query performance for ROLAP partitions compared to MOLAP partitions.
With a columnstore index, ROLAP partitions are really fast.

And since the columnstore index is also available on SQL Server 2012 (nonclustered, read-only) and 2014 (clustered, updatable), this should also apply to the SMP SQL Server (though I haven't tested it with huge amounts of data).
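For the SMP SQL Server, the corresponding DDL can be sketched like this (table and column names are placeholders):

```sql
-- SQL Server 2012: nonclustered columnstore index; the table becomes read-only
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
ON dbo.FactSales (DateKey, StoreKey, SalesAmount);

-- SQL Server 2014: clustered, updatable columnstore index
-- CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;
```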

Here are some remarks/recommendations if you’re planning ROLAP on PDW:

Clustered columnstore index

As mentioned above, the clustered columnstore index of the PDW is the key to using ROLAP on PDW 2012 and maybe the most important reason why ROLAP is now a reliable option on PDW at all. So make sure your (fact) tables are stored in clustered columnstore mode.
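On PDW 2012 the storage mode is declared when the table is created. A minimal sketch with hypothetical names:

```sql
CREATE TABLE dbo.FactSales
(
    DateKey     int   NOT NULL,
    StoreKey    int   NOT NULL,
    SalesAmount money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(StoreKey),
    CLUSTERED COLUMNSTORE INDEX
);
```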


Fast network connection between PDW and Analysis Services

Obviously, a fast network connection between the SSAS server and the PDW is important for good performance. Of course this is also true for MOLAP or mixed environments. As of today, I would recommend adding the SSAS server to the appliance's InfiniBand network.


Table layout: distributed/replicated

Most of the ROLAP queries will basically look like

select Dim1.Property1, Dim2.Property2, Sum(Fact1.Amount) SumOfAmount
from Fact1
inner join Dim1 on Fact1.Dim1Key=Dim1.Dim1Key
inner join Dim2 on Fact1.Dim2Key=Dim2.Dim2Key
group by Dim1.Property1, Dim2.Property2

In order to have queries like this respond well, the tables should be distribution compatible. In many cases you can achieve this by turning the dimension tables into replicated tables. I have more detailed explanations of distribution and aggregation compatibility in some older posts, and there is also a good post by Stephan Köppen about this topic here. An incompatible distribution when joining two large fact tables (for example a fact table with a many-to-many bridge table) results in shuffle move or even broadcast move operations, which are still fast, but not lightning fast, as you would expect for online analytical applications. So my recommendation is to carefully choose the distribution keys so that the required joins can be resolved locally. Aggregation compatibility is more difficult to achieve for all types of queries. However, from my current experience the PDW responded very fast even if the query was not aggregation compatible.
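As a sketch (with hypothetical names), a dimension table can be made distribution compatible with a hash distributed fact table simply by replicating it:

```sql
-- a full copy of the dimension is kept on every compute node,
-- so joins to distributed fact tables can be resolved locally
CREATE TABLE dbo.DimStore
(
    StoreKey  int           NOT NULL,
    StoreName nvarchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);
```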



Partitioning

Having a large amount of data in MOLAP or ROLAP cubes usually requires partitioning of the MOLAP/ROLAP measure groups. For MOLAP, recommendations range from about 20 to 50 million rows per partition, so storing a billion rows results in at least 20-50 partitions. In practical scenarios you often end up with many more partitions in order to implement daily incremental loading. But because the PDW is designed to run large queries, it's much better to use only one partition instead of firing a barrage of small queries at the appliance. Internally the PDW uses distributions to run each query on all cores in parallel, so there is no need to create partitions for performance reasons.
By the way, since many small queries require more computing power than a few complex queries, you should be careful with Excel's pivot option “convert to formula”…


Distinct Count

As described in this blog post by Anthony Mattas (and in many other posts), you should set EnableRolapDistinctCountOnDataSource in the Analysis Services properties in order to compute the distinct count calculation on the database instead of fetching the distinct rows to Analysis Services.

Please note that this property is not yet available in the server properties but must be set manually in the msmdsrv.ini file (which can be found below the instance in the OLAP\Config sub directory).
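The entry itself is just one XML element; where exactly it sits inside msmdsrv.ini depends on the SSAS build, so treat this as a sketch and back up the file before editing:

```xml
<!-- msmdsrv.ini: compute ROLAP distinct counts on the data source -->
<EnableRolapDistinctCountOnDataSource>1</EnableRolapDistinctCountOnDataSource>
```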



Having all your table statistics up to date

This is generally a very important point when working with the PDW, not only when using ROLAP. While the compute nodes have auto create/auto update statistics enabled, statistics are not (apart from very few cases) automatically created or updated on the control node. Without proper statistics, the PDW cannot create an optimal distributed query plan. Put simply: in most cases where performance is an issue on the PDW, incorrect statistics are the root cause.


Dealing with large dimensions

In some cases, having a measure group with many attached dimensions can cause problems if those dimensions are referenced in the query (on rows, columns, filter). I’m currently trying to narrow this down, but one possible reason could be the missing primary key constraints on the PDW together with large dimensions. Consider this simple query:

select P.Color, Sum(S.SalesAmount) SumOfSalesAmount
from [dbo].[FactInternetSales] S
inner join [dbo].[DimProduct] P on S.ProductKey=P.ProductKey
group by P.Color

If you have a primary key on dbo.DimProduct.ProductKey, the optimizer knows that the inner join cannot produce more rows than exist in the fact table, because for each row from the fact table we can find at most one row in the dimension table. Without the primary key (which is the situation on the PDW), the optimizer has to rely on density information from the statistics. This usually works pretty well, but let's say that for a larger dimension the statistics suggest something like: "for each row from the fact table, you might be getting 1.3 rows from the dimension table". For a single join, not much harm is done. But if you have many dimensions, the effect grows exponentially. With 8 dimensions and a 30% overestimate per join you end up with a factor of 1.3^8 ≈ 8.16. So instead of querying, for example, a billion rows, the optimizer thinks we're about to query 8 billion rows. This can have a huge effect on the query plan. If you encounter such issues, one option is to convert the dimensions in the data source view to a query binding. For example, the query for the product dimension may look like this:

select
ProductKey,
Min(Color) Color,
Min(EnglishProductName) EnglishProductName,
Min(ListPrice) ListPrice
from DimProduct
group by ProductKey

Since ProductKey is actually a logical primary key, rewriting the dimension query this way gives the same result as

select ProductKey, Color, EnglishProductName, ListPrice from DimProduct

but because of the group by operation, the optimizer now knows for sure that ProductKey is unique, thus giving a better estimate of the resulting rows.

Again, I’m still investigating these cases and the benefit of the query rewrite, but if you encounter performance issues, this may be one option to try.


Aggregation design and proactive caching

Since ROLAP partitions rely on indexed views in the data source, you cannot use ROLAP aggregations on PDW 2012. However, judging from the query response times we have seen so far, there might not be much need for aggregations at all if your data is stored in a clustered columnstore index. If you do need aggregations, you could try HOLAP aggregations. We haven't tried this yet, but I'm planning to do more investigation.

Proactive caching is currently only supported in polling mode (not with the trace mechanism).


Since we’re still in the process of adjusting our PDW ROLAP environment I’m going to write more posts with tips and tricks, so stay tuned. Actually, we’re investigating these topics:

  • How does ROLAP perform with role based security in the SSAS cube?
  • How does ROLAP perform with many users?
  • How does ROLAP work with more complicated MDX calculations involving PARALLELPERIOD, aggregates (AGGREGATE, SUM, MIN, MAX etc.) over dynamic sets, and so on? Which MDX calculations perform better in ROLAP, and for which should we still use MOLAP?

Also some interesting recommendations (for example regarding the “count of rows” aggregation or the proper setup of your Analysis Services server) can be found in the SQLCAT Whitepaper Analysis Services ROLAP for SQL Server Data Warehouses.

Sunday, February 9, 2014

Removing duplicate rows in large tables

PDW 2012 | SQL Server 2012

Removing duplicate rows in large tables can be a challenging operation. As long as the table does not contain too many rows, several approaches will lead to the desired result. Recently, I had a table of about 2.5 billion rows and 125 columns. The table is located on a Microsoft Parallel Data Warehouse (PDW), so I had to perform the cleaning of the table using distributed SQL (DSQL).

For simplicity, let’s assume the table has columns key1, key2, key3, key4, key5 and col1, … col120. The table is distributed by key1. The rows should be unique regarding the five key columns but this is not guaranteed, since the key columns do not form a database enforced primary key.

Such tasks are usually easy to solve on a PDW using a CTAS (create-table-as-select) statement. For my example I used the following statement:

create table new_table
with (distribution = hash (key1), clustered columnstore index)
as select * from
(select *
, row_number() over (partition by key1,key2,key3,key4,key5 order by col1) RowNr
from my_table
) X
Where X.RowNr = 1

One remark regarding the window function (row_number over): make sure that the distribution key is the first column in the partition list, so the computation of the row number can happen on each compute node separately without re-distributing the data. And another remark regarding the "select * …": usually this is not good SQL style, but in this case we really want to transfer all columns, and the DSQL pre-processor expands the * to the corresponding field list before sending the query to the compute nodes (this can be seen in the query log).

Running this query on the full set of 2.5 billion rows took about 1 hour 15 minutes to complete.

Analyzing the query details showed that the sub-query for finding the duplicate rows was easy to solve for the PDW. Wrapping the sub-select in a count returned after about 1 minute giving the exact number of unique rows:

Select count(*) from
(select *
, row_number() over (
partition by key1,key2,key3,key4,key5
order by col1) RowNr
from my_table
) X
Where X.RowNr = 1

So most of the time for this query is spent copying data from one table to another. While it's usually a great idea to CTAS into a new table, it can be a bad choice in this scenario, where you have a lot of rows with many columns and relatively few duplicates.

Here is the approach, I finally used:

  1. Write the keys of the duplicate rows to a new temporary table #DupKeyStore
  2. CTAS only those rows from the original table that have a match in #DupKeyStore into another temporary table #UniqueRowStore, removing duplicates with the row_number approach from above.
  3. Delete all rows from the original table that match the keys in the #DupKeyStore table. After this step, the original table contains no duplicates anymore; the rows that had duplicates are removed entirely.
  4. Insert the rows of the table #UniqueRowStore back into the original table

Here is the code in more detail.

Step 1

create table #DupKeyStore
with (LOCATION = USER_DB, distribution=hash(key1))
as
select
key1, key2, key3, key4, key5
, convert(nvarchar,key1)
+'-'+convert(nvarchar,key2)
+'-'+convert(nvarchar,key3)
+'-'+convert(nvarchar,key4)
+'-'+convert(nvarchar,key5) KeyLookup
from
(select key1, key2, key3, key4, key5
from mytable
group by key1, key2, key3, key4, key5
having count(*) > 1
) X

The query took 1 minute to complete on the full data set of 2.5 billion rows. The ‘KeyLookup’ column looks strange but I will explain this later in the process.

Step 2

create table #UniqueRowStore
with (LOCATION = USER_DB, distribution=hash(key1))
as
select * from
(select Q.*,
row_number() over (partition by
Q.key1, Q.key2, Q.key3, Q.key4, Q.key5
order by Q.key5) RowNr
from mytable Q
inner join #DupKeyStore K
on Q.key1=K.key1
And Q.key2=K.key2
And Q.key3=K.key3
And Q.key4=K.key4
And Q.key5=K.key5
) X
where RowNr=1

Again, this query took about 1.5 minutes to transfer the duplicate rows into a new table while removing duplicates. The main performance gain here is that the query does not need to transfer all rows, but only the duplicate ones. Another thing to notice here is that I’m using a heap for this intermediate table (not a columnstore index).

Step 3



delete from mytable
where convert(nvarchar,key1)
+'-'+convert(nvarchar,key2)
+'-'+convert(nvarchar,key3)
+'-'+convert(nvarchar,key4)
+'-'+convert(nvarchar,key5)
IN (select KeyLookup from #DupKeyStore)

Ok, this query needs some remarks. First, we use the concatenated KeyLookup column from above. I must admit that this looks somewhat ugly. It would be better to use a join in the delete, but this is currently not supported on the PDW. Another option would be to use EXISTS (select …) in the where condition, which would avoid the ugly concatenation of the keys. However, during my tests, this approach took a lot more time to complete, so I ended up with the statement above.

For my example, the query took about 1 minute to complete.

Step 4

insert into mytable
select key1, key2, key3, key4, key5, col1, col2, … col120
from #UniqueRowStore

The insert was quite fast, only taking 30 seconds.

At a total of 4 minutes the 2.5 billion rows were cleaned from duplicates. Compared to the 1 hour 15 minutes using the first approach, this is a huge performance boost. And having more compute nodes would result in a much shorter query time for this task.

Before extending this solution to even more rows, consider the following topics:

  1. Check the number of duplicates first (table #DupKeyStore). If there are too many duplicates (I would say more than 5% of the total rows) consider the CTAS operation instead of the delete/insert operation.
  2. The delete/insert operations require tempdb space. For a larger amount of rows, these statements should be split to batches having one transaction for a bunch of keys.
    Since delete top(nnn) is not supported on PDW, SET ROWCOUNT does not work here either, and there is no @@ROWCOUNT. A good approach is to add a "cluster column" (e.g. a row_number() modulo some batch count) to #DupKeyStore and to use this column for splitting the work into batches. Remember to wrap each delete and the corresponding insert into a single transaction. See this post by Stephan Köppen for details.
  3. CTAS operations writing to clustered columnstore index tables perform better using a higher resource class. This could also be considered when using workarounds like the one from above.
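The batching idea from point 2 above could be sketched as follows. This is only an illustration: the bucket column name (BucketNr), the choice of 10 buckets, and the loop mechanics are made up and would need adjusting to your environment.

```sql
-- Step 1 variant: add a bucket column while building #DupKeyStore,
-- e.g. 10 buckets derived from a row number
create table #DupKeyStore
with (LOCATION = USER_DB, distribution = hash(key1))
as
select key1, key2, key3, key4, key5
, convert(nvarchar, key1) + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3) + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5) KeyLookup
, row_number() over (order by key1) % 10 BucketNr
from (
    select key1, key2, key3, key4, key5
    from mytable
    group by key1, key2, key3, key4, key5
    having count(*) > 1
) X;

-- Then delete and re-insert bucket by bucket, one transaction per bucket
-- (repeat for BucketNr = 0 .. 9)
begin transaction;

delete from mytable
where convert(nvarchar, key1) + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3) + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5)
  in (select KeyLookup from #DupKeyStore where BucketNr = 0);

insert into mytable
select U.key1, U.key2, U.key3, U.key4, U.key5, U.col1, U.col2 /* … col120 */
from #UniqueRowStore U
inner join #DupKeyStore K
  on  U.key1 = K.key1 and U.key2 = K.key2 and U.key3 = K.key3
  and U.key4 = K.key4 and U.key5 = K.key5
where K.BucketNr = 0;

commit;
```

Keeping the delete and insert for one bucket inside the same transaction ensures that a failed batch never leaves the table with the de-duplicated rows missing.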

So, that was it for today’s post. Ah, no, I’ve forgotten something:

drop table #DupKeyStore;
drop table #UniqueRowStore;