
Sunday, February 16, 2014

Parallel Data Warehouse (PDW) and ROLAP

PDW 2012 | SQL Server 2012 | SQL Server 2014

This post is about using the Parallel Data Warehouse as a ROLAP source for SSAS. For PDW v1 this wasn’t recommended, but the essence of this post is that it really works well with PDW 2012. In fact, this is the first time I have seen MOLAP-like performance on large ROLAP tables (over a billion rows), which is another big plus for the PDW with its columnstore index. I’m really excited about this (and I’ll tell you why in a minute), but maybe I wasn’t loud enough. So here it is again:

“Using SSAS ROLAP with PDW 2012 is
working really well!!!”

But, lowering my voice again, I have to agree with Chris Webb that there is almost no information about it out there. So that is reason enough to write about this truly amazing story.

Before going into some relevant topics, let me briefly recap the benefits of ROLAP over MOLAP:

  • LOW LATENCY: No need to process MOLAP partitions. Data in the relational data warehouse tables is immediately available to the end users.
  • NO/LESS STORAGE REQUIRED: The ROLAP cube only contains the model, not the data, so almost no disk space is required for storing the cube. It is just the presentation of the model. The MOLAP/ROLAP choice is a technical implementation detail which is not visible to the end user. With both options, the end user gets an easy to use, highly interactive, quick-responding data model, which can be used from many tools including Excel pivot tables, Reporting Services, Tableau and other advanced analytical frontend tools.
  • LOWER PROJECT COSTS: No need to design and maintain partitions in the cube (see the remarks regarding partitioning below): less development and maintenance effort (for example for daily delta updates).
  • MORE FLEXIBLE: In MOLAP, many changes to a dimension require a full processing of the dimension, which causes all attached measure group partitions to switch to the ‘unprocessed’ state and to require reprocessing. If you have a large cube, this can take many hours. In ROLAP, none of this is necessary; changes to cube dimensions are available online immediately.
  • EASY DEPLOYMENT: Development, testing and deployment to production are much easier since the data is immediately available to end users.
  • SUPPORTS LARGE DIMENSIONS: Large dimensions (with many millions of rows) are difficult to handle for MOLAP SSAS: processing takes a long time and query performance may go down. ROLAP works well with large dimensions.
  • SUPPORTS VERY LARGE FACT TABLES: MOLAP cube sizes of 4 or 5 TB are possible and, due to the compression in the cube storage, this corresponds to fact table sizes of 50 TB and more. However, if you go beyond that, there is a point where only ROLAP cubes can handle the amount of data.

So there are many advantages to using ROLAP partitions in SSAS. However, there has always been one big disadvantage:

  • BAD PERFORMANCE: Poor query performance for ROLAP partitions compared to MOLAP partitions.

Now, with the memory-optimized columnstore index, and especially with the parallel query engine of the PDW, you can get incredibly good query performance from ROLAP partitions. Therefore, we can cross out this disadvantage:

  • GOOD PERFORMANCE (formerly BAD PERFORMANCE): With a columnstore index, ROLAP partitions are really fast, so the poor query performance of ROLAP compared to MOLAP partitions is no longer an issue.

And since columnstore indexes are also available in SQL Server 2012 (non-clustered, read-only) and SQL Server 2014 (clustered, updatable), this should also apply to the SMP SQL Server (I haven’t tested it with huge amounts of data though).
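As a minimal sketch for the SMP versions (table and column names here are assumed placeholders, not taken from this post):

-- SQL Server 2012: non-clustered, read-only columnstore index on an existing fact table
create nonclustered columnstore index ncci_FactSales
on dbo.FactSales (DateKey, ProductKey, CustomerKey, SalesAmount);

-- SQL Server 2014: clustered, updatable columnstore index
-- (an existing clustered row-store index would have to be dropped or converted first)
create clustered columnstore index ccsi_FactSales
on dbo.FactSales;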

Here are some remarks/recommendations if you’re planning ROLAP on PDW:

Clustered columnstore index

As mentioned above, the clustered columnstore index of the PDW is the key to using ROLAP on PDW 2012 and maybe the most important reason why ROLAP is now a reliable option on PDW at all. So make sure your (fact) tables are stored in clustered columnstore mode.
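For example, an existing row-store fact table can be rebuilt in clustered columnstore mode with a CTAS statement (a minimal sketch; table and column names are assumptions, and the same CTAS pattern appears again later in this post):

-- rebuild the fact table as a clustered columnstore table
create table FactSales_ccsi
with (distribution = hash(CustomerKey), clustered columnstore index)
as
select * from FactSales;

-- after checking the row counts, swap the tables
rename object FactSales to FactSales_old;
rename object FactSales_ccsi to FactSales;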

 

Fast network connection between PDW and Analysis Services

Obviously, a fast network connection between the SSAS server and the PDW is important for good performance. Of course this is also true for MOLAP or mixed environments. As of today, I would recommend adding the SSAS server to the appliance’s InfiniBand network.

 

Table layout: distributed/replicated

Most of the ROLAP queries will basically look like this:

select Dim1.Property1, Dim2.Property2, Sum(Fact1.Amount) SumOfAmount
from Fact1
inner join Dim1 on Fact1.Dim1Key=Dim1.Dim1Key
inner join Dim2 on Fact1.Dim2Key=Dim2.Dim2Key
group by Dim1.Property1, Dim2.Property2

For queries like this to perform well, the tables should be distribution-compatible. In many cases you can achieve this by turning the dimension tables into replicated tables. I have more detailed explanations of distribution and aggregation compatibility in some older posts, and there is also a good post by Stephan Köppen about this topic here. An incompatible distribution when joining two large fact tables (for example a fact table with a many-to-many bridge table) results in shuffle move or even broadcast move operations, which are also fast, but not lightning fast as you would expect for online analytical applications. So my recommendation is to carefully choose the distribution keys so that the required joins can be resolved locally. Aggregation compatibility is more difficult to achieve for all types of queries; however, from my experience so far, PDW responds very fast even if the query is not aggregation-compatible.
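As an illustration, a minimal sketch of a distribution-compatible layout for the query above (the column types are assumptions): the dimension tables are replicated to every compute node and the large fact table is hash-distributed, so the joins can be resolved locally.

create table Dim1
(Dim1Key int not null, Property1 nvarchar(50))
with (distribution = replicate);

create table Dim2
(Dim2Key int not null, Property2 nvarchar(50))
with (distribution = replicate);

create table Fact1
(Dim1Key int not null, Dim2Key int not null, Amount money)
with (distribution = hash(Dim1Key), clustered columnstore index);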

 

Partitions

Having a large amount of data in MOLAP or ROLAP cubes usually requires partitioning of the MOLAP/ROLAP measure groups. For MOLAP, recommendations are roughly 20-50 million rows per partition, so storing a billion rows results in at least 20-50 partitions. In practical scenarios you often end up with many more partitions in order to implement daily incremental loading. But because PDW is designed to run large queries, it is much better to use only one partition instead of firing a scattershot of small queries at the appliance. Internally, the PDW uses distributions to run the query on all cores in parallel, so there is no need to create partitions for performance reasons.
By the way, since many small queries require more computing power than a few complex queries, you should be careful with Excel’s pivot option “convert to formula”…

 

Distinct Count

As described in this blog post by Anthony Mattas (and in many other posts), you should set EnableRolapDistinctCountOnDataSource in the Analysis Services properties so that distinct count calculations are computed on the database instead of fetching the distinct rows into Analysis Services.

Please note that this property is not yet available in the server properties dialog but must be set manually in the msmdsrv.ini file (which can be found below the instance in the OLAP\Config subdirectory).
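Roughly speaking, the difference is whether SSAS asks the PDW for all distinct key values and counts them itself, or pushes the count down to the appliance. A minimal sketch of the two query shapes (the exact SQL that SSAS generates will differ; Fact1/Dim1 are the example tables from above, and CustomerKey is just an assumed distinct count key):

-- without the setting: the distinct members are fetched and counted by Analysis Services
select Dim1.Property1, Fact1.CustomerKey
from Fact1
inner join Dim1 on Fact1.Dim1Key = Dim1.Dim1Key
group by Dim1.Property1, Fact1.CustomerKey

-- with EnableRolapDistinctCountOnDataSource: the count(distinct) runs on the PDW
select Dim1.Property1, count(distinct Fact1.CustomerKey) DistinctCustomers
from Fact1
inner join Dim1 on Fact1.Dim1Key = Dim1.Dim1Key
group by Dim1.Property1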


 

Having all your table statistics up to date

This is generally very important when working with the PDW, not only when using ROLAP. While the compute nodes have auto create/auto update statistics enabled, statistics are not (apart from very few cases) automatically created or updated on the control node. Without proper statistics, PDW cannot create an optimal distributed query plan. Put simply, in most cases where performance is an issue on the PDW, missing or outdated statistics are the root cause.
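A minimal sketch of maintaining statistics for the example tables from above (the statistics names and column choices are just assumptions):

-- create multi-column statistics on the join and group-by columns of the fact table
create statistics stat_Fact1_DimKeys on Fact1 (Dim1Key, Dim2Key);
create statistics stat_Dim1_Property1 on Dim1 (Property1);

-- refresh the statistics after larger loads
update statistics Fact1;
update statistics Dim1;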

 

Dealing with large dimensions

In some cases, having a measure group with many attached dimensions can cause problems if those dimensions are referenced in the query (on rows, columns, filter). I’m currently trying to narrow this down, but one possible reason could be the missing primary key constraints on the PDW together with large dimensions. Consider this simple query:

select P.Color, Sum(S.SalesAmount) SumOfSalesAmount
from [dbo].[FactInternetSales] S
inner join [dbo].[DimProduct] P on S.ProductKey=P.ProductKey
group by P.Color

If you have a primary key on dbo.DimProduct.ProductKey, the optimizer knows that the inner join cannot produce more rows than exist in the fact table, because for each row from the fact table at most one row can be found in the dimension table. Without the primary key (which is the situation on the PDW) the optimizer has to rely on density information from the statistics. This works pretty well, but let’s say that for a larger dimension the statistics indicate something like: “for each row from the fact table, you might get 1.3 rows from the dimension table”. For a single join, not much harm is done. But if you have many dimensions, the effect grows exponentially: with 8 dimensions and a 30% overestimate per join you end up at 1.3^8 ≈ 8.16. So instead of querying, for example, a billion rows, the optimizer thinks we are about to query roughly 8 billion rows. This can have a huge effect on the query plan. If you encounter such issues, one option is to convert the dimensions in the data source view to a query binding. For example, the query for the product dimension may look like this:

select
ProductKey,
Min(Color) Color,
Min(EnglishProductName) EnglishProductName,
Min(ListPrice) ListPrice
from DimProduct
group by ProductKey

Since ProductKey is actually a logical primary key, rewriting the dimension query this way gives the same result as

select ProductKey, Color, EnglishProductName, ListPrice from DimProduct

but because of the group by operation, the optimizer now knows for sure that ProductKey is unique, thus giving a better estimate of the resulting rows.

Again, I’m still investigating these cases and the benefit of the query rewrite, but if you encounter performance issues, this may be one option to try.

 

Aggregation design and proactive caching

Since ROLAP aggregations rely on indexed views in the data source, you cannot use ROLAP aggregations on PDW 2012. However, from the query response times we have seen so far, there might not be much need for aggregations at all if your data is stored as a clustered columnstore index. If you need aggregations, you could try HOLAP aggregations. We haven’t tried this so far, but I’m planning to do more investigation.

Proactive caching is currently only supported in polling mode (not with the trace mechanism).

 

Since we’re still in the process of adjusting our PDW ROLAP environment, I’m going to write more posts with tips and tricks, so stay tuned. Currently, we’re investigating these topics:

  • How does ROLAP perform with role based security in the SSAS cube?
  • How does ROLAP perform with many users?
  • How does ROLAP work with more complicated MDX calculations involving PARALLELPERIOD, aggregates (AGGREGATE, SUM, MIN, MAX etc.) over dynamic sets, etc.? Which MDX calculations perform better, and for which calculations should we still use MOLAP?

Also some interesting recommendations (for example regarding the “count of rows” aggregation or the proper setup of your Analysis Services server) can be found in the SQLCAT Whitepaper Analysis Services ROLAP for SQL Server Data Warehouses.

Sunday, February 9, 2014

Removing duplicate rows in large tables

PDW 2012 | SQL Server 2012

Removing duplicate rows from large tables can be a challenging operation. As long as the table does not contain too many rows, several approaches will lead to the desired result. Recently, I had a table of about 2.5 billion rows and 125 columns. The table is located on a Microsoft Parallel Data Warehouse (PDW), so I had to perform the cleaning of the table using distributed SQL (DSQL).

For simplicity, let’s assume the table has columns key1, key2, key3, key4, key5 and col1, … col120. The table is distributed by key1. The rows should be unique regarding the five key columns but this is not guaranteed, since the key columns do not form a database enforced primary key.

Such tasks are usually easy to solve on a PDW using a CTAS (create-table-as-select) statement. For my example I used the following statement:

create table new_table
with (distribution = hash(key1), clustered columnstore index)
as
select * from
(select *
, row_number() over (partition by key1, key2, key3, key4, key5 order by col1) RowNr
from my_table
) X
where X.RowNr = 1

One remark regarding the window function (row_number() over): make sure that the distribution key is the first column in the partition list, so the computation of the row number can happen on each compute node separately without re-distributing the data. And another remark regarding the “select * …”: usually it’s not good style to write SQL code this way, but in this case we really want to transfer all columns, and the DSQL pre-processor expands the * to the corresponding field list before sending the query to the compute nodes (this can be seen in the query log).

Running this query on the full set of 2.5 billion rows took about 1 hour 15 minutes to complete.

Analyzing the query details showed that the sub-query for finding the duplicate rows was easy for the PDW to solve. Wrapping the sub-select in a count returned after about 1 minute, giving the exact number of unique rows:

select count(*) from
(select *
, row_number() over (
partition by key1, key2, key3, key4, key5
order by col1) RowNr
from my_table
) X
where X.RowNr = 1

So most of the time for this query comes from copying the data from one table to another. While it’s usually a great idea to CTAS into a new table, it can be a bad choice for this scenario, where you have a lot of rows with many columns and relatively few duplicates.

Here is the approach, I finally used:

  1. Write the keys of the duplicate rows to a new temporary table #DupKeyStore.
  2. CTAS only the rows that also exist in #DupKeyStore from the original table into another temporary table #UniqueRowStore, removing the duplicates with the row_number approach from above.
  3. Delete all rows from the original table that match the keys in the #DupKeyStore table. After this step the original table does not contain any duplicates anymore - all rows that had duplicates are removed.
  4. Insert the rows of the table #UniqueRowStore back into the original table.

Here is the code in more detail.

Step 1

create table #DupKeyStore
with (LOCATION = USER_DB, distribution = hash(key1))
as
select
key1, key2, key3, key4, key5
, convert(nvarchar, key1)
+ '-' + convert(nvarchar, key2)
+ '-' + convert(nvarchar, key3)
+ '-' + convert(nvarchar, key4)
+ '-' + convert(nvarchar, key5) KeyLookup
from
(select key1, key2, key3, key4, key5
from mytable
group by key1, key2, key3, key4, key5
having count(*) > 1
) X

The query took 1 minute to complete on the full data set of 2.5 billion rows. The ‘KeyLookup’ column looks strange but I will explain this later in the process.

Step 2

create table #UniqueRowStore
with (LOCATION = USER_DB, distribution = hash(key1))
as
select * from
(select Q.*
, row_number() over (
partition by Q.key1, Q.key2, Q.key3, Q.key4, Q.key5
order by Q.key5) RowNr
from mytable Q
inner join #DupKeyStore K
on Q.key1 = K.key1
and Q.key2 = K.key2
and Q.key3 = K.key3
and Q.key4 = K.key4
and Q.key5 = K.key5
) X
where RowNr = 1

Again, this query took about 1.5 minutes to transfer the duplicate rows into a new table while removing duplicates. The main performance gain here is that the query does not need to transfer all rows, but only the duplicate ones. Another thing to notice here is that I’m using a heap for this intermediate table (not a columnstore index).

Step 3

SET XACT_ABORT ON

BEGIN TRANSACTION

delete from mytable
where
convert(nvarchar,key1)
+'-'+convert(nvarchar,key2)
+'-'+convert(nvarchar,key3)
+'-'+convert(nvarchar,key4)
+'-'+convert(nvarchar,key5)
IN (select KeyLookup from #DupKeyStore)

OK, this query needs some remarks. First, we use the concatenated KeyLookup column from above. I must admit that this looks somewhat ugly. It would be better to use a join in the delete statement, but this is currently not supported on the PDW. Another option would be to use EXISTS (select …) in the where condition, which would avoid the ugly concatenation of the keys. However, during my tests, this approach took a lot more time to complete, so I ended up with the statement above.

For my example, the query took about 1 minute to complete.

Step 4

insert into mytable
select key1, key2, key3, key4, key5, col1, col2, … col120
from #UniqueRowStore
COMMIT;

The insert was quite fast, only taking 30 seconds.

At a total of about 4 minutes, the 2.5 billion rows were cleaned of duplicates. Compared to the 1 hour 15 minutes of the first approach, this is a huge performance boost. And having more compute nodes would result in a much shorter query time for this task.

Before extending this solution to even more rows, consider the following topics:

  1. Check the number of duplicates first (table #DupKeyStore). If there are too many duplicates (I would say more than 5% of the total rows), consider the CTAS operation instead of the delete/insert approach.
  2. The delete/insert operations require tempdb space. For a larger number of rows, these statements should be split into batches, with one transaction per bunch of keys.
    Since delete top(nnn) is not supported on PDW, SET ROWCOUNT does not work here either, and you don’t have a @@ROWCOUNT, a good approach is to add a batch column (e.g. row_number() modulo something) to #DupKeyStore and to use this column for splitting into batches, as sketched after this list. Remember to wrap the delete and the insert into a single transaction. See this post by Stephan Köppen for details.
  3. CTAS operations writing to clustered columnstore index tables perform better with a higher resource class. This could also be considered when using workarounds like the one above.
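Regarding point 2, here is a minimal sketch of that batching approach (the batch count of 10 and the column name BatchNo are just assumptions; depending on your environment you may have to drive the loop over the batches from a client or ETL tool):

-- Step 1, extended with a batch column
create table #DupKeyStore
with (LOCATION = USER_DB, distribution = hash(key1))
as
select
key1, key2, key3, key4, key5
, convert(nvarchar, key1)
+ '-' + convert(nvarchar, key2)
+ '-' + convert(nvarchar, key3)
+ '-' + convert(nvarchar, key4)
+ '-' + convert(nvarchar, key5) KeyLookup
, row_number() over (order by key1) % 10 BatchNo
from
(select key1, key2, key3, key4, key5
from mytable
group by key1, key2, key3, key4, key5
having count(*) > 1
) X

-- Steps 3 and 4 are then executed once per batch (batch 0 shown), each batch in its own transaction;
-- #UniqueRowStore would carry the same BatchNo so the matching insert can be filtered in the same way
SET XACT_ABORT ON
BEGIN TRANSACTION
delete from mytable
where convert(nvarchar, key1)
+ '-' + convert(nvarchar, key2)
+ '-' + convert(nvarchar, key3)
+ '-' + convert(nvarchar, key4)
+ '-' + convert(nvarchar, key5)
in (select KeyLookup from #DupKeyStore where BatchNo = 0)
-- ... insert the corresponding rows from #UniqueRowStore for BatchNo = 0 here ...
COMMIT;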

So, that was it for today’s post. Ah, no, I’ve forgotten something:

drop table #DupKeyStore;
drop table #UniqueRowStore;