Search This Blog


Sunday, November 30, 2014

SQL Server Data Files on Azure

Azure | SQL Server 2014

With SQL Server 2014 it’s easy to move database files to the Azure Blog storage even if the SQL Server runs on premise. Azure Blob storage offers reliable, cheap and high available storage, which could be useful for “cold” data for example.

However, configuration is a little bit tricky, so I’m going to walk through this process step by step.

1. Create an Azure Blob store account and a container

Log into Azure and create a new storage account. For my example, I’m using “db4” as the name as shown below:


Next, I’m going to create a blob store container, which I name “data” here:


In order to access the container, we need the URL to the container ( in my example) and the storage key. The key can be obtained by clicking on “Manage Access Keys” on the bottom of the screen:


You can copy the key to the clipboard by clicking on the icon right besides the Primary Access Key box.


For the next task I’m using Windows Azure Storage Explorer (download here). Here you can add your storage account by pasting the access key into the storage account key input box:



2. Create a Shared Access Signature for the container

In Azure Storage explorer, select the container (data) and click on ‘Security’:


This brings up the following dialog. Make sure to select the permissions list, delete, read and write. After clicking on ‘Generate Signature’ a shared access signature is created. Copy this signature to the clipboard.



3. In SQL Server: Create a credential for the blob container

In SQL Server we’re using the create credential statement to create a credential for the blob store. Make sure to replace the secret key with the generated shared access signature from the last step (I just obfuscated the key by overwriting part of the key with ‘x’):

SECRET = 'sv=2014-02-14&sr=c&sig=c%2Fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3%3A00%3A00Z&se=2014-12-31T23%3A00%3A00Z&sp=rwdl'

If you like, you can check the credentials with “select * from sys.credentials”:



4. In SQL Server: Create a database that uses the blob container

The next step is simple. We just need to create a database using the container as its storage:

( NAME = testdb_dat,
( NAME = testdb_log,

You can create tables and load data in just the same way as you would do with a local database file. Azure Storage Explorer lists the database files that are created:



5. Optional: Register the blob store in SQL Server Management Studio

You can register the blob store in SQL Server Management Studio by creating a connection to Azure:


The “Access Key” is the key we created in the first step and can simply be copied into the account key field:


After connecting to the Azure blob store, Management Studio shows our container together with the database files:


Of course, when placing database files on Azure, a connection is needed to the blob store. If you don’t have this connection, you will not be able to access the database:




With SQL Server 2014 it is easy to put data files on an Azure storage account even for an on premise SQL Server. Use cases include

  • store data that is not heavily queried
  • store data that you want to secure in a geo-redundant way
  • enhance the local storage of a SQL Server
  • perform a single table backup to the cloud
  • … and many more

Sunday, November 9, 2014

Combining multiple tables with valid from/to date ranges into a single dimension

Dimensional modeling

Tracking historical changes within a dimension is a common task in data warehousing and well covered by Ralph Kimball’s slowly changing dimension (SCD) methods. In short, the SCD methods proposed by Ralph Kimball assume, that the source system (for example the ERP system) doesn’t keep historical versions of its data records, so changes need to be detected at the time when data is loaded into the warehouse. To keep historical values, versions of the master data records are created to memorize each state of the original record together with a valid from/to timestamp so that fact data can be joined to corresponding dimension data. But the valid from/to dates are usually not a good idea for joining fact data to the associated dimensions because this would result in range lookups (ETL) or date range (between) joins (in SQL or ELT). The surrogate key concepts offers a good solution here, by assigning a unique key (the surrogate key) to each version of a record. Now, this key can be used as a direct inner join from the fact table to its dimensions. This approach moves the time consuming process of resolving date ranges from query time to data loading time, so it has to be performed only once.  Query performance now benefits from the simplified link structure between the tables.

However, there may be some cases, where you find valid from/to dates in the original source system. In this case, the historical values are provided by the source system and usually it’s not necessary for the data warehouse to track the changes. While this sounds to be much more simple than the case with missing validity dates, it’s usually a challenging situation, especially when past records (and their valid from/to dates) may be modified. For example, a given date range could be split or merged or the from and to dates may shift. In either case, the surrogate keys of some fact rows would point to the “wrong” dimension record afterwards. So, for these cases you will need to periodically reload parts of your data warehouse (for example the last three months) or in some rare cases track the changes and adjust the surrogate keys of the fact tables. I’m saying rare cases as update-operations on fact tables that are tuned for high volume bulk loads and bulk queries are usually not a good idea, so you may want to implement a partition-wise recreation of the fact table (partition switch operations) which adds some complexity to the overall workload management.

However, after this intro my post today is about a situation where you have several linked tables in the source system, all with a valid from/to date. You may find this situation for example in SAP’s human resources tables where the properties of an employee are stored in so called info types which are independently versioned by valid from/to date ranges. In this post, I’m using a much more simplified scenario with the following 4 tables:


Organizational Unit (OrgUnit)


Company Car (Car)

The tables reflect a very simple human resources model of four tables, a base table Employee and three detail tables, all joined by the EmployeeNo-field. Each table may contain multiple versions of data and therefore each table has valid from/to fields to distinguish the versions. In my example I’m using the approach of an including ValidFrom and an excluding ValidTo. If you take a look at the first two rows of the OrgUnit table for example, this means that employee 100 was in the organizational unit “Delivery” from Jan 1, 2000 until December 31, 2013 and then starting with January 1 2014 in “PreSales”.
For each of the four tables, EmployeeNo together with ValidFrom forms a primary key.

One potential problem with such data is that since valid from/to are delivered from the source system, we need to make sure that these date ranges do not overlap. There might be scenarios where you need to deal with overlapping date ranges (for example, an employee may have none, one or many phone numbers at a given point in time, for example a cell phone and a land line). If you need to model such cases, many-to-many relations between fact and dimensional data may be a solution or you could move the information from the rows to columns of the new dimension table.  But for this example, I will keep it simple, so we don’t expect overlapping data in our source tables.

However, it’s always a good idea to check incoming data for consistency. The following query for example checks if there are overlapping date ranges in the Employee table by using window functions to retrieve the previous and next date boundaries:

select * from (
    , [ValidFrom]
    , [ValidTo]
    , lag([ValidTo],1) over (partition by [EmployeeNo] order by [ValidFrom]) PrevValidTo
    , lead([ValidFrom],1) over (partition by [EmployeeNo] order by [ValidFrom]) NextValidFrom
from Employee
) CheckDateRange
where (PrevValidTo is not null and PrevValidTo>ValidFrom) or (NextValidFrom is not null and NextValidFrom<ValidTo)

Please note, that this query does not check for gaps but only for overlapping date ranges in a table. If you like to detect gaps too, you’ll need to change the > and < in the where condition to a <>, i.e.

…where (PrevValidTo is not null and PrevValidTo<>ValidFrom) or (NextValidFrom is not null and NextValidFrom<>ValidTo)

Running this check on all the four tables from above shows that the data is consistent (no faulty rows returned from the query above).

Next, we can start to combine all of the four tables to a single dimension table. Let’s first show the final result:


The information of the four tables is now combined into a single table. Whenever an attribute changes this is reflected by the valid from/to date range. So for example, the first change for employee 100 was the company car at June 1, 2008.

So, how do we get there? At first, as the resulting valid from/to dates need to reflect all date ranges from all of the four tables, I start by collecting all of those dates:

ValidDates as
select EmployeeNo, ValidFrom as Date from Employee
select EmployeeNo, ValidTo from Employee
select EmployeeNo, ValidFrom from OrgUnit
select EmployeeNo, ValidTo from OrgUnit
select EmployeeNo, ValidFrom from Location
select EmployeeNo, ValidTo from Location
select EmployeeNo, ValidFrom from Car
select EmployeeNo, ValidTo from Car

This gives a list of all valid from/to-dates by employee from all of the four tables with duplicates being removed (since I used a union, not a union all). This is how the result looks like:


Next, I’m using this information to build the new valid from/to date ranges by using a window function to perform a lookup for the next date:

ValidDates as …
ValidDateRanges1 as
select EmployeeNo, Date as ValidFrom, lead(Date,1) over (partition by EmployeeNo order by Date) ValidTo
from ValidDates
ValidDateRanges as
select EmployeeNo, ValidFrom, ValidTo from ValidDateRanges1
where ValidTo is not null


Please note, that we already have the 10 resulting rows from the final result (see above) with the correct date ranges but without information from our four tables yet. So, now we can join the four tables with the date range table making sure to include the proper date range in the join condition. Here’s the resulting query:

ValidDates as …
, ValidDateRanges1 as …
, ValidDateRanges as …

    , E.Name
    , E.EmployeeID
    , isnull(OU.OrgUnit,'unknown') OrgUnit
    , isnull(L.Building,'unknown') Building
    , isnull(L.Room,'unknown') Room
    , isnull(C.CompanyCarId,'no company car') CompanyCarId
    , D.ValidFrom, D.ValidTo
from Employee E
inner join ValidDateRanges D
on E.EmployeeNo=D.EmployeeNo and E.ValidTo>D.ValidFrom and E.ValidFrom<D.ValidTo
left join OrgUnit OU
on OU.EmployeeNo=D.EmployeeNo and OU.ValidTo>D.ValidFrom and OU.ValidFrom<D.ValidTo
left join Location L
on L.EmployeeNo=D.EmployeeNo and L.ValidTo>D.ValidFrom and L.ValidFrom<D.ValidTo
left join Car C
on C.EmployeeNo=D.EmployeeNo and C.ValidTo>D.ValidFrom and C.ValidFrom<D.ValidTo

Since we made sure that no date ranges are overlapping within a single table, the joins can only return at most one row per employee and date range. To deal with gaps (for example in the car table) I used the isnull-function here to replace the gaps with a meaningful value (for example ‘no company car’ or ‘unknown’).

One final remark: In most cases, the source tables may contain many more fields that are not relevant for the data warehouse. However, the valid from/to information reflects changes within these fields too. The above approach would result in more than necessary versions in this case. However, as long as your dimension does not get too big, this is not really bad. On the opposite, if you later decide to include more information from the source tables, you already have properly distinguished versions for this information so you do not need to correct fact rows afterwards. This could even make it a good idea to include valid from/to dates from other associated tables even if no other information from those tables is yet being used in the data warehouse.

But if your dimension gets too big with this approach, you could always ‘clean’ unnecessary version using a simple group-by select with min(ValidFrom) and max(ValidTo) grouping by all other columns.

So, this showed how to combine multiple tables into a single dimension. As mentioned above, you still need to create surrogate keys and if you cannot eliminate the need for past data changes, you will also need to handle.

Sunday, September 14, 2014

Performance optimizations when loading many small files in SSIS

SQL Server Integration Services (SSIS)

In general, reading text files from SQL Server Integration Services (SSIS) is not a complicated task. The flat file source offers a user friendly interface to deal with separators, header lines and code pages/unicode. It can even determine the best data type for each column by scanning sample rows from the text file. And if your flat file is in XML-format you can use the XML source component in SSIS to read the file’s contents.

In many cases however you will not just have a single file but a directory containing many files instead. SSIS offers the for-each loop container in this case, to create a loop over all those files. The for-each loop container also has a friendly user friendly interface, so you can also easily solve this task in SSIS.

In this post I’d like to discuss a scenario with many small files to import and I will compare the for-each loop approach with a single data flow approach. Therefore, the task was to read about 2,100 files in JSON format into a single SQL Server database table. Each file has up to 80 KB in size containing from 1 to about 350 rows of data with an average of about 310 rows per file. So, in total I had to import about 650,000 rows of data from about 2,100 files. Doesn’t sound like a big deal so far.

First I created a for-each container with a single data flow:


The for-each container’s type is set to a “Foreach File Enumerator” scanning all files from a given directory:


The data flow simply reads the current file (I’m not going into details about the JSON-file here but some library like JSON.Net will do), does some minor changes (derived column) and writes the results into a SQL Server database table:


Running the package imported all of the files into my data table BUT… it took much longer than expected. In fact, it took 8 minutes. So whatever my expectation was before running the package, this was way too slow. First I checked if I made some mistakes. The OLE DB destination was set to use a fast table load with a table lock not checking any constraints. So, this was ok. The destination table was a heap with no primary key, so there wasn’t a problem with index reorganization.

Checking the progress log revealed that validation, pre- and post execute events are executed for each file. And since each file contained only a few rows, very small batches were committed in each loop causing the bad performance. Also, sending small batches to a table may be a bad idea depending on your table geometry. For example, when using clustered columnstore index tables, sending small batches results in asynchronous compression cycles as explained here.

If the flat files are actually CSV files, the best approach is to use the MULTIFLATFILE Connection Manager. Actually I must admit that I wasn’t aware it existed, until a colleague showed it to me. So here are the instructions to find this connection manager: When you right-click in the connection section of your package, a dialog appears to choose the connection type. Click on “New Connection…” here.


In the following dialog you can choose the MULTIFLATFILE Connection Manager. It is configured in exactly the same way as the standard flat file connection manager, but now you can specify multiple files or directories to scan.


But since my source files we’ not in CSV file format, I had to go for a different approach here. I replaced the for-each loop container with C# code inside the script component from above. Here is the corresponding code I used:

public override void CreateNewOutputRows()
    String path = "c:\\temp\\JSON_Import";
    foreach (String filename in Directory.EnumerateFiles(path))
        // process single file

Again, I’m not going into details about the actual code for importing the JSON file here, but the code above shows how simple a for each loop can be implemented within a script (of course you will want to add some error-handling and use a package variable for the import folder instead of the constant string here). The remaining parts of the data flow were left unchanged.

This time, importing all of the 2,100 files took 10 seconds, so this approach was about 48 times faster, than the for-each loop container.





In SSIS, writing data to a database table using a data flow gives the best performance if you have a large number of rows. However, importing many small files from a directory using the for-each loop container results in the opposite: many inserts with just a few rows each. If you encounter performance degradations in such a scenario, using the MULTIFLATFILE connection manager or, if not possible, converting the for-each loop container and the file read operation itself into a script task, may result in a much better performance. To improve performance even more, you could also try to parallelize the script tasks (for example the first script importing files 1, 3, 5 … and the second one importing files 2, 4, 6, …).

Sunday, August 31, 2014

Create table as select (CTAS) with “not null” column

PDW v1 | PDW 2012 | APS

CTAS (Create Table As Select) is a common way on the Parallel Data Warehouse (PDW) to transform one table into another table for example for calculations or for ELT (Extract Load Transform) processes.

The general syntax is quite simple and explained in the PDW help file:

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
        [ ( { column_name } [ ,...n ] ) ]
    WITH (
        DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE }
            [ , <CTAS_table_option> [ ,...n ] ] 
    AS <select_statement>

<CTAS_table_option> ::=
    | CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] )
    | PARTITION( partition_column_name RANGE [ LEFT | RIGHT
        FOR VALUES ( [ boundary_value [,...n] ] ) ) ]

<select_statement> ::=
    [ WITH <common_table_expression> [ ,...n ] ]
    SELECT <select_criteria>


For example, a simple CTAS statement to copy the contents of one table (FactSales) to another table (FactSalesTmp) may look like this:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *  from FactSales

As you can see from the CTAS syntax definition above it’s not possible specify column properties like NULL/NOT NULL or constraints. In the help file there is a note about this:

You cannot specify NULL | NOT NULL for the columns in the CTAS statement; the nullability property is derived from the columns and expressions in the SELECT results.

However, if you’re using CTAS to fill an intermediate stage table used for partition switching operations the switch out table has to have exactly the same definition as the target table and this also means that the NULL/NOT NULL setting on the column has to be identical. So how is the nullability property derived from the expressions in the select results? To show this behavior, I’m using a simple fact table with the following definition:

create table FactSales (
      DateKey int not null
    , ProductKey int not null
    , StoreKey int not null
    , Quantity int not null
    , Amount decimal (13,4) not null
    , Costs decimal(13,4)
) with (distribution = hash(DateKey))

Using the simple CTAS statement from above (the one copying FactSales to FactSalesTmp) preserves the nullability of all columns. As long as you refer to existing table columns, the nullability is preserved from that columns.

But what about calculations? Let’s try the following CTAS statement:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, Amount*0.8 AS StandardCosts
from FactSales

I simply added another column here using a simple calculation. You can check the resulting table structure by choosing the context menu ‘View Code’ in Data Tools or by running the following query:

select name, is_nullable from sys.columns where object_id=object_id('FactSalesTmp')


What you see, is that the calculation (StandardCosts) is understood to be nullable by the PDW. How can we mark this column as not null the CTAS statement?

Rewriting the CTAS from above using coalesce doesn’t solve the problem:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, coalesce(Amount*0.8,0) AS StandardCosts
from FactSales


However, using isnull does the trick:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, isnull(Amount*0.8,0) AS StandardCosts
from FactSales



The reason for this behavior is the different handling of the data type for coalesce and isnull. So, if you want to have an expression being marked as ‘not null’ in a CTAS statement, use the isnull-function.

Sunday, August 10, 2014

The 'KeyColumns' #0 has NullProcessing set to 'UnknownMember', but the dimension doesn't have UnknownMember set to 'Visible' or 'Hidden'

SQL Server 2005-2014


By default, SSAS provides an automatically created member with the name ‘unknown’ for each dimension. This member is intended to be the home for all facts that don’t fit to a real member (provided from the data source). In the example above, fact data that does not match any of the listed product categories could be mapped to the unknown-element.


I’m saying ‘could’ and not ‘is’ because the rules for mapping fact data to the unknown-element can be configured in the dimension properties.

But using this mechanism has certain drawbacks:

  • Processing time increases a lot if one row is encountered which has to be mapped to unknown
  • Only one text (for example ‘unknown’, can be configured) for all attributes in the dimension
  • Cases are hard to find since you don’t see this mapping in the underlying data warehouse tables


In a good data warehouse design, ETL takes care of the correct mapping of fact data to its dimensions by using surrogate keys. Each join is then an inner join. In order to do so, dimension tables usually contain a row for the unknown element. Frequently, the surrogate key –1 is used for this row.

But following this best practice results in the SSAS dimension showing two elements for ‘unknown’: The dimension entry and the automatically created entry.


So, why does SSAS dimension have this build-in unknown element by default? If we build almost all SSAS cubes based on a good data warehouse design where the dimensions maintain their own unknown element, there is no need for an automatically created unknown element anymore. But since SSAS cube’s wizard is intended to work with most types of data structures, the unknown element is there by default. Without having ETL enforced surrogate keys you just cannot be sure, that every fact row maps to its dimensions.

So, as explained above, we want to remove this default unknown element in almost all SSAS cube development projects. This can be easily done in the properties dialog of the dimension:


There are four available options for the unknown member:

visible The unknown-member of the dimension exists and is visible
hidden The unknown-member of the dimension exists and is hidden
none The unknown-member of the dimension does not exist
automatic null The unknown-member of the dimension exists und is visible, if there are violations of the referential integrity (fact keys not found in in dimension).

Again, if we take care of the surrogate keys in the ETL process, there is no need for a dimension unknown element at all. So, the best option is, to disable it (UnknownMember set to none).

However, because of other default settings, you’re getting the following error when trying to deploy your SSAS model afterwards:

The 'KeyColumns' #0 has NullProcessing set to 'UnknownMember', but the dimension doesn't have UnknownMember set to 'Visible' or 'Hidden'

If you’re getting this error for the first time, it might not be clear, where to fix it, especially since there are two changes that need to be made:


1. Adjusting the dimension key attribute

If you look at the dimension, you’ll notice the red hash-line below the key attribute of your dimension.


In order to fix this, you’ll need to open the properties of that attribute. Now navigate to the key columns setting and expand the view for each of the columns (since you may have more than one column bindings for the attribute’s key) as shown in the following screenshot:


Here you can set the NullProcessing option to “Error”. The default is “UnkownMember” but since we just disabled this, this causes the error.

Remember to do this for each key column of this attribute.


2. Adjust null processing in the dimension usage.

The second place to modify is the dimension mapping. Therefore open the cube and go to the dimension usage tab. You’ll notice the red hash-line at the attribute (in my example, the Product ID):


In order to fix this, click the button near to the attribute (the one labeled with “-“) to open this dialog:


Click advanced to edit the properties of the mapping:


In the lower part of the dialog, you can set the “Null Processing” from “UnknownMember” to “Error”.

After this change you should be able to deploy the cube again.

Sunday, July 6, 2014

How to collect your customers’ data without scaring them off

Business Intelligence
In a recent survey, Abe Selig from Software Advice studied the perception of users about companies collecting their data. Obviously we’re currently in a dilemma here: On the one hand companies are gathering more and more data in order to be competitive. Where in the past marketing and promotions, for example were based upon “averaging” the customer needs (“our customers like to buy big cars”) we’ve now moved to more individualized and more targeted promotions and product recommendations. The better we can address our customer, the better the chances are for selling our product or for building up a long term customer relationship. And since customers have individual differences, companies need to address them individually. This requires knowledge (data) about the customers. The customers on the other hand are sensitive about their data being collected and because of the recent news about stolen passwords, hacking of major services (for example the recent Heartbleed bug in OpenSSL) and the NSA eavesdropping they might even be scared if someone collects their data. In order to gain a better picture of this situation, Software Advice asked a random sample of 385 respondents in the US what they think about their data being collected.
"Our survey found that the majority of respondents were in favor of stronger laws and regulations governing the way companies use CRM or business intelligence platforms to collect data about their customers. Just as the experts who were interviewed in the piece explained, I think a lot of this stems from previous bad experiences people have had or bad experiences they've heard about."
-- Abe Selig, Business Intelligence researcher at Software Advice

The results are not much surprising, however they clearly show the dilemma mentioned above. About three quarters of the respondents assume or suspect companies to collect customer individual data. And if asked for more legal regulations, almost three quarters agree that there should be stronger laws and regulations in place to protect privacy.
And also not surprising, younger people (age 25-35) tend to be more open about their data being collected than older people (55-64). But in general, asking for the kind of collected information the respondents would be comfortable with, shows that this will continue to be a very controversial topic in the future. While gathering data about the current location would only be acceptable for about 5% of the respondents, even likes/dislike would only be accepted by about 12%.
ALL respondents comfort level of data collection

"The numbers show there's a clear split between businesses, which want to collect customer data and leverage it, and their customers, who are deeply concerned about their privacy. But I also think that transparency is key here, and our survey backs that up too. We found that if companies are more upfront and open about how and why they use customer data, the fear factor is greatly diminished."     
-- Abe Selig, Business Intelligence researcher at Software Advice

However, the survey also clearly shows what can be done to prevent customers from being scared off: Customers would feel a lot less bothered about their data being collected if they are informed about the data that is collected and why it is collected. If companies inform customers about the collected data and if they use it only to improve the customer’s overall shopping experience, acceptability could be improved. Most people I know like the recommendation function in modern web shops while it feels a little bit strange to see the product you’ve just searched for in a web shop on a totally different site’s advertising. How do they know I’ve searched for this product? So, transparency is important and the customers’ benefit has to be clearly noticeable.
You can find the complete survey here:

Sunday, June 8, 2014

Practical table partitioning on the Parallel Data Warehouse

APS/PDW 2012

This post is about table partitioning on the Parallel Data Warehouse (PDW). The topic itself is actually quite simple but there are some differences between the SMP SQL Server compared to the Parallel Data Warehouse.

On the SMP SQL Server table partitioning was important with large tables for two reasons:

  1. Query performance
  2. Workload management


For the SMP SQL Server, table partitioning allows queries and other table operations (like index rebuild) to be performed on multiple cores. Therefore table partitioning was done to improve query performance. However, the PDW architecture already stores larger tables (so called distributed tables) on each compute node by distributing it to multiple tables (so called distributions) on separate files in an optimal way for the available cores (currently 8 distributions per compute node). Therefore, when working on a PDW query performance usually isn’t the main reason for us to use table partitioning. But the second reason, workload management, still applies on the PDW. For example, when loading data it’s often useful to first load into a stage table, merge new and old data into a new partition and then switch that partition to the final table. So partitioning is still important on the PDW.

Creating a partitioned table on the PDW is a little bit easier compared to the SMP SQL Server as you don’t need (and don’t see) the partition schema or partition function. The following statement is an example of creating a partitioned table:

CREATE TABLE [dbo].[MyTable1] (
    id int
WITH (DISTRIBUTION = replicate,  PARTITION (id range right FOR VALUES (10,20,30,40)));

In this query and throughout the remaining blog post, I’m only referring to a range right partition function. This is my preferred option as I think it’s more intuitive although both ways are almost identical and all partitioning is usually handled by automated tasks. So it isn’t really important. However, range right means that the partition boundary is in the same partition as the data to the right of the boundary (excluding the next boundary). So for a range right partition function, the left boundary is included while the right boundary is not, for example a partition with boundaries 10 and 20 contains data with values greater or equal to 10 and less than 20 (for integer values: 10, 11, 12, … 18, 19).

By specifying four boundaries in the create table statement from above, we have actually created five partitions as shown in the following table:

Partition Number Range From Range To Formula for id
1   10 id < 10
2 10 20 10 ≤ id < 20
3 20 30 20 ≤ id < 30
4 30 40 30 ≤ id < 40
5 40   40 ≤ id

On the PDW, the partition number is important for switch operations as well as for index rebuild operations. For example, in order to perform an index rebuild on partition 3 you write run this code:


The product documentation (aps.chm) contains a query to return “…the partition numbers, boundary values, boundary value types, and rows per boundary for a partitioned table”:

SELECT sp.partition_number, prv.value AS boundary_value,
lower( AS boundary_value_type, sp.rows
FROM sys.tables AS st
JOIN sys.indexes AS si
    ON st.object_id = si.object_id AND si.index_id <2
JOIN sys.partitions AS sp
    ON sp.object_id = st.object_id AND sp.index_id = si.index_id
JOIN sys.partition_schemes AS ps
    ON ps.data_space_id = si.data_space_id
JOIN sys.partition_range_values AS prv
    ON prv.function_id = ps.function_id
JOIN sys.partition_parameters AS pp
    ON pp.function_id = ps.function_id
JOIN sys.types AS sty
    ON sty.user_type_id = pp.user_type_id
        AND prv.boundary_id = sp.partition_number
WHERE st.object_id =
    (SELECT object_id
     FROM sys.objects
     WHERE name = 'MyTable1')
ORDER BY sp.partition_number

Let’s try the query with our table from above. Here is the output:


Some observations may be confusing here. The first thing is that each partition is reported to contain 200 rows although we have just created the table and therefore expect the table to be empty. However, the reported rows are taken from the sys.partitions system view. In the documentation for the sys.partitions view you find the following remark about the number of rows:

Approximate average number of rows in each table partition. To calculate this value, SQL Server PDW divides the number of rows in the table by the number of partitions in the table.

SQL Server PDW uses statistics, which might be out-of-date, to determine the total number of rows. The statistics are from the most recent run of UPDATE STATISTICS on the table. If UPDATE STATISTICS has not been run on the table, the statistics won’t exist, and SQL Server PDW will use 1000 as the default total number of rows. To display the number of rows in each partition within each distribution, use DBCC PDW_SHOWPARTITIONSTATS (SQL Server PDW).

So, the number of rows is just estimated here, and since we don’t have created statistics for the table, PDW assumes the table to contain 1000 rows. But wait, 1000 rows divided by 4 partitions gives 250, not 200, right? Well, remember that we actually have 5 partitions although the meta data query from above only lists 4. I’m getting back on this soon.

Statistics are easy to create, so let’s do this first:

create statistics stat_id on MyTable1(id)

Here is the result when running the meta data query again:


So, now the number of rows seems to be correct. But be carefull: This still is only an approximation and you cannot expect the approximation to be accurate.

The other thing to be puzzled about with the output of the meta data query may be, that it only reports 4 partitions although we first figured out, that there should be 5 partitions. And also the boundary value may be confusing. For partition number one, we found out that it contains all data rows with id less than 10 (not equal). So the boundary value from the output is the excluded right boundary of our range right partitioning – confusing.

Stephan K√∂ppen already posted some useful queries for the PDW (see his post here). Using his partition query gives a much better result. I made some minor changes to the query and for this blog post, here’s the version I’m using here:

create table #Partitions
with (LOCATION = USER_DB, distribution=replicate)
p.partition_number PartitionNr
, cast(coalesce(lag(r.value,1) over (order by p.partition_number),-2147483648) as int) RangeFromIncluding
, cast(coalesce(r.value,2147483647) as int) AS [RangeToExcluding]
FROM sys.tables AS t
JOIN sys.indexes AS i ON t.object_id = i.object_id
JOIN sys.partitions AS p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN  sys.partition_schemes AS s ON i.data_space_id = s.data_space_id
JOIN sys.partition_functions AS f ON s.function_id = f.function_id
LEFT JOIN sys.partition_range_values AS r ON f.function_id = r.function_id and r.boundary_id = p.partition_number
WHERE i.type <= 1 

create table #PartitionData
with (LOCATION = USER_DB, distribution=replicate)
select id, count(*) [rows] from MyTable1 group by id

-- show partitions and number of rows
select PS.PartitionNr, PS.RangeFromIncluding, PS.RangeToExcluding, coalesce(Sum([rows]),0) [rows]
from #Partitions PS left join #PartitionData GT on PS.RangeFromIncluding<= and PS.RangeToExcluding>
group by PS.PartitionNr, PS.RangeFromIncluding, PS.RangeToExcluding

drop table #Partitions
drop table #PartitionData


If you’re only interested in the partitions, the blue part of the query is enough. The query uses the lag windows function to retrieve the lower boundary. The remaining query is used to obtain the exact number of rows for each partition. Please note, that the boundary information resulting from my modifications is only valid for a range right partition function. Here is the result:


As you see, this corresponds exactly to the five partitions from the first table above. The upper boundary of partition 5 should be increased by one to be 100% correct but this would conflict with the maximum integer value. If you like, just return null for the lower boundary of partition 1 and the upper boundary of partition 5 and observe this in the comparison with the existing data.

Also when reading the above query part that is printed in black you should adopt the method for counting the rows per partition to your needs. The method I’m using here proved to work fine for discrete values (integer ids). Since we usually partition by an integer column (for example a data written as yyyymmdd, 20140501 for May 1, 2014) this approach works fine for most of our workloads.

Next thing of interest is the partition number. As I wrote earlier in this post, the partition number is used for partition switch operations or for example for index rebuilds. It’s important to understand that the partition number is always a consecutive range of numbers starting with the number one. Even if you merge two partitions into one, the number is still consecutive.

For example, let’s merge partitions 3 and 4. In the merge partition statement we only need to specify the boundary. In a certain sense, this boundary is removed to form the new partition. In our case, partitions 3 and 4 share the boundary value 30, so the merge statement looks like this:


Here is the result using the modified meta data table from above:


As you can see, the partition number is still consecutive and the partition ranging from 40 to infinity now has the number 4 instead of 5.

If you specify a boundary, that doesn’t exist you’ll get an error message:


A distributed query failed: Node(s):[201001-201002]The specified partition range value could not be found.

Splitting a partition works very similar to a merge. Again, you can think of a split as of inserting a new boundary. For example, let’s split at the value 35 (which is in partition 3):


Here’s the result:


Again, the partition numbering is still consecutive and the former partition 4 now becomes partition 5 because we split partition 3.

Now let’s validate our boundaries by inserting some lines of data:

insert into MyTable1 values(0)


As expected, the value 0 is written to partition 1 as –infinity ≤ 0 < 10.

truncate table MyTable1
insert into MyTable1 values(20)


The value 20 goes to partition 3 as 20 ≤ 20 < 35.

Now we’re going to insert 5 values which should fit the constraints for partition 4:

truncate table MyTable1
insert into MyTable1 values(35)
insert into MyTable1 values(36)
insert into MyTable1 values(37)
insert into MyTable1 values(38)
insert into MyTable1 values(39)


All of these values satisfy the constraint 35 ≤ x < 40 and therefore all the values are written to partition 4.

Ok, these were just some examples to see how data is written to the different partitions of our table.

To complete this post, I finally like to show a partition switch. Therefore we need to create a table of the same structure:

  • same columns, same data types, same nullable settings (take care when creating computed columns in a CTAS statement)
  • same table geometry (heap, clustered index, clustered columnstore index)
  • same distribution method (both tables replicated or distributed by the same key)
  • same indexes and constraints
  • partitioned by the same column (but the partitions itself may differ)


Generating the script for our table after the merge/split operation gives this result:

CREATE TABLE [dbo].[MyTable1] (
    [id] int NULL

Now, replacing MyTable1 with myTable2 we can create a table of exactly the same structure:

CREATE TABLE [dbo].[MyTable2] (
    id int
WITH (DISTRIBUTION = replicate,  PARTITION (id range right FOR VALUES (10, 20, 35, 40)));

We can now switch the 5 rows of data from above. Since they are all stored in partition 4 we can switch them using this statement:

alter table MyTable1 switch partition 4 to MyTable2 partition 4

This is where we finally needed the partition number. We can now check the rows in table MyTable2:

select * from MyTable2


As you can see, all 5 rows are moved (switched) over to table MyTable2.

A common scenario for loading data into the appliance is to first load new data into a stage table of the same layout as the final fact table. Then our meta data query from above helps, by running it against both tables, using the boundaries as the join conditions. This results in the source partition and matching destination partition together with the number of rows in each of them. For example, if your workload contains only new or updated data you can now load the data as follows:

  • If the source partition contains no rows at all, quit
  • If the destination partition is empty switch the source partition directly into the destination partition and quit
  • Otherwise blend/merge the source and destination partition data into a new table with corresponding boundary values (this requires three partitions), then move the existing data out of the destination partition and finally move the merged data into the destination partition.


Summary: Partitioning on the PDW is still useful for workload management but usually not to increase query performance. With the query presented here, it’s quite easy to find the partitions together with their boundaries and number of contained rows. This information can be used to decide about a good partition switching strategy.