
Monday, January 26, 2015

T-SQL Last Non Empty

SQL Server 2012 | SQL Server 2014 | PDW/APS 2012

Recently we needed to calculate something like a ‘last non empty’ value in a T-SQL query. This blog post is about the solution we ended up with as an alternative to the classic pattern involving sub-queries.

To illustrate the task let’s first look at some data:

image

The extract shows a contract table with some gaps. The task is to fill each gap with the last non empty contract value for the same ID. So, here is the final result:

image

As you can see, apart from the first days for ID 2, which don't have a preceding value, all gaps have been filled.
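For reference, here is a minimal sketch of a table matching this structure (the table and column names are taken from the queries below; the sample values are made up purely for illustration):

create table dbo.Contracts (
      ID int not null
    , Date date not null
    , Contract varchar(20) null     -- NULL marks a gap to be filled
)

insert into dbo.Contracts (ID, Date, Contract)
values
      (1, '2014-07-15', 'C100')
    , (1, '2014-07-16', NULL)
    , (1, '2014-07-17', NULL)       -- should be filled with 'C100'
    , (2, '2014-07-15', NULL)       -- no preceding value, stays empty
    , (2, '2014-07-16', NULL)
    , (2, '2014-07-17', 'C200')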

In order to fill the gaps using T-SQL window functions, the idea is to calculate the number of steps we need to go back for each null value to catch the corresponding last value. In the following screenshot, I’m showing this value as the last column:

image

For example, for ID 1, Date 2014-07-17, we have to go two rows back (to 2014-07-15) to get the last non empty value. For the first two dates of ID 2 we also get a lag value, but there is no corresponding row to go back to. Looking at the lag column, it looks somewhat like a row_number over the rows with a contract value of NULL. Actually, looking at ID 2, there may be more than one gap (run of NULL values), so it's more like a row number over groups of contracts. To determine those groups, we need to find changes in the contract over time. So let's start with this first.

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from  [dbo].[Contracts])

select * from C1 order by ID, Date

image

Using the lag window-function I added a new column ‘ContractChange’ that gives 1 whenever the contract changes and 0 otherwise. The next step is to calculate a running total of the column to build up groups of contracts:

with
C1 as…
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)

select * from C2 order by ID, Date

image

The new column ‘ContractGroup’ now calculates a value that increments whenever the contract changes. We can now calculate a row_number using the ContractGroup column as the partition:

with
C1 as…
C2 as…
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select * from C3 order by ID, Date

 

image

And actually, the LastContractLag column already holds the offset we need to pass to the lag function to reach the last non empty value. So here is the final query (including the intermediate calculations from above):

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from  [dbo].[Contracts])
,
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)
,
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select ID, Date, Contract
,iif(Contract is null, lag(Contract,LastContractLag) over (partition by id order by Date),Contract) ContractLastNonEmpty
from C3
order by ID, Date

The output of this query is shown above (final result). And again this is a good example of the power of window functions.

Conclusion

In our situation, this solution performed much better than a sub-query approach, but depending on the table layout and the amount of data, other approaches may still be better, so you may want to try different patterns for solving this problem.
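For comparison, the classic sub-query pattern mentioned above might look roughly like this (one possible variant using OUTER APPLY; table and column names as in the example above):

select C.ID, C.Date
    , coalesce(C.Contract, LastNonEmpty.Contract) ContractLastNonEmpty
from dbo.Contracts C
outer apply (
    select top (1) C2.Contract
    from dbo.Contracts C2
    where C2.ID = C.ID
      and C2.Date < C.Date
      and C2.Contract is not null
    order by C2.Date desc
) LastNonEmpty
order by C.ID, C.Date

Whether this or the window-function version performs better depends on indexing and data volume, so it's worth testing both against your own data.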

Sunday, December 21, 2014

Using OpenStreetMap data in PowerQuery

PowerBI | PowerQuery

This blog post is about using XML data from OpenStreetMap in PowerQuery for example to create a local geocoding database. PowerQuery is just perfect for this purpose as it allows us to interactively shape the input data to the desired format.

First, we need to download some data from OpenStreetMap. You may go to the OpenStreetMap webpage, zoom to the desired area and then hit the export button. Alternatively, download links are provided for full country or region files after hitting ‘Export’.

image

The downloaded OSM file is in XML format and starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.3.3 (733 thorn-01.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">

Now, open Excel and create a new PowerQuery from an XML file:

image

Next, select the downloaded file and hit Enter. PowerQuery loads with the file contents:

image

The data we’re interested in is stored in the node table. So we can now click on the ‘Table’ link in the node column:

image

We now have to expand the ‘tag’ information:

image

Since we’re not interested in all the data from the file, we limit the information to country, city, postcode, street and housenumber by using the filter function:

image

We can now delete all columns but the following:

  • tag.Attribute:k
  • tag.Attribute:v
  • Attribute:id
  • Attribute:lat
  • Attribute:lon

image

image

This is pretty much the data we’d like to obtain from the OSM file. However, we still need to pivot the information from the rows into columns to see city, postcode etc. side by side with the geo coordinates. To do so, we select the column tag.Attribute:k and use the Pivot function from the Transform ribbon as shown below:

image

Finally, we can rename the columns and filter for rows where country is not null:

image

The list now contains addresses (city, postcode, street, house number) together with the corresponding geographical coordinates (latitude, longitude).

For your reference, this is the script we created so far:

let
    Source = Xml.Tables(File.Contents("C:\Temp\download.osm")),
    node = Source{0}[node],
    #"Changed Type" = Table.TransformColumnTypes(node,{{"Attribute:id", Int64.Type}, {"Attribute:visible", type logical}, {"Attribute:version", Int64.Type}, {"Attribute:changeset", Int64.Type}, {"Attribute:timestamp", type datetime}, {"Attribute:user", type text}, {"Attribute:uid", Int64.Type}, {"Attribute:lat", type number}, {"Attribute:lon", type number}}),
    #"Expand tag" = Table.ExpandTableColumn(#"Changed Type", "tag", {"Attribute:k", "Attribute:v"}, {"tag.Attribute:k", "tag.Attribute:v"}),
    #"Filtered Rows" = Table.SelectRows(#"Expand tag", each ([#"tag.Attribute:k"] = "addr:city" or [#"tag.Attribute:k"] = "addr:country" or [#"tag.Attribute:k"] = "addr:housenumber" or [#"tag.Attribute:k"] = "addr:postcode" or [#"tag.Attribute:k"] = "addr:street")),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Attribute:visible", "Attribute:version", "Attribute:changeset", "Attribute:timestamp", "Attribute:user", "Attribute:uid"}),
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[#"tag.Attribute:k"]), "tag.Attribute:k", "tag.Attribute:v"),
    #"Renamed Columns" = Table.RenameColumns(#"Pivoted Column",{{"Attribute:id", "id"}, {"Attribute:lat", "latitude"}, {"Attribute:lon", "longitude"}, {"addr:city", "city"}, {"addr:postcode", "postcode"}, {"addr:street", "street"}, {"addr:country", "country"}, {"addr:housenumber", "housenumber"}}),
    #"Filtered Rows1" = Table.SelectRows(#"Renamed Columns", each ([country] = "DE"))
in
    #"Filtered Rows1"

The result can then be loaded to a PowerPivot data model (click on File, then Load to…):

image

After loading the data into the model, you can for example use PowerMaps to visualize the addresses on a map:

image

Of course, you could have achieved a similar result using the geocoding functionality of PowerMaps. To do so, you simply need to add another column for the full address, like this:

image

The result is pretty much the same, but geocoding many addresses at this level of detail takes a significant amount of time compared to the direct approach from above, which doesn’t need to call a geocoding web service because we already provided latitude and longitude.

image

So, if you have to geocode a lot of addresses, the OpenStreetMap XML format can be a big help, and PowerQuery makes it easy to load this data into a PowerPivot data model. For larger regions, make sure you have enough memory available on the local machine (and use the 64-bit version of Microsoft Excel).

Sunday, November 30, 2014

SQL Server Data Files on Azure

Azure | SQL Server 2014

With SQL Server 2014 it’s easy to move database files to Azure Blob storage, even if the SQL Server runs on premise. Azure Blob storage offers reliable, cheap and highly available storage, which can be useful for “cold” data, for example.

However, configuration is a little bit tricky, so I’m going to walk through this process step by step.

1. Create an Azure Blob store account and a container

Log into Azure and create a new storage account. For my example, I’m using “db4” as the name as shown below:

image

Next, I’m going to create a blob store container, which I name “data” here:

image

In order to access the container, we need the URL of the container (db4.blob.core.windows.net/data in my example) and the storage key. The key can be obtained by clicking on “Manage Access Keys” at the bottom of the screen:

image

You can copy the key to the clipboard by clicking on the icon right beside the Primary Access Key box.

image

For the next task I’m using Windows Azure Storage Explorer (download here). Here you can add your storage account by pasting the access key into the storage account key input box:

image

 

2. Create a Shared Access Signature for the container

In Azure Storage explorer, select the container (data) and click on ‘Security’:

image

This brings up the following dialog. Make sure to select the list, delete, read and write permissions. After clicking on ‘Generate Signature’, a shared access signature is created. Copy this signature to the clipboard.

image

 

3. In SQL Server: Create a credential for the blob container

In SQL Server we’re using the create credential statement to create a credential for the blob store. Make sure to replace the secret key with the generated shared access signature from the last step (I just obfuscated the key by overwriting part of the key with ‘x’):

CREATE CREDENTIAL [https://db4.blob.core.windows.net/data]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2014-02-14&sr=c&sig=c%2Fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3%3A00%3A00Z&se=2014-12-31T23%3A00%3A00Z&sp=rwdl'

If you like, you can check the credentials with “select * from sys.credentials”:

image

 

4. In SQL Server: Create a database that uses the blob container

The next step is simple. We just need to create a database using the container as its storage:

CREATE DATABASE testdb
ON
( NAME = testdb_dat,
FILENAME = 'https://db4.blob.core.windows.net/data/TestData.mdf' )
LOG ON
( NAME = testdb_log,
FILENAME = 'https://db4.blob.core.windows.net/data/TestLog.ldf')

You can create tables and load data in just the same way as you would do with a local database file. Azure Storage Explorer lists the database files that are created:

image
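As a quick smoke test, something like the following should work against the new database (the table name and values here are made up for illustration):

use testdb

create table dbo.SmokeTest (
      Id int not null
    , Payload varchar(50) null
)

insert into dbo.SmokeTest (Id, Payload) values (1, 'stored in Azure blob storage')

select * from dbo.SmokeTest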

 

5. Optional: Register the blob store in SQL Server Management Studio

You can register the blob store in SQL Server Management Studio by creating a connection to Azure:

image

The “Access Key” is the key we created in the first step and can simply be copied into the account key field:

image

After connecting to the Azure blob store, Management Studio shows our container together with the database files:

image

Of course, when placing database files on Azure, a connection to the blob store is needed. Without this connection, you will not be able to access the database:

image

 

Summary

With SQL Server 2014 it is easy to put data files on an Azure storage account even for an on premise SQL Server. Use cases include

  • store data that is not heavily queried
  • store data that you want to secure in a geo-redundant way
  • enhance the local storage of a SQL Server
  • perform a single table backup to the cloud
  • … and many more

Sunday, November 9, 2014

Combining multiple tables with valid from/to date ranges into a single dimension

Dimensional modeling

Tracking historical changes within a dimension is a common task in data warehousing and well covered by Ralph Kimball’s slowly changing dimension (SCD) methods. In short, the SCD methods proposed by Ralph Kimball assume that the source system (for example the ERP system) doesn’t keep historical versions of its data records, so changes need to be detected at the time when data is loaded into the warehouse. To keep historical values, versions of the master data records are created to capture each state of the original record together with a valid from/to timestamp, so that fact data can be joined to the corresponding dimension data. But the valid from/to dates are usually not a good idea for joining fact data to the associated dimensions because this would result in range lookups (ETL) or date range (between) joins (in SQL or ELT). The surrogate key concept offers a good solution here, by assigning a unique key (the surrogate key) to each version of a record. Now, this key can be used as a direct inner join from the fact table to its dimensions. This approach moves the time-consuming process of resolving date ranges from query time to data loading time, so it has to be performed only once. Query performance now benefits from the simplified link structure between the tables.

However, there may be some cases where you find valid from/to dates in the original source system. In this case, the historical values are provided by the source system and usually it’s not necessary for the data warehouse to track the changes. While this sounds much simpler than the case with missing validity dates, it’s usually a challenging situation, especially when past records (and their valid from/to dates) may be modified. For example, a given date range could be split or merged, or the from and to dates may shift. In either case, the surrogate keys of some fact rows would point to the “wrong” dimension record afterwards. So, for these cases you will need to periodically reload parts of your data warehouse (for example the last three months) or, in some rare cases, track the changes and adjust the surrogate keys of the fact tables. I say rare cases because update operations on fact tables that are tuned for high-volume bulk loads and bulk queries are usually not a good idea, so you may want to implement a partition-wise recreation of the fact table (partition switch operations), which adds some complexity to the overall workload management.

However, after this intro, my post today is about a situation where you have several linked tables in the source system, all with valid from/to dates. You may find this situation, for example, in SAP’s human resources tables, where the properties of an employee are stored in so-called info types that are independently versioned by valid from/to date ranges. In this post, I’m using a much simpler scenario with the following four tables:

Employee
image

Organizational Unit (OrgUnit)
image

Location
image

Company Car (Car)
image

The tables reflect a very simple human resources model of four tables: a base table Employee and three detail tables, all joined by the EmployeeNo field. Each table may contain multiple versions of data and therefore each table has valid from/to fields to distinguish the versions. In my example, I’m using the approach of an inclusive ValidFrom and an exclusive ValidTo. If you take a look at the first two rows of the OrgUnit table, for example, this means that employee 100 was in the organizational unit “Delivery” from January 1, 2000 until December 31, 2013 and then, starting with January 1, 2014, in “PreSales”.
For each of the four tables, EmployeeNo together with ValidFrom forms a primary key.
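For reference, here is a minimal sketch of the table structures assumed in this example (the column names follow the queries below; the data types are assumptions):

create table Employee (
      EmployeeNo int not null
    , Name nvarchar(100) not null
    , EmployeeID nvarchar(20) not null
    , ValidFrom date not null
    , ValidTo date not null
    , primary key (EmployeeNo, ValidFrom)
)

create table OrgUnit (
      EmployeeNo int not null
    , OrgUnit nvarchar(100) not null
    , ValidFrom date not null
    , ValidTo date not null
    , primary key (EmployeeNo, ValidFrom)
)

create table Location (
      EmployeeNo int not null
    , Building nvarchar(50) not null
    , Room nvarchar(50) not null
    , ValidFrom date not null
    , ValidTo date not null
    , primary key (EmployeeNo, ValidFrom)
)

create table Car (
      EmployeeNo int not null
    , CompanyCarId nvarchar(20) not null
    , ValidFrom date not null
    , ValidTo date not null
    , primary key (EmployeeNo, ValidFrom)
)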

One potential problem with such data is that, since the valid from/to dates are delivered by the source system, we need to make sure that these date ranges do not overlap. There might be scenarios where you need to deal with overlapping date ranges (an employee may have zero, one or many phone numbers at a given point in time, for example a cell phone and a land line). If you need to model such cases, many-to-many relations between fact and dimensional data may be a solution, or you could move the information from the rows to columns of the new dimension table. But for this example I’ll keep it simple, so we don’t expect overlapping data in our source tables.

However, it’s always a good idea to check incoming data for consistency. The following query for example checks if there are overlapping date ranges in the Employee table by using window functions to retrieve the previous and next date boundaries:

select * from (
select
    EmployeeNo
    , [ValidFrom]
    , [ValidTo]
    , lag([ValidTo],1) over (partition by [EmployeeNo] order by [ValidFrom]) PrevValidTo
    , lead([ValidFrom],1) over (partition by [EmployeeNo] order by [ValidFrom]) NextValidFrom
from Employee
) CheckDateRange
where (PrevValidTo is not null and PrevValidTo>ValidFrom) or (NextValidFrom is not null and NextValidFrom<ValidTo)

Please note that this query does not check for gaps but only for overlapping date ranges within a table. If you’d like to detect gaps too, you’ll need to change the > and < in the where condition to <>, i.e.

…where (PrevValidTo is not null and PrevValidTo<>ValidFrom) or (NextValidFrom is not null and NextValidFrom<>ValidTo)

Running this check on all four tables from above shows that the data is consistent (the query returns no faulty rows).

Next, we can start to combine all of the four tables to a single dimension table. Let’s first show the final result:

image

The information from the four tables is now combined into a single table. Whenever an attribute changes, this is reflected by the valid from/to date range. For example, the first change for employee 100 was the company car on June 1, 2008.

So, how do we get there? First, as the resulting valid from/to dates need to reflect all date ranges from all four tables, I start by collecting all of those dates:

with
ValidDates as
(
select EmployeeNo, ValidFrom as Date from Employee
union
select EmployeeNo, ValidTo from Employee
union
select EmployeeNo, ValidFrom from OrgUnit
union
select EmployeeNo, ValidTo from OrgUnit
union
select EmployeeNo, ValidFrom from Location
union
select EmployeeNo, ValidTo from Location
union
select EmployeeNo, ValidFrom from Car
union
select EmployeeNo, ValidTo from Car
)

This gives a list of all valid from/to dates by employee from all four tables, with duplicates removed (since I used a union, not a union all). This is what the result looks like:

image

Next, I’m using this information to build the new valid from/to date ranges by using a window function to perform a lookup for the next date:

with
ValidDates as …
,
ValidDateRanges1 as
(
select EmployeeNo, Date as ValidFrom, lead(Date,1) over (partition by EmployeeNo order by Date) ValidTo
from ValidDates
)
,
ValidDateRanges as
(
select EmployeeNo, ValidFrom, ValidTo from ValidDateRanges1
where ValidTo is not null
)

image

Please note that we already have the 10 resulting rows from the final result (see above) with the correct date ranges, but without the information from our four tables yet. So now we can join the four tables to the date-range table, making sure to include the proper date range in the join conditions. Here’s the resulting query:

with
ValidDates as …
, ValidDateRanges1 as …
, ValidDateRanges as …

select      
      E.EmployeeNo
    , E.Name
    , E.EmployeeID
    , isnull(OU.OrgUnit,'unknown') OrgUnit
    , isnull(L.Building,'unknown') Building
    , isnull(L.Room,'unknown') Room
    , isnull(C.CompanyCarId,'no company car') CompanyCarId
    , D.ValidFrom, D.ValidTo
from Employee E
inner join ValidDateRanges D
on E.EmployeeNo=D.EmployeeNo and E.ValidTo>D.ValidFrom and E.ValidFrom<D.ValidTo
left join OrgUnit OU
on OU.EmployeeNo=D.EmployeeNo and OU.ValidTo>D.ValidFrom and OU.ValidFrom<D.ValidTo
left join Location L
on L.EmployeeNo=D.EmployeeNo and L.ValidTo>D.ValidFrom and L.ValidFrom<D.ValidTo
left join Car C
on C.EmployeeNo=D.EmployeeNo and C.ValidTo>D.ValidFrom and C.ValidFrom<D.ValidTo

Since we made sure that no date ranges overlap within a single table, each join can return at most one row per employee and date range. To deal with gaps (for example in the Car table), I used the isnull function to replace them with a meaningful value (for example ‘no company car’ or ‘unknown’).

One final remark: in most cases, the source tables will contain many more fields that are not relevant for the data warehouse. However, the valid from/to information reflects changes within these fields too, so the above approach would produce more versions than necessary. As long as your dimension does not get too big, this is not really a problem. On the contrary, if you later decide to include more information from the source tables, you already have properly distinguished versions for this information, so you do not need to correct fact rows afterwards. This can even make it a good idea to include valid from/to dates from other associated tables even if no other information from those tables is being used in the data warehouse yet.

But if your dimension does get too big with this approach, you can always ‘clean up’ unnecessary versions using a simple group-by select with min(ValidFrom) and max(ValidTo), grouping by all other columns, as sketched below.
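Such a cleanup might look roughly like this (DimEmployee is an assumed name for the combined dimension table; note that a plain group-by also merges identical versions that are not adjacent in time, which may or may not be acceptable):

select EmployeeNo, Name, EmployeeID, OrgUnit, Building, Room, CompanyCarId
    , min(ValidFrom) ValidFrom
    , max(ValidTo) ValidTo
from DimEmployee
group by EmployeeNo, Name, EmployeeID, OrgUnit, Building, Room, CompanyCarId
order by EmployeeNo, ValidFrom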

So, this showed how to combine multiple tables into a single dimension. As mentioned above, you still need to create surrogate keys, and if you cannot eliminate the need for changes to past data, you will also need to handle those changes (for example by periodically reloading the affected parts of the fact tables).

Sunday, September 14, 2014

Performance optimizations when loading many small files in SSIS

SQL Server Integration Services (SSIS)

In general, reading text files with SQL Server Integration Services (SSIS) is not a complicated task. The flat file source offers a user-friendly interface to deal with separators, header lines and code pages/Unicode. It can even determine the best data type for each column by scanning sample rows from the text file. And if your flat file is in XML format, you can use the XML source component in SSIS to read the file’s contents.

In many cases, however, you will not have just a single file but a directory containing many files. SSIS offers the for-each loop container for this case, to loop over all those files. The for-each loop container also has a friendly user interface, so you can easily solve this task in SSIS as well.

In this post I’d like to discuss a scenario with many small files to import, and I will compare the for-each loop approach with a single data flow approach. The task was to read about 2,100 files in JSON format into a single SQL Server database table. Each file is up to 80 KB in size and contains from 1 to about 350 rows of data, with an average of about 310 rows per file. So, in total I had to import about 650,000 rows of data from about 2,100 files. Doesn’t sound like a big deal so far.

First I created a for-each container with a single data flow:

image

The for-each container’s type is set to a “Foreach File Enumerator” scanning all files from a given directory:

image

The data flow simply reads the current file (I’m not going into details about the JSON-file here but some library like JSON.Net will do), does some minor changes (derived column) and writes the results into a SQL Server database table:

image

Running the package imported all of the files into my table BUT… it took much longer than expected. In fact, it took 8 minutes. So whatever my expectation was before running the package, this was way too slow. First I checked whether I had made a mistake. The OLE DB destination was set to use a fast table load with a table lock, not checking any constraints, so this was ok. The destination table was a heap with no primary key, so there wasn’t a problem with index reorganization.

Checking the progress log revealed that validation, pre-execute and post-execute events were executed for each file. And since each file contained only a few rows, very small batches were committed in each loop, causing the bad performance. Also, sending small batches to a table may be a bad idea depending on your table geometry. For example, when using clustered columnstore index tables, sending small batches results in asynchronous compression cycles, as explained here.

If the flat files are actually CSV files, the best approach is to use the MULTIFLATFILE Connection Manager. Actually I must admit that I wasn’t aware it existed, until a colleague showed it to me. So here are the instructions to find this connection manager: When you right-click in the connection section of your package, a dialog appears to choose the connection type. Click on “New Connection…” here.

image

In the following dialog you can choose the MULTIFLATFILE Connection Manager. It is configured in exactly the same way as the standard flat file connection manager, but now you can specify multiple files or directories to scan.

image

But since my source files were not in CSV format, I had to go for a different approach here: I replaced the for-each loop container with C# code inside the script component from above. Here is the corresponding code I used:

// Directory.EnumerateFiles requires a "using System.IO;" directive at the top of the script
public override void CreateNewOutputRows()
{
    String path = "c:\\temp\\JSON_Import";
    foreach (String filename in Directory.EnumerateFiles(path))
    {
        // process single file
        ...
    }
}

Again, I’m not going into details about the actual code for importing the JSON files here, but the code above shows how simply a for-each loop can be implemented within a script (of course, you will want to add some error handling and use a package variable for the import folder instead of the constant string). The remaining parts of the data flow were left unchanged.

This time, importing all of the 2,100 files took 10 seconds, so this approach was about 48 times faster than the for-each loop container.

image

 

 

Conclusion

In SSIS, writing data to a database table using a data flow gives the best performance if you have a large number of rows. However, importing many small files from a directory using the for-each loop container results in the opposite: many inserts with just a few rows each. If you encounter performance degradation in such a scenario, using the MULTIFLATFILE connection manager or, if that is not possible, converting the for-each loop container and the file read operation itself into a script component may result in much better performance. To improve performance even more, you could also try to parallelize the scripts (for example, the first script importing files 1, 3, 5, … and the second one importing files 2, 4, 6, …).

Sunday, August 31, 2014

Create table as select (CTAS) with “not null” column

PDW v1 | PDW 2012 | APS

CTAS (Create Table As Select) is a common way on the Parallel Data Warehouse (PDW) to transform one table into another, for example for calculations or for ELT (Extract Load Transform) processes.

The general syntax is quite simple and explained in the PDW help file:

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
        [ ( { column_name } [ ,...n ] ) ]
    WITH (
        DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE }
            [ , <CTAS_table_option> [ ,...n ] ] 
    )
    AS <select_statement>
[;]

<CTAS_table_option> ::=
    LOCATION = USER_DB
    | CLUSTERED COLUMNSTORE INDEX
    | CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] )
    | PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ]
        FOR VALUES ( [ boundary_value [ ,...n ] ] ) )

<select_statement> ::=
    [ WITH <common_table_expression> [ ,...n ] ]
    SELECT <select_criteria>

 

For example, a simple CTAS statement to copy the contents of one table (FactSales) to another table (FactSalesTmp) may look like this:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *  from FactSales

As you can see from the CTAS syntax definition above, it’s not possible to specify column properties like NULL/NOT NULL or constraints. The help file contains a note about this:

You cannot specify NULL | NOT NULL for the columns in the CTAS statement; the nullability property is derived from the columns and expressions in the SELECT results.

However, if you’re using CTAS to fill an intermediate staging table for partition switching operations, the switch-out table has to have exactly the same definition as the target table, which also means that the NULL/NOT NULL setting of each column has to be identical. So how is the nullability property derived from the expressions in the select results? To show this behavior, I’m using a simple fact table with the following definition:

create table FactSales (
      DateKey int not null
    , ProductKey int not null
    , StoreKey int not null
    , Quantity int not null
    , Amount decimal (13,4) not null
    , Costs decimal(13,4)
) with (distribution = hash(DateKey))

Using the simple CTAS statement from above (the one copying FactSales to FactSalesTmp) preserves the nullability of all columns. As long as you refer to existing table columns, their nullability is preserved.

But what about calculations? Let’s try the following CTAS statement:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, Amount*0.8 AS StandardCosts
from FactSales

I simply added another column here using a simple calculation. You can check the resulting table structure by choosing ‘View Code’ from the context menu in Data Tools or by running the following query:

select name, is_nullable from sys.columns where object_id=object_id('FactSalesTmp')

image

As you can see, the calculated column (StandardCosts) is considered nullable by the PDW. So how can we mark this column as not null in the CTAS statement?

Rewriting the CTAS from above using coalesce doesn’t solve the problem:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, coalesce(Amount*0.8,0) AS StandardCosts
from FactSales

 

However, using isnull does the trick:

create table FactSalesTmp
with (distribution = hash(DateKey))
as select *
, isnull(Amount*0.8,0) AS StandardCosts
from FactSales

 

image

The reason for this behavior is the different handling of the result data type by coalesce and isnull: isnull returns the data type of its first argument and is treated as not nullable when the replacement value is not null, whereas coalesce is evaluated like a CASE expression and its result is considered nullable. So, if you want an expression to be marked as ‘not null’ in a CTAS statement, use the isnull function.
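As a side note, since the motivation here was partition switching, the switch itself might look roughly like this once the staging table has been built with matching column definitions and partitioning (a sketch only; the partition number is just an assumption, and the target partition must be empty):

-- switch the staging data into the target table
alter table FactSalesTmp switch partition 2 to FactSales partition 2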

Sunday, August 10, 2014

The 'KeyColumns' #0 has NullProcessing set to 'UnknownMember', but the dimension doesn't have UnknownMember set to 'Visible' or 'Hidden'

SQL Server 2005-2014

 

By default, SSAS provides an automatically created member with the name ‘unknown’ for each dimension. This member is intended to be the home for all facts that don’t match a real member (provided by the data source). In the example shown here, fact data that does not match any of the listed product categories could be mapped to the unknown element.

image

I’m saying ‘could’ and not ‘is’ because the rules for mapping fact data to the unknown-element can be configured in the dimension properties.

But using this mechanism has certain drawbacks:

  • Processing time increases a lot if even a single row has to be mapped to the unknown member
  • Only one caption (for example ‘unknown’) can be configured for all attributes of the dimension
  • Such cases are hard to find, since you don’t see this mapping in the underlying data warehouse tables

 

In a good data warehouse design, ETL takes care of the correct mapping of fact data to its dimensions by using surrogate keys. Each join is then an inner join. In order to do so, dimension tables usually contain a row for the unknown element. Frequently, the surrogate key –1 is used for this row.

But following this best practice results in the SSAS dimension showing two elements for ‘unknown’: The dimension entry and the automatically created entry.

image

So, why do SSAS dimensions have this built-in unknown element by default? If we build almost all SSAS cubes based on a good data warehouse design where the dimensions maintain their own unknown element, there is no need for an automatically created unknown element anymore. But since the SSAS cube wizard is intended to work with most types of data structures, the unknown element is there by default. Without ETL-enforced surrogate keys you just cannot be sure that every fact row maps to its dimensions.

So, as explained above, we want to remove this default unknown element in almost all SSAS cube development projects. This can be easily done in the properties dialog of the dimension:

image

There are four available options for the UnknownMember property:

  • Visible: the unknown member of the dimension exists and is visible
  • Hidden: the unknown member of the dimension exists and is hidden
  • None: the unknown member of the dimension does not exist
  • AutomaticNull: the unknown member of the dimension exists and is visible if there are violations of referential integrity (fact keys not found in the dimension)

Again, if we take care of the surrogate keys in the ETL process, there is no need for a dimension unknown element at all. So the best option is to disable it (UnknownMember set to None).

However, because of other default settings, you’ll get the following error when trying to deploy your SSAS model afterwards:

The 'KeyColumns' #0 has NullProcessing set to 'UnknownMember', but the dimension doesn't have UnknownMember set to 'Visible' or 'Hidden'

If you’re getting this error for the first time, it might not be clear where to fix it, especially since there are two changes that need to be made:

 

1. Adjusting the dimension key attribute

If you look at the dimension, you’ll notice the red hash-line below the key attribute of your dimension.

image

In order to fix this, you’ll need to open the properties of that attribute. Then navigate to the KeyColumns setting and expand the view for each of the columns (the attribute’s key may consist of more than one column binding), as shown in the following screenshot:

image

Here you can set the NullProcessing option to “Error”. The default is “UnknownMember”, but since we just disabled the unknown member, this now causes the error.

Remember to do this for each key column of this attribute.

 

2. Adjust null processing in the dimension usage.

The second place to modify is the dimension mapping. To do so, open the cube and go to the Dimension Usage tab. You’ll notice the red hash line at the attribute (in my example, the Product ID):

image

In order to fix this, click the button next to the attribute (the one labeled “-“) to open this dialog:

image

Click advanced to edit the properties of the mapping:

image

In the lower part of the dialog, you can set the “Null Processing” from “UnknownMember” to “Error”.

After this change you should be able to deploy the cube again.