Sunday, November 29, 2009

A duplicate attribute key has been found when processing…

SQL Server 2005 | SQL Server 2008

This post is about a common error message during dimension processing I’ve been asked about quite a few times so I thought it would be worth posting about it. The error message says that a duplicate attribute key has been found when processing as shown in the following screenshot for a test cube (I just processed one dimension here):

image

Here’s the full error message:

Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Product', Column: 'ProductGroup', Value: ''. The attribute is 'Product Group'.

When you got to this article because you just ran into this problem you probably don’t want to read much about the background but only want a solution. Unfortunately I found at least three possible reasons for this error message:

Reason 1 (likely): The most likely reason for that error is that you are having NULL values in your attribute key column.If you simply created the attribute by dragging it from the source view, BIDS only sets the key column (name and value column default to the key column in this case), so for example if you have a column ‘Product Group’ in your source table and drag it to your dimension, the product group (Text field) will automatically become the key for this attribute. The attribute is listed in the error message (in the example above it is ‘Product Group’).

Solution: Try avoiding those NULL values in your data source (for example by using a DSV query and the T-SQL coalesce-function). When your source data is a data warehouse it’s also a good practice to avoid null values as they complicate the queries to the data warehouse.

Reason 2 (likely): You defined an attribute relationship between two attributes of the dimension but the data in your source tables violates the relationship. The error message gives you the name of the conflicting attribute (text part ‘The attribute is…’). The attributes has a relationship to another attribute but for the value stated in the error message (‘Value: …’) there are at least two different values in the attribute that the relationship refers to. If you have BIDS Helper installed, you can also see the error details and all violating references when using the ‘Dimension Health Check’ function.

Solution: You may solve the error by making the key of the attribute unique. For example:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'DimDate_x0024_', Column: 'Month', Value: 'April'. The attribute is 'Month'.
In this example, the Month attribute violates an attribute relationship (maybe Month->Year) for the month April meaning that April appears for more than one year. By adding the year to the key of the month attribute you would make the relationsship unique again.

Reason 3 (not that likely): You have an attribute with separate key and name source fields. When you check the data, you see that keys are appearing more than once with different entries in their name columns (note that it’s not a problem if the key appears more than once if only the name column is the same). In this case you will usually also see the key value in the error message, for example:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Product2', Column: 'ProductCode', Value: '1'. The attribute is 'Product Name'.
This means that the attribute ‘Product Name’ uses the source column ‘ProductCode’ as the key and for the product code 1 there is more than one name.

Solution: Use a unique key column (unique with respect to the name column)

 

Long explanation Reason 1:

In this case our attribute is only defined by one single source column (acting as key, name and value information) from the data source view. When processing a dimension, SSAS run select distinct queries on the underlying source table, so a duplicated key should be impossible even if the key appears multiple times. Just think of a date dimension like the following one (just for years and months):

image

In this case the year (2009) appears in multiple rows. However, defining an attribute year (using the the year column as the key) does not give a conflict as it is queried using a distinct query (so 2009 only appears once). So again, how could we get a duplicate result when using a select distinct query? Here is how my product table looked like:

image

As you can see the ProductGroup column has one row with an empty string and another row with a NULL value. When SSAS queries this attribute during processing it runs the following SQL query (that can be captured using the profiler):

SELECT DISTINCT [dbo_Product].[ProductGroup] AS [dbo_ProductProductGroup0_0]
FROM [dbo].[Product] AS [dbo_Product]

The result of the query looks like this:

image

Now, with the default NULL processing for our dimension attribute being set to ‘Automatic’ meaning Zero (for numerical values) or Blank (for texts) the NULL value above is converted to an empty string. So the result set has two lines with an empty string and that causes the error.

image

So the problem can be avoided if you don’t have null values in your column. This explains the first reason described above.

 

Long explanation Reason 2:

I blogged about attribute relationship before and you may want to read this post about defining the key for attributes in an attribute relationship.

 

Long explanation Reason 3:

Let’s take a look at the following modified product table.

image

The ProductID column is unique while the ProductCode is not. If we now define the ProductName attribute as follows we will also get a duplicate key error:

image

The reason here is that for the ProductCode 1 two names are found (and therefore the select distinct returns two lines with ProductCode 1), so ProductCode is not a good key here. The problem would disappear if the ProductName for the third line would also be ‘A’ (like for the first line) or the ProductCode for the third line would be other than 1 or 2.

However, this reason occurs rather seldom because usually if we have a key and a name in our dimension, the source comes from some kind of master data table and therefore it should be unique. But for type 2 slowly changing dimensions you must not use the business key as key column (as there may be many rows with the same business key).

Another way to “solve” duplicate key errors (although not recommended) is to set the “KeyDuplicate” property for the error processing of the dimension to “IgnoreError” as shown below:

image

However, this is definitely not recommended except for prototyping scenarios. This is clearly explained here http://technet.microsoft.com/en-us/library/bb630297.aspx.

Sunday, November 15, 2009

How to define an Excel compliant format strings for a cube

SQL Server 2005 | 2008

Cube defined formats are not necessarily displayed properly in Excel 2007. Sometimes the format seems to be ignored completely although it is working fine in the cube browser.

For a simple example I created a cube based in the following source table which acts as dimension and fact at the same time:

UnitID FormatString SampleValue
1 #,##0.00 12345.678
2 #,##0.00€ 12345.678
3 #,##0.00\€ 12345.678
4 #,##0.00 EUR 12345.678
5 #,##0.00 \E\U\R 12345.678
6 #,##0.00" EUR" 12345.678
7 #,##0.00"€" 12345.678

 

The ideas is simply to try out different format strings. I used the first two columns for my sample dimension and the first and the last column for my fact table giving a very simple cube

image

In order to use the format string to format the sample value, I used this simple cube script:

scope ([Dim Unit].[Unit Format].[Unit Format]);
format_string(this) = [Dim Unit].[Unit Format].currentmember.name;
end scope;

Now, let’s browse our cube using the cube browser:

image

As you can see, every of our approaches for formatting the value worked as expected. Now let’s take a look at the same cube using Excel 2007:

image

As you can see, not all format strings are also understood by Excel. Simply using € or writing EUR confuses Excel – the value is displayed without any format. And although the cube browser understands even these formats, it’s dangerous to use unescaped characters in your format string as many of them have a meaning (like HH for the hour in a 24 hour representation). So it’s always a good idea to escape the characters. As you can see, both the backspace and the quotation marks work fine with Excel so it’s up to you what you prefer (or how other clients interpret the format string…).

Sunday, November 8, 2009

Simple Banding function for KPI Status or KPI Trend

SQL Server 2008

Although the way of defining the KPI status and trend is very flexible it is also somehow circumstantial if you simply want to rate a KPI by its target value. In this case you would have to calculate the status of the KPI as being between –1 (worse) and +1 (best).

In many cases we find simple case statements here:

case
when KPIVALUE("TestKPI") >= 0.95 * KPIGOAL("TestKPI") then 1
when KPIVALUE("TestKPI") < 0.7 * KPIGOAL("TestKPI") then -1
else 0
end

This results in the KPI status being –1, 0 or +1. But if you want to blend smoothly between these values you have to use a formula that is a little bit more complex. Especially  if you need to this computation for many KPIs it’s not really nice.

I had some examples for AS stored procedures (sprocs) in this blog before. So here is a very simple one to calculate a linear approximation between the boundaries:

public static double KPIStatusBanding(double bound1, double bound2, double actual)
{
    double factor=0;

    if (bound1 < bound2) factor = 1;
    else if (bound1 > bound2) factor = -1;
    else return 0;

    if (actual < bound1) return -factor;
    else if (actual > bound2) return factor;
    else return factor*2 * (actual - bound1) / (bound2 - bound1) - 1;
}

The parameters are as follows:

bound1 worst value
bound2 best value
actual actual value

 

image

If bound1 < bound2 bigger values are better, if bound1 > bound2, lower values are better. With this function the calculation is quite easy. First let’s test the function itself with simple MDX queries:

with
member test as ASStatistics!KPIStatusBanding(100,200,175)
select test on 0
from [Adventure Works]

This results in a status value of 0.5.

In order to use the same boundaries as in my first example, I would use this KPI status expression:

ASStatistics!KPIStatusBanding(
0.7 * KPIGOAL("TestKPI"),
0.95 * KPIGOAL("TestKPI"),
KPIVALUE("TestKPI"))

image

As we can see from the KPI browser the values now gets approximated smoothly:

image

Role-playing time dimensions and semi-additive measures

My last two posts have been about semi-additive measures, so this is part 3 of my semi-additive measures mini series. This post is about role-playing time dimensions together with semi-additive measures. The good news is, that this shouldn't happen too often. At least it took me some time to invent an example. However, since I couldn't use the Adventure Works samples database I had to create my own data.

This is my scenario: I measure the number of cars in certain car parks (dimension) over the time (dimension). I modeled the time as date and hour (one of my next posts will be about how to separate date and time to separate dimensions), so I'm getting hourly snapshots of my car parks. The model looks somewhat like this:

image

Since my car parks are located all over the world, I want to see the number of cars that were in the car park by local time and by GMT thus explaining the two links from the fact table to the date dimension. In my SSAS project I defined the DimDate dimension as a time dimension. So we have role playing time dimensions (roles are local time and GMT). The number of cars is our semi additive measure (using the average aggregation function for the time dimension).

Just to get familiar with the data I generated as my test data I first started with two separate cubes, one for the local time and one for GMT. In this step we don't have multiple time dimensions per cube so it's our normal and safe situation.

image

The result reflects the time shift, therefore in the GMT cube you can see the same values just shifted by 6 and 9 hours respectively. As each cube has just one time dimension the aggregation is just the way we would expect it to be. For example, the local time cube takes the average over the local time and sums up the values across the car parks.

Now, what happens, if we put both dimensions and our parking fact table into one cube? As we know, it depends on the question, which time dimension comes first. If the local date is the first time dimension my result looked like this:

image

The aggregation across the local time is correct (average) as expected while the aggregation across GMT looks strange. If you remember my post about the effects of semi-additive measures on other dimensions you will not be surprised. What you can see from the picture above is that

  • The aggregation across the GMT is correct for each car park (it averages the details) although GMT is not treated as a time dimension
  • The aggregation of the total across multiple car parks is wrong (it also averages the details).
  • The grand total cannot be easily explained (neither the sum nor the average of one of the two axis).

How can that behavior be explained? Let's take a look at the details:

image

For this table I simply put the local time on the columns and the GMT on the rows. I also filtered for our two car parks we analyzed above. Now what you can see is that you get multiple local time values for each GMT hour. When looking at GMT 0:00 you will see 2239 cars in Denver and 2052 in Toronto but both at different local time (16:00 and 18:00). As our cube takes the local time is the time dimension, it averages the number of cars in both car parks, resulting in 2146 cars as (2239+2052) / 2 = 2146. This explains the reason why we see averages here while aggregating different car parks.

So, if we take a look again at the table above, this explains the averages:

image

The numbers 2239 and 2052 in the first line are resulting from different car parks but also from different local times and that's why we see the average as a total here.

This also explains the total of 1932 as it is the average across all sums of our GMT data (in the picture above it would be the average of the last line which is not completely visible).

Ok, some things to remark:

  1. The behavior in our example above is exactly the same if the second time dimension is not a role playing one but a real dimension (own data source and own database dimension definition). Therefore it makes no difference here how you model your dimensions.
  2. In order to get proper aggregates you could use cube scripts (but keep in mind that you can only override existing cube cells, so there are drawbacks with multi-selects for example using Microsoft Excel).
  3. The aggregation problem would not exist if you would use separate measure groups for each of the role playing dimensions (ie. FactUtilizationLocal and FactUtilizationGMT) or separate cubes.
  4. The above data has nothing to do with real car parks but it is randomly created (using a noised sine curve).

Semi-additive measures, multiple measure groups and time dimensions

My last post was about the effects of semi-additive measures on dimensions other than the time dimension. Now let's focus on the situation of having more than one dimension of type 'time' in our cube.

As we've seen from many posts this may result in our aggregations are not what we would expect. The effect becomes clearly visible with semi-additive measures (although you should always set the type of your time dimension in order to provide useful information to the OLAP-clients). Only the first dimension is treated as a time-dimension while all following time dimensions are treated as normal dimensions. If, for example, a specific measure is set to aggregate as 'average of children' this will only consider the first time dimension as the one to compute the average for.

Now, let's get a little bit more detailed on that 'first' time dimension. In this context, 'first' relates to the order of the dimensions in a cube which can be changed by drag'n'drop on the first tab of the cube file editor (lower left corner):

image

 

Now, imagine we have two dimensions of type 'time' in our cube. Let's name them 'Local Date' and 'GMT Date'. The 'Local Date' dimension comes first. The questions which of those dimensions is actually considered the time dimension of a measure group depends on the dimension usage:

  • If the measure group is only linked to the 'Local Date' dimension, this becomes the time dimension for the measure group
  • If the measure group is only linked to the 'GMT Date' dimension, this becomes the time dimension for the measure group
  • If the measure group is linked to both dimensions, the 'Local Date' dimension becomes the time dimension for the measure group (because it appears first in the dimension list of the cube)

In this sense, 'linked' really means the dimension usage as being defined on the dimension usage tab of the cube file editor. Links that are only created in the date source view but not modeled in the cube have no influence in choosing the first time dimension. For example, if the measure group 'Utilization GMT' is linked to both time dimensions (Local Date and GMT date) in the date source view but only to 'GMT Date' in the dimension usage, then the 'GMT Date' dimension would be the time dimension for the measure group even if the 'Local Date' dimension appears in front of the 'GMT Date' dimension. Sounds confusing? Ok, here is the quintessence:

You may have multiple dimensions of type 'time' in your cube, but you should avoid measure groups that have more than one of them in their dimension usage.

In any case, having more than one time dimension in a cube can be error-prone, even if you follow the above recommendation. It would have been better if Microsoft put the 'time dimension' as an attribute to the measure group instead of simple choosing the one that first appears in the cube definition. If somebody changes the order for any reason the aggregation behavior in one or more of your measure groups might change too.

You might not going to face a situation with more than one time dimension too often. But what about role playing time dimensions and semi-additive measures? My next post will be about this topic.

Sunday, August 23, 2009

Effects of semi-additive measures on other dimensions

Semi-additive measures are measures that don't use the same aggregation method along all of the cube's dimensions. In SSAS the time dimension plays an important role here. For example, if you choose the 'AverageOfChildren' aggregation method, the measure is averaged over the time but summed up over all other dimensions. When do we need this? Well, usually semi-additive measures are used when working with snapshot data like stock levels, balances of accounts etc.

For an example, let's look at a car park. We want to analyze the number of cars that are in one of our two park houses at a certain time. For our example we use a very simple fact table looking like this (I simplified the time to just 2 days, so imagine we're counting the cars at 1pm):

1

As you can see, someone also wanted to analyze the weather at each park house and each day so that we can easily see the effect the weather has on the parking behavior (obviously more people took the car when the weather was bad...).

Based on the fact data above we use a very simple data model looking like this:

 

2 

 

Now let's do some analysis on that model. First we analyze by date and park house:

3a

This clearly shows the semi-additive behavior of our measure: For each row (combining the two days) the value is averaged while it gets summed up across the park houses (columns) as expected. At day 1, 1pm we had 10 cars in car park P1. At the same time on day 2 there were 30. Of course we cannot add these values together but the average makes sense. So the meaning of the grand total is 'at 1pm there was an average of 20 cars in our car park P1'.

Now, what would happen if we start analyzing by the weather? As the weather is not our time dimension you might expect values to add up as a simple sum. So let's try:

3

Hmm, what happened here? The sum would have been 75 but our OLAP query results in a value of 50. What had happened and what is the total? You can easily see that it is neither the sum nor the average of the above values.

In order to clarify this behavior, let's also include the time in our analysis:

4

Using this view we can clearly see that SSAS made no mistake. The total number of cars being in our park houses at an average weather is 25 (average of 20 and 30, line one in the grid above), so this is correct. On the other hand, we had 30 cars in our park houses at day 1 (20 in a park house with average weather at that day and 10 in a park house with good weather) and 70 at day 2 giving a total average of 50. So this is also correct.

So the total for the weather is not the average or the sum of it's detailed values but the average of the summed up values.

This is a very simple example of how semi-additive aggregation of a measure also influences aggregation against non-time dimensions as well.

With this in mind you are now well prepared to look at the following analysis based on the same source data:

5

This time we left the time dimension completely out of the analysis resulting in averages along both other dimensions (park house and weather) as all values we actually see are already averages.

Surely, it's less confusing when looking at a single point of time like in the example below, where everything sums up neatly:

6

Sunday, August 9, 2009

Calculating the proper ISO week for the time dimension

There are some definitions for calendar weeks and T-SQL offers many of those calculations out of the box. When you're doing projects in Europe you will frequently find the ISO week definition for the calendar week as of ISO 8601. Week 01 in the ISO week calendar is defined is week with the year's first Thursday in it and weeks start with Monday and end with Sunday. The notation described in the ISO looks for exmaple like this 2009-W28-6. This would mean the sixth day of the ISO week (i.e. Saturday as the ISO week starts with Monday) of week 28 in the year 2009.

 

Fortunately T-SQL 2008 knows how to compute the ISO week number using the datapart function, for example

declare @dt date;
set @dt=CAST('2009-7-11' as DATE)
select DATEPART(ISO_WEEK,@dt)

results in 28 (week 28).

Now for the day of the week we can again use datepart, this time with the WEEKDAY parameter:

select DATEPART(WEEKDAY,@dt)

Depending on your machine settings this will most likely return 7 (as Saturday is the 7th day of the week when starting with Sunday=1). We can override the first day setting by using the following statement:

set datefirst 1;

 

After doing so the result of the above statement would be 6 as required. If you don't want to use the datefirst setting you could as well correct the day number manually using the modulus operator %:

select (DATEPART(weekday,@dt)+5) % 7+1

 

So up till now we have our week and the day. The remaining thing is the year. You might expect this to be very easy as the T-SQL year function extracts the year from a given date. But the year of the calendar week might differ from the calendar year, for example December 29, 2008 would be in the ISO week 1 of 2009 (2009-W1-1).

Therefore we need to compute the proper week. A way to do so is like this:

select DATEPART(year,DATEADD(d,-(DATEPART(weekday,@dt)+5) % 7+3,@dt))

 

Putting it all together we end up with the following computation:

cast(DATEPART(year,DATEADD(d,-(DATEPART(weekday,@dt)+5) % 7+3,@dt)) as nvarchar)
+N'-W'+CAST(DATEPART(ISO_WEEK,@dt) as nvarchar)
+N'-'+cast((DATEPART(weekday,@dt)+5) % 7+1 as nvarchar)

 

In order to use this expression in your data source views (DSV) computation you would have to replace the @dt variable with the specific date field and you have to make sure that the datefirst parameter is set to 7 (default), i.e.

select @@datefirst

should return the value 7.