Search This Blog

Sunday, November 29, 2009

A duplicate attribute key has been found when processing…

SQL Server 2005 | SQL Server 2008

This post is about a common error message during dimension processing I’ve been asked about quite a few times so I thought it would be worth posting about it. The error message says that a duplicate attribute key has been found when processing as shown in the following screenshot for a test cube (I just processed one dimension here):

image_thumb3

Here’s the full error message:

Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Product', Column: 'ProductGroup', Value: ''. The attribute is 'Product Group'.

When you got to this article because you just ran into this problem you probably don’t want to read much about the background but only want a solution. Unfortunately I found at least three possible reasons for this error message:

Reason 1 (likely): The most likely reason for that error is that you are having NULL values in your attribute key column.If you simply created the attribute by dragging it from the source view, BIDS only sets the key column (name and value column default to the key column in this case), so for example if you have a column ‘Product Group’ in your source table and drag it to your dimension, the product group (Text field) will automatically become the key for this attribute. The attribute is listed in the error message (in the example above it is ‘Product Group’).

Solution: Try avoiding those NULL values in your data source (for example by using a DSV query and the T-SQL coalesce-function). When your source data is a data warehouse it’s also a good practice to avoid null values as they complicate the queries to the data warehouse.

Reason 2 (likely): You defined an attribute relationship between two attributes of the dimension but the data in your source tables violates the relationship. The error message gives you the name of the conflicting attribute (text part ‘The attribute is…’). The attributes has a relationship to another attribute but for the value stated in the error message (‘Value: …’) there are at least two different values in the attribute that the relationship refers to. If you have BIDS Helper installed, you can also see the error details and all violating references when using the ‘Dimension Health Check’ function.

Solution: You may solve the error by making the key of the attribute unique. For example:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'DimDate_x0024_', Column: 'Month', Value: 'April'. The attribute is 'Month'.
In this example, the Month attribute violates an attribute relationship (maybe Month->Year) for the month April meaning that April appears for more than one year. By adding the year to the key of the month attribute you would make the relationsship unique again.

Reason 3 (not that likely): You have an attribute with separate key and name source fields. When you check the data, you see that keys are appearing more than once with different entries in their name columns (note that it’s not a problem if the key appears more than once if only the name column is the same). In this case you will usually also see the key value in the error message, for example:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Product2', Column: 'ProductCode', Value: '1'. The attribute is 'Product Name'.
This means that the attribute ‘Product Name’ uses the source column ‘ProductCode’ as the key and for the product code 1 there is more than one name.

Solution: Use a unique key column (unique with respect to the name column)

Long explanation Reason 1:

In this case our attribute is only defined by one single source column (acting as key, name and value information) from the data source view. When processing a dimension, SSAS run select distinct queries on the underlying source table, so a duplicated key should be impossible even if the key appears multiple times. Just think of a date dimension like the following one (just for years and months):

image_thumb15

In this case the year (2009) appears in multiple rows. However, defining an attribute year (using the the year column as the key) does not give a conflict as it is queried using a distinct query (so 2009 only appears once). So again, how could we get a duplicate result when using a select distinct query? Here is how my product table looked like:

image_thumb16

As you can see the ProductGroup column has one row with an empty string and another row with a NULL value. When SSAS queries this attribute during processing it runs the following SQL query (that can be captured using the profiler):

SELECT DISTINCT [dbo_Product].[ProductGroup] AS [dbo_ProductProductGroup0_0]
FROM [dbo].[Product] AS [dbo_Product]

The result of the query looks like this:

image_thumb

Now, with the default NULL processing for our dimension attribute being set to ‘Automatic’ meaning Zero (for numerical values) or Blank (for texts) the NULL value above is converted to an empty string. So the result set has two lines with an empty string and that causes the error.

image_thumb17

So the problem can be avoided if you don’t have null values in your column. This explains the first reason described above.

Long explanation Reason 2:

I blogged about attribute relationship before and you may want to read this post about defining the key for attributes in an attribute relationship.

Long explanation Reason 3:

Let’s take a look at the following modified product table.

image_thumb2

The ProductID column is unique while the ProductCode is not. If we now define the ProductName attribute as follows we will also get a duplicate key error:

image_thumb4

The reason here is that for the ProductCode 1 two names are found (and therefore the select distinct returns two lines with ProductCode 1), so ProductCode is not a good key here. The problem would disappear if the ProductName for the third line would also be ‘A’ (like for the first line) or the ProductCode for the third line would be other than 1 or 2.

However, this reason occurs rather seldom because usually if we have a key and a name in our dimension, the source comes from some kind of master data table and therefore it should be unique. But for type 2 slowly changing dimensions you must not use the business key as key column (as there may be many rows with the same business key).

Another way to “solve” duplicate key errors (although not recommended) is to set the “KeyDuplicate” property for the error processing of the dimension to “IgnoreError” as shown below:

image_thumb18

However, this is definitely not recommended except for prototyping scenarios. This is clearly explained here http://technet.microsoft.com/en-us/library/bb630297.aspx.

Sunday, November 15, 2009

How to define an Excel compliant format strings for a cube

SQL Server 2005 | 2008

Cube defined formats are not necessarily displayed properly in Excel 2007. Sometimes the format seems to be ignored completely although it is working fine in the cube browser.

For a simple example I created a cube based in the following source table which acts as dimension and fact at the same time:

UnitID FormatString SampleValue
1 #,##0.00 12345.678
2 #,##0.00€ 12345.678
3 #,##0.00\€ 12345.678
4 #,##0.00 EUR 12345.678
5 #,##0.00 \E\U\R 12345.678
6 #,##0.00" EUR" 12345.678
7 #,##0.00"€" 12345.678

The ideas is simply to try out different format strings. I used the first two columns for my sample dimension and the first and the last column for my fact table giving a very simple cube

image_thumb9

In order to use the format string to format the sample value, I used this simple cube script:

scope ([Dim Unit].[Unit Format].[Unit Format]);
format_string(this) = [Dim Unit].[Unit Format].currentmember.name;
end scope;

Now, let’s browse our cube using the cube browser:

image_thumb2

As you can see, every of our approaches for formatting the value worked as expected. Now let’s take a look at the same cube using Excel 2007:

image_thumb8

As you can see, not all format strings are also understood by Excel. Simply using € or writing EUR confuses Excel – the value is displayed without any format. And although the cube browser understands even these formats, it’s dangerous to use unescaped characters in your format string as many of them have a meaning (like HH for the hour in a 24 hour representation). So it’s always a good idea to escape the characters. As you can see, both the backspace and the quotation marks work fine with Excel so it’s up to you what you prefer (or how other clients interpret the format string…).

Sunday, November 8, 2009

Simple Banding function for KPI Status or KPI Trend

SQL Server 2008

Although the way of defining the KPI status and trend is very flexible it is also somehow circumstantial if you simply want to rate a KPI by its target value. In this case you would have to calculate the status of the KPI as being between –1 (worse) and +1 (best).

In many cases we find simple case statements here:

case
when KPIVALUE("TestKPI") >= 0.95 * KPIGOAL("TestKPI") then 1
when KPIVALUE("TestKPI") < 0.7 * KPIGOAL("TestKPI") then -1
else 0
end

This results in the KPI status being –1, 0 or +1. But if you want to blend smoothly between these values you have to use a formula that is a little bit more complex. Especially if you need to this computation for many KPIs it’s not really nice.

I had some examples for AS stored procedures (sprocs) in this blog before. So here is a very simple one to calculate a linear approximation between the boundaries:

public static double KPIStatusBanding(double bound1, double bound2, double actual)
{
double factor=0;

if (bound1 < bound2) factor = 1;
else if (bound1 > bound2) factor = -1;
else return 0;

if (actual < bound1) return -factor;
else if (actual > bound2) return factor;
else return factor*2 * (actual - bound1) / (bound2 - bound1) - 1;
}

The parameters are as follows:

bound1 worst value
bound2 best value
actual actual value

image_thumb1

If bound1 < bound2 bigger values are better, if bound1 > bound2, lower values are better. With this function the calculation is quite easy. First let’s test the function itself with simple MDX queries:

with
member test as ASStatistics!KPIStatusBanding(100,200,175)
select test on 0
from [Adventure Works]

This results in a status value of 0.5.

In order to use the same boundaries as in my first example, I would use this KPI status expression:

ASStatistics!KPIStatusBanding(
0.7 * KPIGOAL("TestKPI"),
0.95 * KPIGOAL("TestKPI"),
KPIVALUE("TestKPI"))

image_thumb11

As we can see from the KPI browser the values now gets approximated smoothly:

image_thumb3

Role-playing time dimensions and semi-additive measures

My last two posts have been about semi-additive measures, so this is part 3 of my semi-additive measures mini series. This post is about role-playing time dimensions together with semi-additive measures. The good news is, that this shouldn't happen too often. At least it took me some time to invent an example. However, since I couldn't use the Adventure Works samples database I had to create my own data.

This is my scenario: I measure the number of cars in certain car parks (dimension) over the time (dimension). I modeled the time as date and hour (one of my next posts will be about how to separate date and time to separate dimensions), so I'm getting hourly snapshots of my car parks. The model looks somewhat like this:

image_thumb4

Since my car parks are located all over the world, I want to see the number of cars that were in the car park by local time and by GMT thus explaining the two links from the fact table to the date dimension. In my SSAS project I defined the DimDate dimension as a time dimension. So we have role playing time dimensions (roles are local time and GMT). The number of cars is our semi additive measure (using the average aggregation function for the time dimension).

Just to get familiar with the data I generated as my test data I first started with two separate cubes, one for the local time and one for GMT. In this step we don't have multiple time dimensions per cube so it's our normal and safe situation.

image_thumb7

The result reflects the time shift, therefore in the GMT cube you can see the same values just shifted by 6 and 9 hours respectively. As each cube has just one time dimension the aggregation is just the way we would expect it to be. For example, the local time cube takes the average over the local time and sums up the values across the car parks.

Now, what happens, if we put both dimensions and our parking fact table into one cube? As we know, it depends on the question, which time dimension comes first. If the local date is the first time dimension my result looked like this:

image_thumb10

The aggregation across the local time is correct (average) as expected while the aggregation across GMT looks strange. If you remember my post about the effects of semi-additive measures on other dimensions you will not be surprised. What you can see from the picture above is that

  • The aggregation across the GMT is correct for each car park (it averages the details) although GMT is not treated as a time dimension
  • The aggregation of the total across multiple car parks is wrong (it also averages the details).
  • The grand total cannot be easily explained (neither the sum nor the average of one of the two axis).

How can that behavior be explained? Let's take a look at the details:

image31

For this table I simply put the local time on the columns and the GMT on the rows. I also filtered for our two car parks we analyzed above. Now what you can see is that you get multiple local time values for each GMT hour. When looking at GMT 0:00 you will see 2239 cars in Denver and 2052 in Toronto but both at different local time (16:00 and 18:00). As our cube takes the local time is the time dimension, it averages the number of cars in both car parks, resulting in 2146 cars as (2239+2052) / 2 = 2146. This explains the reason why we see averages here while aggregating different car parks.

So, if we take a look again at the table above, this explains the averages:

image_thumb19

The numbers 2239 and 2052 in the first line are resulting from different car parks but also from different local times and that's why we see the average as a total here.

This also explains the total of 1932 as it is the average across all sums of our GMT data (in the picture above it would be the average of the last line which is not completely visible).

Ok, some things to remark:

  1. The behavior in our example above is exactly the same if the second time dimension is not a role playing one but a real dimension (own data source and own database dimension definition). Therefore it makes no difference here how you model your dimensions.
  2. In order to get proper aggregates you could use cube scripts (but keep in mind that you can only override existing cube cells, so there are drawbacks with multi-selects for example using Microsoft Excel).
  3. The aggregation problem would not exist if you would use separate measure groups for each of the role playing dimensions (ie. FactUtilizationLocal and FactUtilizationGMT) or separate cubes.
  4. The above data has nothing to do with real car parks but it is randomly created (using a noised sine curve).

Semi-additive measures, multiple measure groups and time dimensions

My last post was about the effects of semi-additive measures on dimensions other than the time dimension. Now let's focus on the situation of having more than one dimension of type 'time' in our cube.

As we've seen from many posts this may result in our aggregations are not what we would expect. The effect becomes clearly visible with semi-additive measures (although you should always set the type of your time dimension in order to provide useful information to the OLAP-clients). Only the first dimension is treated as a time-dimension while all following time dimensions are treated as normal dimensions. If, for example, a specific measure is set to aggregate as 'average of children' this will only consider the first time dimension as the one to compute the average for.

Now, let's get a little bit more detailed on that 'first' time dimension. In this context, 'first' relates to the order of the dimensions in a cube which can be changed by drag'n'drop on the first tab of the cube file editor (lower left corner):

image_thumb1

Now, imagine we have two dimensions of type 'time' in our cube. Let's name them 'Local Date' and 'GMT Date'. The 'Local Date' dimension comes first. The questions which of those dimensions is actually considered the time dimension of a measure group depends on the dimension usage:

  • If the measure group is only linked to the 'Local Date' dimension, this becomes the time dimension for the measure group
  • If the measure group is only linked to the 'GMT Date' dimension, this becomes the time dimension for the measure group
  • If the measure group is linked to both dimensions, the 'Local Date' dimension becomes the time dimension for the measure group (because it appears first in the dimension list of the cube)

In this sense, 'linked' really means the dimension usage as being defined on the dimension usage tab of the cube file editor. Links that are only created in the date source view but not modeled in the cube have no influence in choosing the first time dimension. For example, if the measure group 'Utilization GMT' is linked to both time dimensions (Local Date and GMT date) in the date source view but only to 'GMT Date' in the dimension usage, then the 'GMT Date' dimension would be the time dimension for the measure group even if the 'Local Date' dimension appears in front of the 'GMT Date' dimension. Sounds confusing? Ok, here is the quintessence:

You may have multiple dimensions of type 'time' in your cube, but you should avoid measure groups that have more than one of them in their dimension usage.

In any case, having more than one time dimension in a cube can be error-prone, even if you follow the above recommendation. It would have been better if Microsoft put the 'time dimension' as an attribute to the measure group instead of simple choosing the one that first appears in the cube definition. If somebody changes the order for any reason the aggregation behavior in one or more of your measure groups might change too.

You might not going to face a situation with more than one time dimension too often. But what about role playing time dimensions and semi-additive measures? My next post will be about this topic.