
Sunday, August 26, 2012

One or two dimensions?

Dimensional modeling

This post is about a typical question in dimensional modeling: how to cluster the attributes for analysis into dimension tables in your data warehouse. Usually this is decided within the ETL process of your BI system, so it might not look like a typical question of SSAS design. However, the question can come up in the context of your data source views and also in the context of usability (which makes it exactly a question of SSAS design).

In the elementary school of BI we learn that our master data tables do not necessarily become the dimensions and our transactional data tables do not necessarily become our fact tables, but that the dimensional model regroups the information stored in the columns of those tables into a flexible, yet easy to understand model of fact and dimension tables.

For example, a typical operational data structure for orders, with transactional tables for order header and order detail, might result in a single order fact table, with some of the order header information being stored as dimensional information.

So, for any two attributes in your source tables, you have to decide whether to put them into the same dimension table or into different ones. Here are the two options:

 

Option 1: Same dimension table for both attributes

[Figure: fact table linked to a single dimension table containing both attributes]

Option 2: Separate dimension tables for the two attributes

[Figure: fact table linked to two separate dimension tables, one per attribute]

 

To give you some guidance with this decision, here are some aspects that favor one of the two options.

 

Aspect / requirement and the option it favors:

  • Both attributes logically belong to the same business entity, for example the customer or the region or the company → one dimension table (option 1)
  • Both attributes have a strong relationship, like one to many (or ‘almost’ one to many) → one dimension table (option 1)
  • Both attributes naturally appear in the same hierarchy (for example product group and product) → one dimension table (option 1)
  • The number of existing combinations of both attributes is significantly higher than the number of rows for the attribute with the higher row count (see below) → separate dimension tables (option 2)
  • There are fact tables that are only linked to one of the two attributes (for example planned revenue being linked to the product group, but not to the product; see below) → separate dimension tables (option 2)
  • The association between attribute 1 and attribute 2 is rapidly changing over time but must be reflected in the model (historical view must be kept; see below) → separate dimension tables (option 2)
  • It should be possible to analyze facts by the most current combination of the two attributes (see below) → one dimension table (option 1)

 

While most of these aspects are self-explanatory, I’d like to add some more information on a few of them here:

 

The number of existing combinations of both attributes is significantly higher than the number of rows for the attribute with the higher row count.

Although both attributes might appear to belong to the same logical entity, putting them into one dimension table might dramatically increase the number of rows for this table. For example: date and time. A table storing dates at day level contains about 3,650 rows for ten years (10 x 365). A table storing the time of day at minute granularity contains only 1,440 rows (24 hours x 60 minutes). So keeping date and time separated results in two small dimension tables. But as every entry of the time table exists for every day, the combined table would contain 3,650 x 1,440 = 5,256,000 rows. While most BI systems are capable of handling dimensions of that size, it is still a very large dimension table.

 

There are fact tables that are only linked to one of the two attributes

In an earlier post I wrote about associating fact tables with different levels of granularity within the same dimension table. This is possible, but it might be confusing and it somewhat breaks the idea of a true star or snowflake layout. This is a hard decision for the dimensional architect and has to be considered very carefully, because in most scenarios where you face this situation, both attributes really belong to the same entity and may even form a hierarchy (one to many), but only one of the two fact tables has a detail level that is sufficient to link it to both attributes. For example, you may have revenue as actual and forecast in different fact tables. The actual revenue is at day level, but the forecast is at month level. Now, day and month perfectly belong together and even form a good hierarchy, and separating day and month into different dimensions would be a strange solution. So consider for yourself whether you want to separate those attributes or use the technique described in that earlier post to keep them together.

 

The association between attribute 1 and attribute 2 is rapidly changing over time but must be reflected in the model (historical view must be kept)

In the context of preserving the historical view in our data warehouse, we usually think of Kimball’s slowly changing dimension approach, especially slowly changing dimensions of type 2, 3 or combinations of them. However, separating our attributes into two different dimension tables automatically preserves the historical view when both are used from the same fact table (assuming the fact table is also linked to the date/time dimension). This is somewhat obvious but often ignored by dimensional architects. And while slowly changing dimensions of type 2 or 3 increase the number of rows in our dimension table and might not work for rapidly changing dimensions, separating the attributes keeps the individual dimension tables small.
Of course, the opposite is also true, as explained in the next aspect:

 

It should be possible to analyze facts by the most current combination of the two attributes

In Kimball’s sense of slowly changing dimensions, this would be a type 1 situation. For example, the requirement is to analyze customers (attribute 1) only by their most recent city of residence (attribute 2). If, for any reason, customer and city are separated into different dimension tables (one named customer, the other named customer geography, for example), a relocation of the customer would require updating existing fact rows in order to get the present view (SCD type 1). Updating existing fact rows is usually something you want to avoid, if possible.

Sunday, August 5, 2012

Semi Additive Measures in SQL Server Standard Edition

SQL Server 2008 | SQL Server 2008R2 | SQL Server 2012

A few years ago, Alex Whittles wrote a very detailed explanation of semi additive measures in SQL Server Standard Edition. For my post today, I’m concentrating on some other ideas, especially on dynamic sets in SQL Server 2008. Semi additive measures are measures that aggregate differently over time (or over other specific dimensions). For example, thinking of stock values, you want to sum them up by product, location, country etc., but not over time (there you may want to see the last value, the average, the minimum/maximum or the like).

For SQL Server SSAS (multidimensional), semi additive measures are a feature that is only available in the Enterprise Edition of the server. The following post shows how to implement such calculations on a Standard Edition. Be aware, though, that the built-in implementation in the Enterprise Edition performs better than the custom calculations shown here.

For my example, I’m using pretty much the same source data as for my DAX post about semi additive measures.

This is how the sample data looks when using the measure StockLevel:

[Screenshot: pivot table of StockLevel by product and date]

The time dimension is at day granularity. The stock values are associated with the last day of the month (in order to properly show the effect of the ‘Last Value’ semi additive function).

Since we use Sum as the aggregation function here, the grand total for each row is correct, while the subtotals and the grand total for the columns make no sense.

This is a job for semi additive measures! In order to see the different effects, I created three of them, named

  • EP Average
  • EP LastValue
  • EP LastNonEmpty

The prefix EP stands for Enterprise Edition. For example, this is the definition of “EP LastNonEmpty”:

[Screenshot: definition of the measure EP LastNonEmpty with the aggregation function LastNonEmpty]

The following screenshot shows all of these measures side by side (for all products):

[Screenshot: EP Average, EP LastValue and EP LastNonEmpty side by side]

The difference between ‘last value’ and ‘last non empty’ shows up in the totals. Last non empty really takes the last non-empty value within the current context – not within the full date dimension. So for the measure “EP LastNonEmpty” we see a total for 2011 but no total for April (as there is no value for April at all). The measure “EP LastValue”, on the other hand, gives no total for 2011, as there is no value for December 31, 2011.

Now let’s try to rebuild those measures.

 

Average over time (version 1)

First we start with the average over time. If you have (or can create) a distinct count measure for the date key of the fact table, it is quite easy to compute the average over time. Let’s assume our distinct count measure is named [Date Key Distinct Count]; then the calculation for the average looks like this:

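// Sum of the stock levels divided by the number of distinct days with fact rows in the current context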
CREATE MEMBER CURRENTCUBE.[Measures].[STD Average]
AS [Measures].[StockLevel] / [Measures].[Date Key Distinct Count];

We can check the results by putting both measures side by side:

[Screenshot: STD Average next to EP Average]

Of course, the calculation also works if you filter for a specific product.
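One caveat with this approach: for cells without any fact rows, the division can produce artifacts like 1.#INF instead of an empty cell. If you run into this, a simple IIF guard helps; the measure name [STD Average Safe] below is just my own illustration, everything else uses the same measures as above:

CREATE MEMBER CURRENTCUBE.[Measures].[STD Average Safe]
AS IIF([Measures].[Date Key Distinct Count] = 0,
       NULL,  // keep the cell empty instead of dividing by zero
       [Measures].[StockLevel] / [Measures].[Date Key Distinct Count]);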

Average over time (version 2)

I just want to present a different approach in case you do not want to create a distinct count measure (which requires a separate measure group). We still need to find out the number of days in a given date range. We can use a dynamic set in order to do so:

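// All members at the day level; DYNAMIC makes the set re-evaluate against each query's current subcube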
CREATE DYNAMIC SET CURRENTCUBE.[CurrentDateRange]
AS [Date].[Day].[Day], DISPLAY_FOLDER = 'Calculations';

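// Divide the (summed) stock level by the number of days in the current context that actually have data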
CREATE MEMBER CURRENTCUBE.[Measures].[STD Average 2]
AS [Measures].[Stocklevel]
    /count(nonempty(existing CurrentDateRange, [Measures].[Stocklevel])) ;

The set needs to be dynamic in order to reflect a subcube context (for example, if Microsoft Excel 2007 or later is used to query the cube). Also, we have to filter out the dates without values, as the average should only count the days that actually have data.
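If you want to convince yourself that the dynamic set really follows the slicer, a quick test query of this shape can help (the cube name [Stock] as well as the product and year hierarchies here are placeholders; adjust them to your model):

SELECT { [Measures].[EP Average], [Measures].[STD Average 2] } ON COLUMNS,
       NON EMPTY [Product].[Product].Members ON ROWS
FROM [Stock]
WHERE ( [Date].[Year].&[2011] )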

Comparing the three measures (the Enterprise method and the two custom methods for the average) side by side shows that they return the same results:

[Screenshot: EP Average, STD Average and STD Average 2 side by side]

 

Last value

For the last value, I’m again using the dynamic set approach. Since we use the same set, the calculation is simply:

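// Take the stock level at the last day of the current date range (empty if that day carries no data)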
CREATE MEMBER CURRENTCUBE.[Measures].[STD LastValue]
AS ([Measures].[Stocklevel], tail(existing CurrentDateRange,1).item(0)) ;

The expression tail(…, 1) returns a set containing only the last item of the given set, and item(0) extracts that item so that we can evaluate our measure on it.

Before comparing the results with the enterprise measure, let’s quickly add the last one:

 

Last Non Empty

This formula is almost identical to the one I used before, except that we wrap the set in a nonempty(…) function. Again we take the last item (which is now the last item with data), so the function looks like this:

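// Drop the days without data first, then take the last remaining day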
CREATE MEMBER CURRENTCUBE.[Measures].[STD LastNonEmpty]
AS ([Measures].[Stocklevel],
   tail(nonempty(existing CurrentDateRange, [Measures].[Stocklevel]),1).item(0)) ;

 

Here are our last two measures together with their corresponding Enterprise measures side by side:

[Screenshot: STD LastValue and STD LastNonEmpty next to their Enterprise counterparts]

 

Of course, you can use the calculations above for other kinds of aggregates as well. For example, a minimum over time (with sum over all other dimensions) would look like this:

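// Minimum of the stock level across the days of the current date range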
CREATE MEMBER CURRENTCUBE.[Measures].[STD Min]
AS Min(existing CurrentDateRange, [Measures].[Stocklevel]);

And you can also use this kind of calculation for dimensions other than the time dimension.
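For example, here is a sketch of an average over a hypothetical [Warehouse] dimension (the dimension, hierarchy and measure names are placeholders of mine); the pattern is the same as in version 2 of the average above:

// Dynamic set over all warehouses, re-evaluated per query context
CREATE DYNAMIC SET CURRENTCUBE.[CurrentWarehouseRange]
AS [Warehouse].[Warehouse].[Warehouse], DISPLAY_FOLDER = 'Calculations';

// Average stock level per warehouse that actually has data
CREATE MEMBER CURRENTCUBE.[Measures].[STD Average Warehouse]
AS [Measures].[StockLevel]
   / count(nonempty(existing CurrentWarehouseRange, [Measures].[StockLevel]));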