Search This Blog

Sunday, August 26, 2012

One or two dimensions?

Dimensional modeling

This post is about a typical question in dimensional modeling. The question is about how to cluster the attributes for analysis into dimension tables in your data ware house. Usually this is done within the ETL process of your BI system, so this might not be a typical question for the SSAS design. However, the question might come up in the context of your data source views and also in the context of usability (which makes it exactly a question of the SSAS design).

In the elementary school of BI we learn that our master data tables do not necessarily be the dimensions, that our transactional data tables do not necessarily be our fact tables but that the dimensional model is somewhat re-grouping all the information that is stored in columns of those tables to a flexible, yet easy to understand model of fact and dimension tables.

For example, a typical operational data structure for orders, with transactional tables of order head and order detail, might result in a single order fact table with some of the order head information being stored as dimensional information.

So, for any two attributes in your source tables, you have to consider to put them into the same or into different dimension tables. Here are the two options:

 

Option 1: Same dimension table for both attributes

image

Option 2: Separate dimension tables for the two attributes

image

 

In order to give you some guideline with this decision, here are some aspects that favor one of the two options.

 

Aspect / requirement Favors one dimension table (option 1) Favors separate dimension tables (option 2)
Both attributes logically belong to the same business entity, for example the customer or the region or the company

Stern

Both attributes have a strong relationship, like one to many (or ‘almost’ one to many)

Stern

Both attributes naturally appear in the same hierarchy (for example product group and product)

Stern

 

The number of existing combinations of both attributes is significantly higher than the number of rows for the attribute with the higher row count (see below) 

Stern

There are fact tables that are only linked to one of the two attributes (for example planned revenue being linked to the product group, but not to the product, see below)  

Stern

The association between attribute 1 and attribute 2 is rapidly changing over time but must be reflected in the model (historical view must be kept, see below)  

Stern

It should be possible to analyze facts by the most current combination of the two attributes (see below)

Stern

 

While most of the aspects are self-explaining, I’d like to add some more information for some of the aspects here:

 

The number of existing combinations of both attributes is not significantly higher than the number of rows for the attribute with the higher row count.

Although both attributes might appear to belong to the same logical entity, putting them into one dimension table might dramatically increase the number of rows for this table. For example: Date and Time. A table for storing the data at day level contains about 3,650 rows in ten years (10 x 365). A table for storing time at the detail level of a minute contains only 1,440 rows (24 hours x 60 minutes). So keeping date and time separated results in two small dimension tables. But as each information in the time table exists for every day, the combined table would result in 3,650 x 1,440 = 5,256,000 rows. While most BI systems are capable of handing dimension of that size, it is still a very large dimension table.

 

There are fact tables that only be linked to one of the two attributes

In an earlier post I wrote about associating fact tables to different levels of granularity within the same dimension table. This is possible, however it might be confusing and it somewhat breaks the idea of a true star- or snowflake layout. But actually, this is a hard decision for the dimensional architect and has to be considered very carefully, because in most such scenarios where you face this situation, both attributes really belong to the same entity and may also form a hierarchy (one to many). But only one of the two fact tables has a detail level that is sufficient to link it to both attributes. For example, you may have revenue as actual and forecast in different fact tables. The actual revenue is at the day level but the forecast is at month level. Now, day and month perfectly belong together and even form a good hierarchy. And separating day and month into separate dimensions would be a strange solution. So consider for yourself if you want to separate those attributes or use the technique described in this post to keep them together.

 

The association between attribute 1 and attribute 2 is rapidly changing over time but must be reflected in the model (historical view must be kept)

In the context of preserving the historical view in our data ware house, we usually consider Kimball’s slowly changing dimension approach, especially slowly changing dimensions of type 2, 3 or combinations. However, separating our attributes in two different dimension tables automatically preserves the historical view, when being used from the same fact table (assuming the fact table is also linked to the date/time-dimension). This is somehow obvious but often ignored by dimensional architects. And while slowly changing dimensions of type 2 or 3 increase the number of rows in our dimension table and might not work for rapidly changing dimensions, separating the attributes keeps the individual dimension tables small.
Of course, the opposite is also true and explained as the next aspect:

 

It should be possible to analyze facts by the most current combination of the two attributes

In Kimball’s sense of slowly changing dimensions, this would be a type 1 slowly changing dimension situation. For example, the requirement is to only analyze customers (attribute 1) by their last city of residence (attribute 2). If, for any reason, customer and city are separated into different dimension tables (one named customer, the other named customer geographic for example) a relocation of the customer would require updating existing fact rows in order to get the present view (scd type 1). Updating existing fact rows is usually something you want to avoid, if possible.

No comments:

Post a Comment