
Thursday, December 27, 2012

7 reasons why you should prefer SCD 1 over SCD 2+

dimensional modeling

When it comes to dimension design, a common question is how to deal with attributes that change over time. A frequently used pattern for such attributes was developed by Ralph Kimball with his concept of slowly changing dimensions (SCD). One aspect is handling source systems that simply overwrite their master data while you want to preserve past attribute values for your analytics. I’m not going into too much detail here, but will focus on the two most important types:

Type 1: changes are directly overwritten in the dimension table. Past values are not preserved and are lost

Type 2: past values for dimension data are kept using date from/to columns and/or a current row indicator

There are more types and also combinations of types (hybrid types). You can find much more about slowly changing dimensions here.
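To make the difference concrete, here is a minimal sketch of a dimension table prepared for type 2 tracking (all table and column names are illustrative, not from any specific system):

    -- Minimal sketch of an SCD-2 customer dimension (illustrative names).
    CREATE TABLE dbo.DimCustomer (
        CustomerSK   int IDENTITY(1,1) PRIMARY KEY, -- surrogate key, one per version
        CustomerBK   int           NOT NULL,        -- business key from the source system
        CustomerName nvarchar(100) NOT NULL,        -- attribute tracked over time
        ValidFrom    date          NOT NULL,
        ValidTo      date          NOT NULL,        -- e.g. 9999-12-31 for the current row
        IsCurrent    bit           NOT NULL
    );

A type 1 change simply updates the attribute on all rows of the business key; a type 2 change closes the current row (ValidTo, IsCurrent) and inserts a new row with a new surrogate key.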

When talking to other BI architects I frequently hear the opinion that type 2 should be used for almost every attribute. I don’t agree with this statement; I try to use SCD type 1 wherever possible, and type 2 only if there is a real business requirement for it. To be clear, there’s nothing wrong with type 2, but it adds significant complexity to your BI solution and, in some cases, has other downsides to be aware of. This post collects 7 topics that you should know about when deciding between type 1, type 2 or any other method of dealing with historical values.

 

1. Corrections versus historically relevant changes

Usually we don’t want corrections, like a typo in a customer’s name, to be kept historically. For example, if a customer’s name was corrected from ‘Miler Inc.’ to ‘Miller Inc.’, we don’t want to see this as two different customers in our reports:

Instead of

[Image: report showing ‘Miler Inc.’ and ‘Miller Inc.’ as two separate customers]

we want to see

[Image: report showing a single customer ‘Miller Inc.’]

However, if Miller Inc. consolidates with Hiller Inc. to form “Miller & Hiller Inc.”, we might want to see them as separate customers, for example like this:

[Image: report showing ‘Miller Inc.’ and ‘Miller & Hiller Inc.’ as separate customers]

So, the problem with corrections is obvious: when data changes in the source system, we might not be able to distinguish corrections from changes that need to be preserved over time. One option could be to use a function like soundex to treat minor changes to such a field as a type 1 modification (applied to the most current type 2 entry) and all other changes as type 2. For the example above, “Miler Inc.” and “Miller Inc.” both result in soundex code “M460”. But using soundex (or a similar function) does not guarantee the result we want, i.e. it cannot reliably separate corrections from real modifications. It’s still just a guess.
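You can verify this directly in T-SQL:

    -- Both spellings map to the same soundex code, so the change could be
    -- treated as a type 1 correction rather than a type 2 change.
    SELECT SOUNDEX('Miler Inc.')  AS OldCode,  -- M460
           SOUNDEX('Miller Inc.') AS NewCode;  -- M460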

So it’s much more common to treat attributes/fields where we mostly expect corrections as SCD-1 and attributes that rather experience structural changes as SCD-2. Again, this is not guaranteed to be correct.

In many cases where we need SCD-2, a good approach is the following:

  • If we only have a text field, make it SCD-1
  • If we have a key/value field (because the value is taken from another table), treat it as
    • an SCD-2 change if the key changes
    • an SCD-1 change if the text for that key changes

In special cases, things might be more complicated. For example, if you have ZIP code and city, a change to the ZIP code alone is more likely a correction (an SCD-1 change), while a change to both ZIP code and city name more likely indicates a relocation and therefore results in an SCD-2 change.
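A detection sketch for the key/value rule from the list above might look like this (staging and dimension names are made up for the example; NULL handling is omitted for brevity):

    -- Classify incoming changes: a changed key counts as type 2,
    -- a changed text for an unchanged key as type 1 (illustrative names).
    SELECT s.CustomerBK,
           CASE
               WHEN s.RegionKey  <> d.RegionKey  THEN 'SCD-2' -- structural change
               WHEN s.RegionText <> d.RegionText THEN 'SCD-1' -- correction/renaming
               ELSE 'No change'
           END AS ChangeType
    FROM staging.Customer s
    JOIN dbo.DimCustomer d
      ON d.CustomerBK = s.CustomerBK
     AND d.IsCurrent  = 1;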

Either way, we are trying to guess the intention the user had when making the modification, just by looking at the result.

 

2. Comparing with data from the past

When the requirement is to compare current data with data from the past, it seems obvious that we need SCD-2 to preserve the correct dimension structures from the past. But is this really true? Actually, it depends on what you need the analysis for. For example, take a look at the following chart showing the sales of a specific product group.

[Image: line chart of the product group’s monthly sales, rising until September, then dropping sharply]

The first thing to note is that this chart breaks a lot of information design rules. The confusion is intensified by the faked continuity of a line chart (shame on me): the line obfuscates the real situation by showing a continuous descent during September, while actually the value drops instantly. Now, what can you say when looking at this line chart?

  • The product group performed OK between January and September
  • September was a great month
  • Between September and October a catastrophe must have happened. Maybe a competitor launched better and cheaper products, or something caused bad publicity

And how would an end user react if you said that none of this is true? The descent in October is nothing but an SCD-2 change (a what?), as one major product was moved to a different product group. In fact nothing happened: no catastrophe, no better competitor, just a change in the master data. This is a typical situation with SCD-2 changes over time: results are not self-explanatory and may show jerky leaps when data is plotted over time.

If the product hierarchy were maintained as SCD-1, the chart might look like the green line below:

[Image: the same chart with an added green SCD-1 line declining steadily until October and recovering in November and December]

I think you can clearly see the advantage. The green line (SCD-1) is free of SCD-2 leaps, so it’s much easier for end users to understand. The interpretation is more like:

  • We had some problems between January and October
  • but we managed a turnaround in November and December

So, whenever you want to see trends, derive insights for future development or find outliers, the SCD-1 view is much easier to understand without additional explanations about master data changes.
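Technically, both lines can be produced from the same SCD-2 dimension; they differ only in how the facts are joined. A sketch, assuming a product dimension with surrogate key, business key and current-row flag (illustrative names):

    -- Historical (SCD-2) view: each fact keeps the product group
    -- that was valid when the fact was booked.
    SELECT d.ProductGroup, SUM(f.SalesAmount) AS Sales
    FROM dbo.FactSales f
    JOIN dbo.DimProduct d ON d.ProductSK = f.ProductSK  -- surrogate key join
    GROUP BY d.ProductGroup;

    -- Current (SCD-1) view: all facts are re-assigned to today's product group.
    SELECT cur.ProductGroup, SUM(f.SalesAmount) AS Sales
    FROM dbo.FactSales f
    JOIN dbo.DimProduct hist ON hist.ProductSK = f.ProductSK
    JOIN dbo.DimProduct cur  ON cur.ProductBK  = hist.ProductBK
                            AND cur.IsCurrent  = 1
    GROUP BY cur.ProductGroup;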

 

3. Mixture of different concepts (type 1, type 2 …)

Maybe we don’t want all attributes to be type 2. The requirement could be that some hierarchies are type 2 while others are type 1. Again, think about the end users: mixing both concepts may be very confusing. In some respects you may be able to compare with printed reports from the past; in others, data is shown differently. However, making all attributes type 2 is also a problem (see ‘corrections’ above). One option might be to use perspectives: one perspective for the current type 1 view, one for the past type 2 view. However, another option would be to question whether you really need the type 2 view at all, especially in a mixed form.

 

4. Justification or reproduction of results from the past

The requirement for keeping dimensional data in its historical form (past view, type 2) often results from the need for justification. If we want to be able to reproduce results exactly as they were at some point in the past, we need to keep track of changes. Maybe someone asks why you hadn’t already seen the sales problems from the example above (green line) in April. With the SCD-2 version of the chart (the blue line from above), you could prove that the chart for April still looked fine, so obviously you’re not responsible for the mess (this doesn’t solve the problem, but it may be a good justification). Other reasons to reproduce a report with exactly the same values as in the past are compliance requirements (SOX, Basel II etc.). However, if you build your BI solution to justify yourself, keep in mind:

  • Keeping historical values for the dimensions is not sufficient in most cases
  • Fact data may have been corrected or may not have been available at the time the report was generated in the past

So, often enough, you also need to preserve the historical state of your fact data. Maybe you end up with a separate key date dimension to show the data exactly as it was at a given point in time. Besides the need to keep the historical state of all attributes in all dimensions (even for corrections!), this may also raise the complexity and costs of your BI system significantly. So if you don’t have these requirements, there is no reason to implement this historical view.
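As a sketch, such an as-of query could look like this, assuming validity dates on the dimension and a snapshot date on the preserved facts (all names illustrative):

    -- Reproduce the report exactly as it was on a given key date.
    DECLARE @AsOf date = '2012-04-30';

    SELECT d.ProductGroup, SUM(f.SalesAmount) AS Sales
    FROM dbo.FactSalesSnapshot f
    JOIN dbo.DimProduct d
      ON d.ProductBK = f.ProductBK
     AND @AsOf BETWEEN d.ValidFrom AND d.ValidTo  -- dimension as it was
    WHERE f.SnapshotDate = @AsOf                  -- facts as they were
    GROUP BY d.ProductGroup;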

 

5. What is the “real” time of the change?

If we have virtual structures that exist only in some master data system, are maintained only there and are not delimited in time (valid date from/to), SCD-2 detection usually works as expected. This might be true for typical objects in controlling, such as a cost center hierarchy or a profit center. In other cases, however, the time of the change in the master data system does not necessarily match the real time of the change.

Think of a senior sales manager. A very important customer was assigned to this sales manager at the beginning of the year. Then the sales manager wins a big deal with this customer, but the BI system doesn’t show the revenue associated with the sales manager. What went wrong? That’s what the sales manager will want to know in such a situation. Let’s assume the customer relationship department simply forgot to move the customer over to the new sales manager. After being notified, they immediately correct the error. But then? If the BI system gets no valid from/to information from the source system, it can only observe the date on which the change occurred. Data from the past will still be associated with the wrong sales manager, usually causing trouble and confusion.

Of course, there is no simple solution for this case. If you don’t get a proper validity date range from the source system, the issue can only be solved by correcting the case manually. The point I want to make here is that SCD-2 alone might not be able to solve such situations, as it is impossible to guess the real date of a change just by observing the change in a master data table.
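Such a manual correction usually means backdating the recorded validity range, for example (a sketch with made-up keys and dates):

    -- The customer belonged to the new sales manager since the start of
    -- the year, not since the date the change was observed.
    UPDATE dbo.DimCustomer
       SET ValidFrom = '2012-01-01'
     WHERE CustomerBK = 4711
       AND IsCurrent  = 1;

    -- ...and close the previous version at the same boundary.
    UPDATE dbo.DimCustomer
       SET ValidTo = '2011-12-31'
     WHERE CustomerBK = 4711
       AND IsCurrent  = 0
       AND ValidTo   >= '2012-01-01';

Note that facts loaded in the meantime may still reference the old surrogate key and would have to be re-assigned as well.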

 

6. Persisted data in BI-system

Of course it’s normal to persist data in the data warehouse. But many smaller BI systems don’t need to, as all data is still available in the source systems. This situation has some benefits: for example, it’s easy to include new topics of data for the full history (complete re-load), and there is no need to back up the data warehouse. But if you are in such a comfortable situation and a requirement comes up to include SCD-2 while the source system simply overwrites its data, you can no longer take all of that data from the source system. Again, not much of a problem, but it adds a certain complexity to your BI system.

 

7. Testing

If the source system overwrites its data while the BI system maintains history in an SCD-2 style, it’s more difficult to test the results of your source system against the BI system: the source system can only report data in its most current state, while your BI system reports some (or all) attributes from the historical perspective. Testing is very important, and in many cases you can even automate it (using a tool like BI.Quality). So if you cannot reproduce the current state of the data from your BI system, testing becomes very difficult. For many situations my recommendation is therefore to use SCD-6 (1+2+3) instead of SCD-2, even if the SCD-1 view is needed for nothing but testing purposes. Again, this adds complexity.
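With a current-value (type 1) column kept next to the historical one, the reconciliation becomes a simple query; any returned row indicates a deviation (a sketch with illustrative names):

    -- Compare the source system's current state against the BI system's
    -- current-value column; the result should be empty.
    SELECT s.CustomerBK, s.Region AS SourceRegion, d.RegionCurrent AS DwhRegion
    FROM source.Customer s
    JOIN dbo.DimCustomer d
      ON d.CustomerBK = s.CustomerBK
     AND d.IsCurrent  = 1
    WHERE s.Region <> d.RegionCurrent;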

 

Conclusion

The concept of slowly changing dimensions belongs to the foundation of BI data modeling. However, keeping historical values using type 2 (SCD-2) may have negative side effects and raises the complexity of your BI system. So it’s good advice to handle historical changes carefully and to be fully aware of those side effects.

Sunday, December 16, 2012

Hierarchies with text fields (attribute relationship)

SQL Server 2005-2012

When setting up hierarchies in multidimensional SSAS, attribute relationships between the levels are highly recommended. There might be scenarios in which the source data breaks those attribute relationships; usually this can be solved by choosing a proper key, in some cases a combined key. I wrote about this topic in earlier posts (see here and here).

In general you are in a good situation if the attribute relationship is modeled as a 1:n relationship between each pair of corresponding levels in the relational model of the source system. In this case you can rely on the keys of those tables. For example, in AdventureWorks we find tables like Product, ProductSubcategory and ProductCategory for the product hierarchy. Each table has a foreign key relationship to the next level, so we can be absolutely sure that we have a true 1:n relationship between products and their subcategory and between the subcategory and the category (using the keys from the relational database). In our attribute relationships we can rely on those keys, and no matter how the content of text fields changes in these tables, the attribute relationships for our hierarchies will still work.

However, and this is the topic of this post, you might be in a situation where the source system only delivers some text fields and you have no information about how these fields are linked in the source system. Maybe there isn’t any link at all and the text fields are nothing but additional columns at the same grain as your dimension key. If it’s a requirement to build a hierarchy from these text fields, and if you want to set up the attribute relationships correctly, changes in the source data might break those attribute relationships and therefore cause dimension processing to fail whenever you use the text fields as the attribute keys.

To illustrate this situation, let’s take a look at the following data table which is just an example of an interface file (maybe a csv-file):

[Image: example interface data with columns Office Id, City, State, Sales Territory and Manager]

The requirement is to create the following drill down path (hierarchy) based on this data:

Manager —> Sales Territory —> State —> City —> Office

Checking the existing rows of data, the attribute relationships can be set up identically to this drill-down path, as all of the following relationships are one-to-many:

  • Office Id to City
  • City to State
  • State to Sales Territory
  • Sales Territory to Manager

So you might be tempted to create the attribute relationships based on those text fields and to use the text field itself as the key for the given attribute. The dimension will process correctly and everything seems fine. But in this case, with no proper keys from the source system, we have no guarantee at all that the cardinality between those attributes will always stay like this. For example, let’s assume the following 3 new offices are established:

[Image: three new rows for offices 6, 7 and 8]

The first row is a second office in New York. This row does not break any of the relationship settings above. But office 7 is managed by Linda Randall instead of Barbara Moreland, so the relationship between sales territory and manager is no longer 1:n but n:m. And office 8 is newly founded in Florence, Kentucky, which is different from Florence, Alabama, and therefore breaks the relationship between city and state. Of the four attribute relationships above, only two are still valid after inserting the rows for offices 6-8. It may be possible to process the dimension correctly for some time, but at any time rows may show up that break a relationship (offices 7 and 8 in the example above), causing dimension processing to fail with a duplicate key error.

The actual error message will look something like this:

Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Office', Column: 'City', Value: 'Florence'. The attribute is 'City'.

You will find more information about this kind of error, and about other possible causes, here.
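Before processing the dimension, you can detect such violations in the relational source with a simple cardinality check; the same pattern works for each of the four relationships (table name illustrative):

    -- City names that occur in more than one state would break the
    -- City -> State attribute relationship.
    SELECT City
    FROM dbo.Office
    GROUP BY City
    HAVING COUNT(DISTINCT State) > 1;  -- returns 'Florence' after office 8 is added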

So how can we prevent this from happening, and how can we immunize our hierarchy against changes in the data? The solution is quite simple: if we don’t get reliable keys from our source system, we have to compute them ourselves. One idea might be to use combined text fields, which would work, but for performance reasons we prefer integer keys for the attributes.

The easiest way to create these keys is to establish a lookup table for each attribute, as shown below. For each attribute we build a lookup key that contains the attribute together with all higher-level attributes from our hierarchy and compute an artificial key (for example, using an identity column). In our example, for the city, we have to include city, state, sales territory and manager in the lookup. I’m using a simple text field in this example, but in general a hash code (like MD5) would be a better solution.

[Image: city lookup table with a combined text lookup key (city, state, sales territory, manager) and a numeric city key]

Whenever a lookup key is not found, a new row is inserted, thus resulting in a new key.
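A minimal sketch of such a lookup table and the insert-if-missing logic, using plain text concatenation for the lookup key (illustrative names; as mentioned, a hash like MD5 would be preferable in practice):

    -- One artificial integer key per distinct combination of the city
    -- and all attributes above it in the hierarchy.
    CREATE TABLE dbo.CityLookup (
        CityKey   int IDENTITY(1,1) PRIMARY KEY,
        LookupKey nvarchar(400) NOT NULL UNIQUE  -- City|State|Territory|Manager
    );

    -- Insert combinations that are not yet known, creating new keys.
    INSERT INTO dbo.CityLookup (LookupKey)
    SELECT DISTINCT o.City + '|' + o.State + '|' + o.SalesTerritory + '|' + o.Manager
    FROM dbo.Office o
    WHERE NOT EXISTS (
        SELECT 1 FROM dbo.CityLookup l
        WHERE l.LookupKey = o.City + '|' + o.State + '|' + o.SalesTerritory + '|' + o.Manager
    );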

Using this lookup table, we can enrich our original data with reliable keys, as shown below:

[Image: the office data from above, enriched with the computed city keys]
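In T-SQL, this enrichment is a simple join on the lookup key (same illustrative names as above):

    -- Attach the computed city key to the original rows.
    SELECT o.*, l.CityKey
    FROM dbo.Office o
    JOIN dbo.CityLookup l
      ON l.LookupKey = o.City + '|' + o.State + '|' + o.SalesTerritory + '|' + o.Manager;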

And no matter how the data changes in our source table, we will still be able to create attribute keys that fulfill the 1:n requirement for the attribute relationships.

OK, I hear you say that this was an easy case. What can you do if there are multiple parallel hierarchies instead of only one? How can the keys be computed in such a case? Well, maybe you have already guessed it: if there are no reliable keys from the source system, there is usually no workaround but to create the attributes per hierarchy. For example, if the users also want a simplified hierarchy like

Office—>Sales territory

we end up with a second attribute (column) for the sales territory. If you don’t create separate attributes, you might end up with duplicated entries confusing the users of the hierarchy. This happens because elements in our dimension/hierarchy are identified by their key. With a single column, our full hierarchy from above would be correct, while the simplified hierarchy would look like this:

  • Southeast
    • Office 5
    • Office 8
  • Southeast
    • Office 7

The reason is that, using the key of the full hierarchy, we end up with two different versions of “Southeast” (one for manager Barbara and one for Linda, in order to preserve the 1:n relationship to the manager). But of course, what the end user would expect is something like this:

  • Southeast
    • Office 5
    • Office 7
    • Office 8

This explains why we need separate attributes (and separate keys) for each of the hierarchies.

And again, all of the above is only necessary if we cannot fully rely on the 1:n relationship.