Search This Blog

Saturday, July 30, 2011

Parallel hierarchies in a parent-child structure

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

This post is about a problem I faced some years ago. The source system was SAP with user defined hierarchies, in this case within the cost center and cost type tables. Parallel hierarchies are well supported in SQL Server BI but in this case, users in SAP could define multiple hierarchies on their own and they wanted these hierarchies to be also available in the OLAP cube. For example, costs associated with the cost center 1000 should be analyzed as shown below:

hierarchy

All costs that are booked on cost center 1000 have to appear in the hierarchy as shown in the sketch. And end-users may also be able to create new hierarchies (for example to analyze a certain project). Of course there may be better ways to model this but in this case we had basically two tables for the cost centers:

  • table CC (Cost Center)
  • table CCG (Cost Center Group)

Table CC contains all cost centers (for example the above cost center 1000) together with additional information (like name, responsible person etc.) while table CCG contains the hierarchy. in CCG we basically find two columns:

  • Node name
  • Parent node name
Node Parent Node
1000 Internal_HR_DE
Internal_HR_DE HR_DE
HR_DE Germany
1000 Marketing_DE
Marketing_DE Germany
Germany Corporate
Marketing_DE Marketing
Marketing Corporate

Facts (in this case actual or planned costs) are associated with the cost center number (for example 1000). Usually, parent-child hierarchies may be used in this case where we have a very dynamic structure and we do not know the number of levels. However, parent-child may only be used if each node has at most one parent. But here we find the cost center 1000 having two parents (Internal HR Costs DE and Marketing_DE). The same situation exists with the Marketing_DE node (having parents Marketing and Germany).

The solution I’m presenting here is to create additional nodes until each node only has one parent. This is possible as each node of a parent-child hierarchy in SSAS has a name and a key property. So, the name will be identical, while the key will be different. In order to show the process, let’s add internal keys to each of the hierarchy elements.

hierarchy1

What we have to do now is to create additional nodes for every node that has more than one parent. Let’s start with the ‘Marketing_DE’ node:

hierarchy2

The additional node gets a new (internal) key, in this example the number 8. But there is still a node with multiple parents: the cost center 1000. Let’s also transform this into separate nodes:

hierarchy3

After this step, each node has at most one parent and therefore the structure can be modeled as an SSAS parent-child hierarchy. 

But now, we have to think about the fact rows. Without the hierarchy, facts would have been associated to the cost center by using the internal key (surrogate key), so for example 1000 € that are booked on cost center 1000 would appear in the fact table like

DateKey CostCenterKey Amount
20110630 1 1000

But now, we have to associate this single fact row to three rows in the dimension table (as the cost center 1000 appears three times now). Therefore we have to use a many-to-many approach, so we add another table, a so called bridge table with the following rows:

CostCenterKey CostCenterDimKey
1 1
1 9
1 10

For technical reasons, our fact table has to be linked to a dimension (of flat cost centers), which is also used by the bridge table. This is shown in the following image:

hierarchy4

The most difficult part here is to “normalize” the parent-child structure. One way to do this is to use a stored procedure. Here is the code I used. Within this procedure, the following tables are used:

masterdata.Costcenter the flat table of cost centers (only leaf-level). The key field is the cost center number (for example 1000 for our cost center from above)
masterdata.CostcenterGroup the hierarchy structure as shown above
ods.CostcenterGroupExpanded Output table: the expanded tree containing the fields of the table masterdata.Costcenter plus the following additional fields:

CostcenterKey the new generated surrogate key
ParentKey the key of the parent node
Level a technical field used during iteration

Here is the code:

CREATE PROCEDURE [dbo].[ExpandCostcenterGroup]
AS
    SET NOCOUNT ON
    truncate table ods.costcenterGroupExpanded
   
    declare @level int
    declare @affectedrows int
    declare @totalrowcount int
       
    set @level=0
   
    insert into ods.costcenterGroupExpanded(costcentergroup,Parentgroup,Description1,Description2,Responsibility,AccountingArea)   
     Select distinct costcenterGroup, Parent,Description1,Description2,Responsibility,AccountingArea from masterdata.costcentergroup
       
    -- Initialize all keys   
    update ods.costcenterGroupExpanded
    set ParentKey=(select min(costcenterkey) from ods.costcenterGroupExpanded where costcenterGroup=c.Parentgroup)   
    from ods.costcenterGroupExpanded as c
    where not c.ParentGroup is null       
   
    set @affectedrows=1   
   
    while @affectedrows>0
    begin
   
        Set @level=@level+1
       
        set @totalrowcount=(select Count(*) from ods.costcenterGroupExpanded)
   
        insert into  ods.costcenterGroupExpanded(costcentergroup,Parentgroup,ParentKey,"level",Description1,Description2,Responsibility,AccountingArea)
        select distinct cparent.costcentergroup, cparent.Parentgroup,cparent.costcenterKey,@level,cparent.Description1,cparent.Description2,cparent.Responsibility,cparent.AccountingArea
        from ods.costcenterGroupExpanded as cparent inner join ods.costcenterGroupExpanded as cchild on
        cparent.Parentgroup=cchild.costcenterGroup
        where cparent.ParentKey!=cchild.costcenterKey
        and cchild."Level"=@level-1
       
        set @affectedrows=@@rowcount
       
    end
   
    return

GO

To keep things simple, I truncate the output table CostcenterGroupExpanded here. However, there is a drawback with this approach: The surrogate keys may change after changes of the imported source tables. This will result in a problem for example for the Excel users. If you’re using filters like ‘show this element only’, only the key is stored.

In order to avoid this you will need to store the mapping and the assigned surrogate key separately. Here it is necessary not only to store the combination of cost center/parent cost center/surrogate key but the whole branch up to the root instead. If you look at the example above you will find two entries of ‘Cost Center 1000’ –> ‘Marketing_DE’, so this is not unique. You have to store the full path up to the root for each node (not only for the leaf-nodes) to make it unique:

Path Given Surrogate Key
1000 –> Internal_HR_DE –> HR_DE –> Germany –> Corporate 1
1000 –> Marketing_DE –> Germany –> Corporate 9
1000 –> Marketing_DE –> Marketing –> Corporate 10
Marketing_DE –> Germany –> Corporate 7
Marketing_DE –> Marketing –> Corporate 8
 

In order to store the full path up to the root level I recommend using a hash code (MD5 for example) as this is easier to handle as a long list of node names. In this case our additional key store table would look like this

PathMD5 Given Surrogate Key
417913d10ef49f5ff90db9db9f3d2569 1
8e27be6b156a52016e01dc049bc39126 9
52b1bcaec016e09d4086f37e63814aa5 10
 

The sample code above does not manage this key store table so the keys may change a lot on each load. But for practical purposes you will have to add this key management to make sure the same node always gets the same key.

Sunday, July 3, 2011

Same measure in different granularity

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In an earlier post I wrote about handling different granularity in a dimension-fact relationship. This time I want to get back to this topic from the end-user perspective. To illustrate this, I use a very simple data model here:

Unbenannt5_thumb3

For this model, we have daily revenue data and monthly revenue plan data. Here is the link from the measure groups to the dimensions:

Unbenannt6_thumb1

In this example, both measure groups only have one measure: revenue. Since measures are a dimension of its own in the cube, measure names have to be distinct, so I named the measures “Revenue” and “Plan Revenue”.

Unbenannt7_thumb1

If your cube contains more data from other measure groups, things can soon be getting confusion with many measure groups. This post is about two common solutions in order to reduce the number of measure groups here.

Two approaches are shown here:

  • Combining both measures in one measure group by using a calculated measure
  • Introducing a scenario dimension

Before we start, one remark. The last post was about using translations in the cube the add more flexibility regarding the naming of the measures. Since the measures are clearly associated with a measure group you could be tempted to use the same name for both measures (which is possible for translations). This is how the result would look like in the cube browser:

Unbenannt8_thumb1

However, when using a Pivot client like Excel you usually don’t see the measure group anymore. A simple analysis could look like this:

Unbenannt9_thumb

So from your Pivot table you cannot tell which is the actual and which is the plan revenue. Therefore I recommend not to use the same name for different measures in the different measure groups.

Combining both measures in one measure group by using a calculated measure

One possible option is to add a calculated measure the the Revenue measure group. In order to do so, I renamed the plan revenue to “Plan Revenue Internal” and set its visibility to hidden. Then the calculation could be as shown below:

Unbenannt10_thumb

This is how the cube looks like in the cube browser:

Unbenannt11_thumb1

By “copying” the plan measure into the revenue measure group you only have one visible measure group left, while both values are still separated by the name.

As I started the post about the different granularity, how does this look like at the detail level? The following screen shot shows the date dimension at the day level:

Unbenannt12_thumb3

Assuming you have set the “IgnoreUnrelatedDimensions” property of the plan revenue measure group to false, only actual values are displayed here, no plan values. So this approach works really well.

Introducing a scenario dimension

Another common way for implementing this situation is to use a scenario dimension. The scenario dimension contains two entries: Actual and Plan. The data from the fact table FactOrder is linked to the scenario member ‘Actual’ while the data from the FactPlan table is linked to the scenario member ‘Plan’. This can be easily done in the data source view (DSV) as shown below for the plan table:

image_thumb

The fact tables are linked to the scenario dimension in the data source view:

t1_thumb

The scenario dimension should be set IsAggregatable=False (as it makes no sense to aggregate actual and plan data). Also we should provide a default element. This is shown in the screenshot below:

image_thumb1

For the cube, we only want to have one single measure revenue which is neither of the two existing measures. Therefore we make both existing measures invisible. In order to distinguish them from visible measures, I prefixed them with an underscore:

image_thumb2

Both measures have their Visible-property set to false. The only visible measure in this case has to be a calculated measure that takes the data from one of the two invisible measures depending on the chosen scenario. Since our scenarios are distinct and not aggregatable, we can simply add both measures (as one of the two is always zero):

image_thumb3

Choosing this model for implementing the different granularity, the cube shows only one measure:

image_thumb6

If the scenario dimension is not used, actual values are shown. You can still easily display both values by putting the scenario dimension on one of the axis:

image_thumb5

If you compare this screenshot with the one from above you will see the same values, only the presentation is a little bit different (first option had two measures, last option one measure and a scenario dimension).

Summary

Both options shown here enable us to have the same measure in two or more measure groups of different granularity. The second option with the scenario dimension looks a little bit more tidy from a technical perspective. However, sometimes the first option is easier to use, especially when you add other calculated measures with formulas that are combining two or more scenarios (for example a measure like ‘Actual to plan ratio’). For those measures it’s not possible to assign them to one scenario. On the other hand, choosing the scenario dimension can dramatically reduce the number of visible measures in your cube, which makes the cube easier to understand for end-users. So depending on the requirements and the structure of the data, one of the two options will be the best to choose.