Monday, May 24, 2010

Self-Service BI, PowerPivot and the future of traditional BI (DWH, OLAP, MDX)

SQL Server 2008R2 | Excel 2010/PowerPivot

Since I’ve been using the recently released Microsoft PowerPivot Add-In for Excel 2010, and after reading the rumors about the future of traditional OLAP and MDX, some questions arise about the big picture of a BI environment that includes self-service functionality. Basically, the BI world of the past accepted the idea of a central data warehouse with a metadata layer such as OLAP to present the information to the end users. What about this new player PowerPivot then? How does it fit into the picture? Is there still a future for things like OLAP, MDX and central data warehouses, or do we only need to roll out self-service BI functionality to every desktop PC? Some people have asked me about my point of view here, and although I’m not a Microsoft representative, I’d like to share my personal opinion with you:

Self-service BI tools are not and will never be a replacement for traditional BI systems, but a great enhancement to them.

Simple dashboard built using only Excel 2010 with PowerPivot. No centralized BI needed anymore?

In other words, adding self-service BI functionality to your BI system will increase the possibilities and the analytical power of the end users. But if you cut costs for centralized BI solutions believing you can replace them with self-service tools, you’ll end up with even higher costs: work that was done only once in the centralized BI system will suddenly be performed redundantly, inconsistently, in an error-prone way and with much more working time needed.

In order to understand this, let’s look at just some of the advantages a centralized BI system can offer which cannot be replaced by a decentralized self-service BI tool like PowerPivot:

Combination of data from multiple sources in a consistent and time-saving way
If this is done on a user-per-user basis, it is very likely that different users will arrive at different results. Also, this work is highly redundant. Imagine different departments going to IT in order to get “their” data exported, then trying to combine it into a single data store. Often enough, this job requires additional mapping tables (customers, articles etc. may have different IDs in different systems). We are supporting customers with multiple ERP systems (due to mergers) and the mapping can be quite complicated. But even if there is just a single source of information, mapping between different tables has to be done and requires skills and knowledge of the data models. For example, if you forget to consider a key field, the result may differ significantly. Or think about the need to exclude rows with a certain status because it means they’re cancelled. One end user might know about this, the other might not.
This means you will still need your sophisticated ETL processes, a proper front-room model, slowly changing dimensions and all the stuff we know from our typical BI projects.

Consistent use of common calculations considering approved business rules
In many cases, our current ETL processes include complicated business logic for calculations and data mappings. Key measures and performance indicators have to be calculated in a consistent way. Business rules are the backbone of the company’s information system. If different departments are comparing apples and oranges, there is a lot of room for confusion and wrong decisions. Think of a simple example like reporting the revenue. Are warranty adjustments considered? What about partial payments or commissions? What about discounts (for example staff discounts)? For all those aspects it has to be decided whether to include or exclude them in a certain key measure, and also which relation to the time dimension is correct. Is the revenue for a partial payment considered a sale (full amount) at the date of purchase, or are the actual payments (cash flow) considered? If every user has to make these decisions, it is very unlikely that everybody does the same calculations. Comparing results across departments (think of a sales meeting for the different product managers) then gets very difficult.
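In a centralized solution such a rule is encoded exactly once, for example as a calculated member in the cube script. A sketch only; all measure names here are made up for illustration:

create member CurrentCube.[Measures].[Net Revenue] as
    [Measures].[Gross Revenue]
    - [Measures].[Staff Discounts]
    - [Measures].[Warranty Adjustments],
    format_string = "#,##0.00";

Every report and every pivot table then uses the same definition of revenue, and when the business rule changes, it is changed in exactly one place.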

Security can only be implemented in a central data store
Some real world scenarios are currently looking like this: The IT department exports data for other organizational structures, manually filtering out the data that is not intended for the recipient. This might work with very simple security structures, but with more security roles, user dependent security or more information recipients, this would lead to an enormous amount of work for exporting all the data. And if there are changes to the companies security model, all the exports have to be considered again. Having a central OLAP solution makes it easy to define the security roles and access rights in a central place using the business view on the data. For example, in OLAP you can restrict a user to see the cost centers for which she or he is responsible – OLAP takes care about all related data (for example automatically filtering the cost facts to these cost centers). There is no need for a huge amount of data exports as users can retrieve the data and information they need and IT only has to make sure that the data is available and secured.
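In Analysis Services, for instance, such a restriction is defined as an allowed member set on the dimension data of a security role, written as a plain MDX set. A sketch with made-up member keys:

-- Allowed set for a role on the Cost Center dimension; with
-- "Enable Visual Totals" the fact data follows this restriction.
{[Cost Center].[Cost Center].&[4711],
 [Cost Center].[Cost Center].&[4712]}

The user assigned to this role sees only these two cost centers, and all related measures are filtered automatically.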

The need for management reporting
Management needs an overview of some or all business units. Having the information (especially the calculations, KPIs etc.) in a decentralized environment makes it very difficult to get this management reporting in a simple, time-saving way. It is more likely that IT has to do special exports which are then processed by the controlling department to build the management reports. This could result in controlling spending all its time doing data management instead of information management and actual controlling. Also, in this scenario, the data from the management reporting will most likely differ from the data of the departments. Just imagine the CEO going to some product manager saying “Hey, your product profitability is –5%” and the product manager answering “No, it isn’t. Look at MY report. Here it reads +3%”, and actually neither of them could say which result is the right one…

So, after all, if the central BI environment is so important, do we really need decentralized self-service BI? Well, not every user will need self-service BI, but for some it can be a real time saver or give them a lot of analytical power. Here are just two important scenarios:

  • The end user wants to analyze information by special properties of the data which are not present in the centralized data warehouse. Just think of product managers. Each product has a different target group and special conditions in its market, and therefore potentially certain aspects that differ from product to product, and thus from product manager to product manager. This information might only be relevant for certain products. Having it in a central data warehouse would be confusing, as the information has no meaning for most of the products. Allowing each department to cover the specific needs of its work, while still providing the central information available in the data warehouse, is the best way here.
  • The end user wants to combine the centrally provided information with other sources of data, for example information that has been purchased or acquired from external sources and which is not complete in terms of geography, time etc. Think of a marketing department planning a campaign. In order to do so, they want to analyze sales data in conjunction with external data on purchasing power. The external information was only purchased once, for the region where the campaign is planned. This kind of data cannot be loaded into the central data warehouse. But with self-service BI it can be analyzed side by side with the centrally provided sales data.
    Or think of one department trying to improve product quality by changing some of the parameters during the production cycle for some of the production batches. These changes are tracked in some other system (let’s say Excel) but not in the central DWH, as they only apply to this single production line. Self-service BI allows us to analyze the changes in the parameters side by side with data from the central data warehouse (for example quality control data in this case).

So my opinion is that traditional centralized BI systems really benefit from self-service BI functionality. However, self-service BI can never replace traditional BI. But what about the technical aspects, especially the future of MDX? MDX is the query language for multidimensional databases (used by many vendors). In PowerPivot we can do a lot of the calculations using the new expression language called DAX. Will DAX be a replacement for MDX? Absolutely not! DAX is meant to bring analytical power to Excel users. It looks similar to Excel functions, and in fact many Excel functions can be used. Its strength is simplicity. Although one could imagine extending the expression-based DAX language with query functionality (in MDX you can write both queries and expressions), this would also complicate the use of DAX, which is clearly not intended. Even with today’s Excel, many users only know about the operators +, –, / and * and the SUM function (advanced users know about SUMIF…). In order for end users, even power users, to be able to leverage the power of a self-service BI solution, the calculation functions have to be as simple as possible. This is the idea of DAX. However, when it comes to defining complex queries, building highly sophisticated business logic into calculations or KPIs, or implementing ease-of-use features like KPI trends, OLAP actions, drill through queries and navigation in hierarchies (DAX has no hierarchies), MDX has everything that’s needed. Including this functionality in DAX would only make it complicated and more difficult to understand and use. Of course, this is only my opinion, but I’m sure that MDX will still be used for what it is used for today, and DAX expressions will be used for self-service calculations and mappings that are intended to be done by non-technical people.
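To illustrate the difference: a full MDX query can combine hierarchy navigation and a calculation in a single statement, which is out of scope for the expression-only DAX of PowerPivot. A sketch against the Adventure Works sample cube (measure and hierarchy names as in that sample):

with member [Measures].[Share of Category] as
    [Measures].[Internet Sales Amount]
    / ([Product].[Product Categories].CurrentMember.Parent, [Measures].[Internet Sales Amount]),
    format_string = "0.0%"
select {[Measures].[Internet Sales Amount], [Measures].[Share of Category]} on columns,
       [Product].[Product Categories].[Subcategory].Members on rows
from [Adventure Works];

The calculated member navigates to each subcategory’s parent category, something DAX simply has no concept for, since it has no hierarchies.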

When using client-side technology to create any kind of informational insight, we have to monitor this process carefully. There has to be a process to maintain business requirements and implement changes in the central BI system in order to avoid the negative effects mentioned above. With PowerPivot, it is also possible to monitor which workbooks have been used and which data sources have been queried (if the PowerPivot sheets are published to SharePoint 2010). I think this is also important to really understand whether the self-service BI tool is used in the way it is intended to be used, or whether some analysis requirements that should have been part of the central BI solution are now being solved in multiple departments redundantly.

So again, we will see a cooperation between the centralized BI system and self-service BI solutions, as well as between MDX and DAX. Self-service BI and DAX are not the fox in the henhouse of traditional BI; they extend the vision and scope of the BI systems of today and of the future.

Sunday, May 16, 2010

Analyzing the number of visits per customer

SQL Server 2008 | SQL Server 2008 R2

OLAP is perfect for analyzing fact records that are mapped to dimensions. However, what can be done if the reference to the dimension changes depending on the selected period? I’m not talking about slowly changing dimensions here, but really about the period that is selected.

Think of the following example: A manager wants to analyze the number of visited customers for his sales force. He wants to see how many customers have been visited once, twice and so on during the last three months (or any other time period).

In order to do so, we need a dimension “Visits” (with the number of visits, for example 0, 1, 2, …), but we cannot map the visits against this dimension, because the mapping depends on the selected period. For example, say a single customer was visited once in January and once in February. If the selected period is January, we need to see one customer being visited once. If the selected period is January and February together, we need to see the customer being visited twice.

In SQL Server 2008 we can solve this by not linking our Visits dimension to the measure groups. The calculations are then performed by a cube script. For our example, we’re using a very simple data model:


The fact table FactVisits contains one row per visit (linked to date and customer). Note that the Visits dimension is not linked to the fact table. The source table for the visit dimension looks like this.


For the cube, we need to calculate the entries for the Visits dimension. To do so, I created an additional measure (Customer Count) based on some other value I found in the source fact table (the value will be overwritten by the cube script later). Usually you would create an additional column in the data source view (with value 0) to source this measure from. Here is the definition of my Customer Count measure:


In order to calculate the number of customers for each visit count, I created the following cube script:

scope ([Visits].[Visit].[Visit], [Customer].[Customer].[Customer], [Measures].[Customer Count]);
    this = iif([Measures].[Visit Count] = CInt([Visits].[Visit].CurrentMember.Properties("KEY")), 1, NULL);
end scope;

The script is computed on the leaf level of the visits (visit count) dimension. In the cube script, the expression [Visits].[Visit].CurrentMember.Properties("KEY") returns the VisitID (number of visits). If the number of visits (measure [Visit Count]) equals the number of visits found in the visit dimension, this is counted as one. Note that I used NULL instead of 0. This makes it easier to analyze which customers have been visited a given number of times, as null values can be suppressed (see the last screenshot of this post).

So, let’s check the results up to this point.
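The same check can also be run as a plain MDX query instead of a pivot table. A sketch; the cube name [Visits] is my assumption, as it is not given in the post:

select [Measures].[Customer Count] on columns,
       non empty [Visits].[Visit].[Visit].Members on rows
from [Visits];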


The calculation now works for every single line, but the totals are still wrong. The correct value for the totals can be computed using a dynamic set (to allow Excel multi-selects) like this:

([Visits].[Visit].[All])=Sum(existing [DynaVisits]);
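The set [DynaVisits] itself is not shown above; a plausible definition (my assumption) is a dynamic set over the leaf members of the Visits dimension, so that the existing operator re-evaluates it against the current (multi-)selection:

-- Dynamic, so it is evaluated in the context of each query,
-- including Excel subselects/multi-selects (assumed definition).
create dynamic set CurrentCube.[DynaVisits] as
    [Visits].[Visit].[Visit].Members;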

I also want to count more than ten visits on a special dimension element (ID 999 in the table above). This can be achieved with the following cube script:

scope ([Visits].[Visit].&[999], [Customer].[Customer].[Customer], [Measures].[Customer Count]);
    this = iif([Measures].[Visit Count] > 10, 1, NULL);
end scope;

Testing the results (here using Microsoft Excel) shows the correct calculation for the grand total as well as for other levels of the visit count hierarchy:


The way our calculation works can best be observed when looking at a single customer.


In this case we had two visits to our customer in 2008 and one visit in 2009, giving a total of three visits for 2008 and 2009 combined. Note that the actual fact is shown as a dimension property here.

We can also apply multi-select filters in Excel. Here is another sample screenshot filtering only customers with fewer than 4 visits during the last three months of 2009 (this is the ‘(multiple items)’ filter for the calendar). The data is now also analyzed by the sales representative (an attribute of the customer). Note that the grand totals are still correct:


As mentioned before, it is also possible to see which customers are listed here by including the customers in the pivot table. This is shown in the following screenshot (filtered to visit count = zero).


However, the drill through action would still return all customers (due to the definition of my measure), as the value is only based on a computation. So the drill through should be disabled here.

In this case, I simply checked the measure in order to enable the drill through action:


Of course, this example is still very simplified. Usually you would need to know the customer base, so that you’re not counting new customers as not having been visited during the last years.