
Saturday, March 30, 2013

What’s the buzz about MPP Data Warehouses (part 1)?

PDW v1/v2

In the context of more and more data and the need to analyze this data, you might also have stumbled over MPP approaches for large data warehouses. MPP stands for massively parallel processing, in contrast to SMP, which means symmetric multiprocessing. A good way to contrast both worlds: in an SMP machine, you usually have multiple CPUs that share resources (RAM, disks) and are therefore well suited to boost performance on CPU-critical tasks, while in an MPP machine you also have multiple CPUs, but this time each CPU has its own memory. Therefore MPP systems are better suited for workloads where you need a very high throughput of data. And this is what we typically see in data warehouses: we need to load large amounts of data and we need to efficiently query large portions of this data.

But wait, comparing DWH and OLTP, I’m thinking of the following situation:

image

The main difference between an OLTP and a DWH solution is the data model, not the underlying hardware or database server software. Or, in other words, a modern database server should be suitable for both workloads, OLTP and DWH. The data model (database layout), however, differs a lot: in an OLTP database, we want to reduce redundancy and therefore build the data model by normalizing the data into transactional and reference data, with potentially complex relationships between tables. In a DWH model, on the other hand, we want to be able to read large amounts of data with simple queries and therefore prefer a de-normalized model (star schema).

And as long as we do not have too much data, this point of view works fine. It’s surprising that it does, as both use cases have quite different requirements for the underlying infrastructure.

OLTP-System

  • Usually small batches of data (transactions), often structured in a complicated way (covering many tables)
  • Needs to be able to roll back changes spanning multiple tables (complex transactions)
  • Ensure data integrity (foreign keys, other constraints)
  • Ensure simultaneous read/write access of users to the same data (isolation levels)
  • Support rich programming features (triggers, user defined functions etc.)

DWH-System

  • Load large amounts of data at specific loading times
  • Query large amounts of data (often in “full scan”), create aggregates

But still, for small amounts of data, you don’t have to consider the infrastructure too much. However, as the amount of data and its complexity grows, you have to think about ways of optimizing your data warehouse architecture:

image

The first step is to apply best practices to your data warehouse model. For example, loading large amounts of data is not a good idea if you have active foreign key constraints or, even worse, triggers on your tables. But having a feature in a database product does not necessarily mean that you have to use it. So, here are some of these best practices (a short T-SQL sketch follows the list):

  • Avoid active foreign key constraints when loading large amounts of data
  • Use table partitioning and partition switching for updates rather than individual row insert/update processes (for example: late arriving facts)
  • Avoid granular transaction logging (simple recovery model)
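As an illustration of the partition switching idea, here is a minimal T-SQL sketch. All table, file and constraint names are invented for this example, and the fact table is assumed to be partitioned by a DateKey range (partition function pf_Date):

-- 1. Bulk load into a staging table that matches the fact table's structure and filegroup.
--    With the simple (or bulk-logged) recovery model and a table lock, this load is minimally logged.
BULK INSERT dbo.FactSales_Staging
FROM '\\etlserver\incoming\sales_201303.dat'
WITH (TABLOCK, BATCHSIZE = 100000);

-- 2. Add a check constraint so SQL Server knows the staged rows fit the target partition.
ALTER TABLE dbo.FactSales_Staging WITH CHECK
ADD CONSTRAINT ck_staging_datekey CHECK (DateKey >= 20130301 AND DateKey < 20130401);

-- 3. Switch the staging table into the corresponding partition of the fact table.
--    This is a metadata-only operation: no rows are copied and no row-by-row logging occurs.
ALTER TABLE dbo.FactSales_Staging
SWITCH TO dbo.FactSales PARTITION $PARTITION.pf_Date(20130301);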

At this step, you haven’t really touched the data warehouse system infrastructure at all, so your database server is still “universal”. At the next level of complexity, we usually start tuning the machine itself, for example (see the sketch after the list):

  • Choose a specific layout for your IO (SAN, RAID)
  • Choose a specific distribution of database files and file groups (log, temp etc.)
  • Use specifically tuned machines for the different tasks, like staging, ODS, data warehouse, data marts
  • Use server clusters to balance workload and provide high availability
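As a small, purely illustrative T-SQL sketch of such a file and filegroup layout (drive letters, sizes and names are made up), fact data, transaction log and system objects are placed on separate volumes so that sequential log writes and large scans do not compete for the same disks:

-- Fact data on a dedicated filegroup spread over two LUNs, the log on its own volume.
CREATE DATABASE DWH
ON PRIMARY
    (NAME = dwh_sys,    FILENAME = 'E:\SQLData\dwh_sys.mdf',    SIZE = 1GB),
FILEGROUP FG_FACTS
    (NAME = dwh_facts1, FILENAME = 'F:\SQLData\dwh_facts1.ndf', SIZE = 200GB),
    (NAME = dwh_facts2, FILENAME = 'G:\SQLData\dwh_facts2.ndf', SIZE = 200GB)
LOG ON
    (NAME = dwh_log,    FILENAME = 'H:\SQLLog\dwh_log.ldf',     SIZE = 50GB);
GO

-- Keep transaction logging minimal for the DWH workload (see the best practices above).
ALTER DATABASE DWH SET RECOVERY SIMPLE;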

At this step, SQL Server becomes more and more optimized for the data warehouse workload. It will still be possible to run OLTP workloads, but maybe less efficiently, as we have started to optimize for the DWH workload.

However, as the amount of data grows, one question comes to mind: wouldn’t it be better to really optimize the database server for the DWH workload and, consequently, not consider OLTP requirements during this optimization? This opens up different ways of storing and handling the data. If we follow this path, we get an infrastructure that might not be suited for OLTP traffic at all, but perfectly supports large loads and fast reads of very large amounts of data.

image

MPP data warehouse solutions like Teradata, Oracle Exadata, IBM Netezza, Greenplum and the Microsoft Parallel Data Warehouse are examples of this approach. Usually, the approach is a shared-nothing MPP architecture of nodes, each of which has its own segment of data on its own disks (no shared memory or disks). In the most rigorous implementations, all components (including the hardware) are tuned and aligned for exactly this purpose. To achieve this, pre-installed and pre-configured appliances are commonly used: instead of buying hardware and software individually and trying to make it run well and fast, you get a “black box” (i.e. one or more racks) of components and software that are selected and configured in the best possible way.
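To give a first impression of what this means for the data model, here is a rough sketch in the T-SQL dialect of the Parallel Data Warehouse (table and column names are invented for illustration): large fact tables are typically hash-distributed across the compute nodes, while smaller dimension tables are replicated to every node to avoid data movement during joins.

-- Fact table: rows are spread across all compute nodes by a hash of CustomerKey.
CREATE TABLE dbo.FactSales
(
    DateKey      int   NOT NULL,
    CustomerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));

-- Dimension table: a full copy is stored on every node.
CREATE TABLE dbo.DimProduct
(
    ProductKey   int           NOT NULL,
    ProductName  nvarchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);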

In part 2 of this post, I’ll show the basic ideas of this shared nothing architecture and how query performance can benefit from the distribution of data on several compute nodes.

Sunday, March 24, 2013

When data gets big

BI

So this post is about big data. When looking around on the internet, you can find amazing examples of big data. For example, the Large Hadron Collider at CERN generates about 15 petabytes of data every year (see http://www.lhc-facts.ch/index.php?page=datenverarbeitung for details). That is about 41 terabytes per day. Impressive. However, you might argue that you don’t have such a collider in your company. In fact, most companies will only have to deal with a very small fraction of this amount of data. So where does big data start for common business applications? And what does it mean for IT strategy? Does it have an influence, or is it just a matter of scaling and improving systems – a process that we always have to go through in IT to keep up with business requirements?

Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (http://en.wikipedia.org/wiki/Big_data). And I think that this definition is a good starting point because it focuses not only on the amount of data but puts that amount in relation to what we do with the data (process). Let me paraphrase this definition:

I’m talking of big data if my analytical capabilities are no longer sufficient because of the amount or complexity of the data, and if I’m not able to scale these capabilities using traditional approaches (more RAM, more servers in my clusters, more disks etc.). It’s not difficult (and not expensive) to store petabytes of data. It’s difficult to process that data, to do analytics on it and to gain insights.

So, to be honest, even a few hundred million rows of data may be big if I’m not able to perform the important analytics that I need to support my core business processes in a timely manner. And there are two things to keep in mind:

  • Modern ideas for modeling markets, together with complex statistical models and methods, are available.
  • While it may be difficult for us to apply these methods, our competitors may already be doing so.

Also, another aspect of analytical capabilities is the question “Do I have the right data or do I need other data sources?”. Limits in analytical capabilities may also exist because I don’t have the information I would need. In today’s world, with lots of data markets (like the Microsoft Azure Datamarket, http://datamarket.azure.com/), you can get information and data that you might not even have dared to dream of a few years ago. Now you get data about consumer trends, your competitors or global trends, and you get this data in a reliable, accurate and up-to-date way. Again, this increases the amount of data you need to process and, by that, may worsen the analytical restrictions that result from the sheer amount of data.

But then, this still is nothing new. As I mentioned before, IT has had to keep up with these requirements every year. We added more cores and bought newer machines. The database algorithms improved, and we used OLAP and other technologies to speed up analysis. But let me get back to the second half of the definition above: “it becomes difficult to process using on-hand database management tools or traditional data processing applications”. If you like, you may replace “difficult” with “makes it more expensive”, “costs more effort” or – in some cases – “makes it impossible, at least for the required period of time”. For example, if you want to calculate a complex price elasticity model in retail and it takes you a month to do so, the result will not be useful anymore, as the situation in your market might already have changed significantly (for example because of your competitors’ campaigns).

Again, this is not really new. In the past you may have added other components to your IT strategy, for example OLAP. Or you replaced a slow database solution with a faster one. And you focused on scalability in order to cope with these challenges. So, you might look at some typical components of a big data environment in just the same way. Here are some of them (be careful, buzzword mode is switched on now):

  • Hadoop, Hive, Pig etc.
  • In-Memory Computing
  • Massively parallel processing (MPP)
  • Cloud
  • Complex event processing (CEP)
  • Etc.

[buzzword mode off] However, if you think of these components merely as a traditional way of enhancing the IT infrastructure, you might be thinking in the wrong direction. The main thing about big data is that when you get to the limits of your analytical capabilities (as in the definition above), there are almost always tools and methods to get beyond those limits. However, these tools may require some fundamental changes in the IT ecosystem. With MPP databases, for example, it’s not enough to buy one and put all your data on it; it’s about re-shaping the BI architecture to match the new paradigm of those systems.

image

In the next posts, I’ll go a little deeper into this topic: the fundamental changes in the big data architecture and especially MPP databases.

Saturday, March 9, 2013

Dashboards, Scorecards, KPIs… are they really useful?

Business Intelligence

When talking about dashboards, scorecards and KPIs, a discussion sometimes emerges about the usefulness of all this. One common argument against these management tools is that they are not flexible enough and that they cannot really cover the most important aspects of business. An important characteristic of a KPI is that you can measure it. However, the important aspects of management seem to go far beyond measurable indicators, for example the invention of a new kind of product or service, or extending your business to another country or market. These are strategic decisions, ideas, inventions. How can we put them into KPIs? KPIs seem to be the exact opposite of these strategic ideas. In fact, a typical thesis about KPIs is

“If your KPIs are green you can go golfing”

Almost every manager I talked to disagrees with that. There is no such thing as a KPI that indicates that there is no need for action. As we are in a competitive market, there is always the need to think about future improvements, new market strategies and plans. And even if somebody is determined to work with KPIs, the next discussion revolves around questions like “do they need to be real-time?” and “do I need them on my mobile device?”.

But let’s start at the beginning. We have to concentrate on the aspects of management before we can decide how these aspects can be supported by measures or KPIs. Looking at your task list, it should be possible to arrange all the tasks on the following board:

image

While there is common agreement that management should delegate topics in squares 1 and 3 (not important), the role of the other two squares is essential. To make it short, the square which deserves the highest focus by management is square number 4, tasks which are not (yet) urgent but very important. It is this square where you find the invention of new ideas, new products, the development of your staff etc. It is so important because this will ensure your position in the future. If management doesn’t work on these topics, nobody does. However, in daily business it’s square number 2 that often disturbs this creative process.

With a little bit of simplification we can concentrate the work of a manager on those two squares:

Square 2: Operational aspects

Square 4: Strategic development of the operational unit

Looking at those two squares you find tasks that are very different:

Operational                                  | Strategic
Repeating, well defined                      | New, inventive
Urgent, high impact on operational processes | Not urgent, but will become crucial in the future
Clearly defined measures                     | Hard, if not impossible to measure

Of course, this is a very broad simplification, just to make the point. Strategic decisions lead to tactical tasks which can then be tracked and measured and/or detailed to operational decisions.

KPIs can easily be set up for operational or tactical topics. And for these kinds of KPIs it’s also important that they are near-time, always available and support some kind of push alerts. Think of a manager who is responsible for a certain step in production. Strategically, the manager will think about better ways of performing this production step, maybe methods which use less energy or improve quality. But if there is a failure in the machine and production stops because of it, we have an escalation which deserves the full attention of the manager.

Therefore, I would like to modify the statement from above to

“If your operational KPIs are green you can take your time to think about the strategically important aspects of your business”

In this sense, KPIs are very useful for getting a quick overview of the operational topics you are responsible for. Such KPIs could cover production output, the number of complaints in a call center, quality measures, automatic testing procedures etc. And now imagine you have these KPIs on your mobile device in real (or near) time. You are informed about operational problems long before any of your managers notice them. This enables you to take countermeasures and to provide answers instead of questions. Having such KPIs gives you the power to act instead of react. And once the KPIs go back to green, you can concentrate on the strategically important aspects of business.

I’m not saying that KPIs don’t make sense for strategic topics at all. But they are on a more long-term basis. Having them on a mobile device, for example, doesn’t give much benefit. If they are not urgent, you can also monitor them the next time you are at your desk. With operational KPIs you might not have this time.

So, having access to operational KPIs and alerting is important for management. How can we put these KPIs on scorecards and dashboards? One common mistake is the belief that these KPIs are “static”. Sometimes dashboards are created using reporting tools: people decide which KPIs are important and those are developed into a “dashboard”. Things are even worse if developers are needed in order to change KPIs. For example, think of a product manager being responsible for a new product.

Situation: The product needs to be developed in time for the exhibition in fall. Only six months to go…
KPIs needed:
  • Development progress of the product
  • Development progress of the marketing material

Situation: Six months later, at the exhibition, there is a lot of interest in the new product. Many people are placing orders.
KPIs needed:
  • Monitor the delivery time of the product
  • Monitor the production rate of the product

Situation: Two months later, there is an unusually high rate of complaints. Obviously, there are quality issues. After investigation, the source of those issues is found.
KPIs needed:
  • Monitor the complaint rate
  • Monitor the guarantee process
  plus:
  • Monitor the delivery time of the product
  • Monitor the production rate of the product

Situation: Again four months later: the quality issues are solved, but sales are decreasing. The reason is that a competitor has developed a similar product at a lower price. Your answer is to increase marketing and to reduce the price, while working on new features to come up with a more competitive product soon.
KPIs needed:
  • Monitor the marginal return of the product
  • Monitor the sales rate
  • Monitor the development of new features
  • Monitor the competition

I think it’s clear that a useful dashboard/scorecard has to adapt to the needs of business in a timely manner. While the dashboard of a car is simple and mostly static, a business dashboard is not. If such a dashboard is developed using reporting tools, it’s likely that your users will only use it for a few weeks or months. After that, it doesn’t reflect the business needs any more and has to be adjusted. Even for car dashboards we find more flexibility nowadays: if you’re running low on fuel, the car’s GPS gives you advice on nearby gas stations; when approaching your destination, the car informs you about parking facilities. So even the static dashboard of a car is becoming more and more dynamic and context-sensitive. Why should a less flexible dashboard work for business?

 

Conclusion

KPIs, scorecards and dashboards based on operational measures add a true benefit to the work of a manager. If a KPI turns red, immediate action is necessary. Even if the situation is already under control, the manager needs to be informed and potentially needs to inform others. Therefore, these KPIs have to be real-time/near-time and have to be available everywhere (mobile devices). And having these KPIs in status “green” does not mean that the manager has nothing to do, but that the manager has time to concentrate on the most important aspects of work: the strategic development of the operational unit and its staff. But it’s important that these KPIs are flexible – as flexible as the business is. Therefore reporting tools are not a good way of building dashboards unless the set of KPIs can easily be adjusted using self-service tools. A more promising approach is the use of visual BI tools that allow end users to easily adapt dashboards and visualizations without the delays resulting from round trips to IT and development.

Saturday, March 2, 2013

DAX vs. MDX: An example

SQL Server 2012

I’m often asked about the key differences between DAX and MDX or, more generally, the differences between the tabular and the multidimensional model. From my presentation at the PASS Camp 2012 in Germany, I’d like to share some of the ideas here.

From the perspective of an expression or query language, one of the most important differences is the concept behind both approaches.

For a cube, we have the concept of a tuple for addressing a cell in the cube space. The axes in the tuple set the coordinates. If we have a single tuple, the result is the content of the corresponding cell in the cube. As cube attributes have an All member which serves as the default (in most cases) when the attribute is omitted from the tuple, we also get aggregated values this way. For example, a tuple like

(Date.Calendar.[Calendar Year].&[2013], Measures.[Internet Sales Amount])

returns the (aggregated) sales amount for the year 2013. Other attributes (for example Product) are on their default members. As you can see, there is no need to supply an aggregation function in this case (although MDX has such functions for aggregating values over sets), as the cube contains the definitions of how to aggregate the sales amount. The following sketch illustrates this way of addressing a value in a cube:

image

For the tabular model, filters within our pivot table work like filters on the underlying tables. Even with a single selected value, multiple rows can be included in the filter of a table. For example, if you filter the year to 2013, the underlying date table will be filtered to all 365 days of that year. For a given measure, all filtered tables are intersected, giving the detail rows that correspond to the filter. As there are potentially many rows of data after this process, an aggregation function is required to compute the aggregate. This is shown in the following illustration:

image

If you are more familiar with SQL than with MDX, the concept of filtering and aggregating in DAX will feel quite natural. In SQL, as in DAX, we usually restrict table rows (using the WHERE clause in SQL or the FILTER function in DAX), then we create groups (using the GROUP BY clause in SQL or the SUMMARIZE function in DAX) and finally we compute the aggregates using an appropriate aggregation function (like SUM).
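As a rough SQL sketch of this pattern (assuming an Adventure-Works-like schema with a FactInternetSales fact table and a DimDate dimension; the table and column names are only for illustration), restricting, grouping and aggregating looks like this:

-- Restrict rows (WHERE), build groups (GROUP BY), then aggregate (SUM):
-- the same three steps as FILTER, SUMMARIZE and SUM in the DAX query further below.
SELECT   d.MonthName,
         SUM(s.SalesAmount) AS Umsatz
FROM     dbo.FactInternetSales AS s
JOIN     dbo.DimDate AS d
         ON d.DateKey = s.OrderDateKey
WHERE    d.CalendarYear = 2007
GROUP BY d.MonthName;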

However, many questions that require this process in SQL or DAX can be solved in MDX just by addressing cells. Let me provide you with an example that I frequently use during my MDX trainings: the requirement is to create a calculated measure that gives the sales amount for sales on weekends. If you have a strong SQL background, your solution in MDX may look like this:

Aggregate(
    filter(
        descendants(
            [Date].[Calendar].currentmember,
            [Date].[Calendar].[Date]
        )
        ,
        [Date].[Day of Week].currentmember IS [Date].[Day of Week].[Sunday]
        or [Date].[Day of Week].currentmember IS [Date].[Day of Week].[Saturday]
    )
    ,[Measures].[Internet Sales Amount]
)

This seems to be a natural approach. Using the Descendants function, we create a set of all the dates under the selected date member (for example a month, a quarter or a year). We then filter those dates using the Filter function, so that only Sundays and Saturdays remain in the set. Finally, we aggregate the sales amount measure over this set.

Actually, this way of approaching the calculation is very similar to SQL or DAX. In DAX, for example, we would write the calculation in almost exactly the same way:

evaluate(
    summarize(       
        filter(
            'Internet Sales'
            , related('Date'[Calendar Year])=2007
        )                   
        , 'Date'[Month]
        , "Umsatz"
        , Sum('Internet Sales'[Sales Amount])
        , "UmsatzWE"
        , Calculate(
            Sum('Internet Sales'[Sales Amount])
            , Or(
                'Date'[Day Name Of Week]="Sunday"
                , 'Date'[Day Name Of Week]="Saturday"
            )
         )
    )
)

Although this looks very similar to the MDX code above, that MDX code is actually close to the most complicated solution available. Since the weekday is a cube attribute, we can simply address sales on weekends by using a tuple (ok, a sum of two tuples):

([Measures].[Internet Sales Amount], [Date].[Day Name].[Sunday])
+
([Measures].[Internet Sales Amount], [Date].[Day Name].[Saturday])

So when writing a DAX query, we tend to think about:

  • How do I filter my underlying tables?
  • What aggregation function do I need?

In MDX, on the other hand, we tend to think about:

  • Which axes do I have to address in order to pull my value out of the cube?