Sunday, May 19, 2013

Reminder for PASS SQLSaturday #230 (Germany)

Don’t miss the SQLSaturday on July 13, 2013. You can find more information about the location, registration and the agenda here:

http://www.sqlsaturday.com/230/

Also take a look at the agenda of the Pre-Conference (July 12, 2013) with interesting speakers and presentations. Here is the link for the details and the registration:

http://sqlsaturday230.eventbrite.de

Saturday, May 18, 2013

Connecting to PDW from PowerPivot

PDW v1/v2

When connecting to a PDW from PowerPivot, there might be some confusion about what to enter as the server name.

Confusion might even start at the very beginning when choosing the proper external data connection. After opening PowerPivot, you might expect to find the PDW connection behind the “From Database” ribbon icon. But you have to choose “From Other Sources” instead:

image

In the following dialog, choose “Microsoft SQL Server Parallel Data Warehouse” as the database type:

image

Next, a dialog opens to enter the connection details.

While you can choose any name you like for the connection name, the server name may be confusing as you usually enter a host name or IP address here. But for the PDW you have to enter the path to an IDS file containing the connection information:

image

What does such an IDS file look like? Let’s take a look at an example:

[Provider]
ProviderName=Microsoft SQL Server MPP OLE DB Provider
clsid={7D5C1E01-747C-4f39-8BEF-A88133706917}
[DSNInfo]
Description=MyPDWConnection
[Properties]
Host=192.168.27.23
Port=17001
Database=AdventureWorksDW
UseLDAP=0
DistinguishedName=
Encrypted=0
LoadBalancing=0
ConnectionRetryCount=0
ConnectionRetryDelay=3
AlternateServers=
DriverCompatibility=0
LogonID=mylogin
Password=mypassword
StatementFailover=0

Be sure to enter the correct IP address and port (the default is 17001 for TDS and 17000 for SequeLink, but TDS is much faster), as well as the correct user name and password.
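Since a typo in the IDS file only shows up when the connection fails, it can help to check the file before pointing PowerPivot at it. The following is a minimal Python sketch, not an official tool: the function name check_ids and the list of required keys are my own assumptions, and the sample content is shortened from the example above.

```python
import configparser

# Sample IDS content, shortened from the example above (placeholder values).
IDS_TEXT = """\
[Provider]
ProviderName=Microsoft SQL Server MPP OLE DB Provider
[DSNInfo]
Description=MyPDWConnection
[Properties]
Host=192.168.27.23
Port=17001
Database=AdventureWorksDW
LogonID=mylogin
Password=mypassword
"""

# Assumption: these are the entries worth checking; adjust as needed.
REQUIRED_KEYS = ("Host", "Port", "Database", "LogonID", "Password")


def check_ids(text):
    """Return a list of problems found in the [Properties] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original key casing
    parser.read_string(text)
    if not parser.has_section("Properties"):
        return ["missing [Properties] section"]
    props = parser["Properties"]
    problems = [
        f"missing or empty {key}"
        for key in REQUIRED_KEYS
        if not props.get(key, "").strip()
    ]
    port = props.get("Port", "")
    if port and not port.isdigit():
        problems.append("Port is not numeric")
    return problems


print(check_ids(IDS_TEXT))  # [] means nothing obvious is missing
```

The check only looks at the file content. It does not try to connect, so a wrong but well-formed IP address or password will still pass.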

Sunday, April 28, 2013

Distributed or replicated table? And what is important when choosing the distribution key?

PDW v1/v2

For large tables, we usually look for tuning options like creating an index or having a good partitioning strategy in place. For the Parallel Data Warehouse (PDW), additional decisions have to be made about the table layout.

Distributed or replicated?

The first decision is about the way the table is stored on the compute nodes. There are two options:

  1. Replicated
    All data of the table is available on all compute nodes
  2. Distributed
    The data of the table is distributed between the compute nodes

Distributing table data between the compute nodes follows the real nature of the MPP system. In PDW this distribution is established using a hash function on a specific table column which is referred to as the distribution key. On the other hand, replicated tables have their full content available on every compute node.

Creating a table in one of the two modes is quite easy:

Distributed table:

CREATE TABLE MyTable(
    ID int NOT NULL,
    ...
)
WITH (DISTRIBUTION = HASH(ID))

Replicated table:

CREATE TABLE MyTable(
    ID int NOT NULL,
    ...
)
WITH (DISTRIBUTION = REPLICATE)

But when do we choose a distributed or replicated table?

As a rule of thumb, you will want to create tables that contain reference data (or, as we say in the data warehouse environment, dimensions) as replicated tables if they are not too big. The reason is simple. The typical data warehouse query is a star join between the fact and the dimension tables, with where-conditions on columns of the dimension tables, grouping (group by) on columns of the dimension tables and aggregations based on columns of the fact tables. So, if we distribute the large fact tables in order to leverage the full power of the MPP engine, having the dimensions on each compute node allows the compute node to answer the query without needing data from other compute nodes. Let’s take a look at the following query based on a customer dimension which is linked to a sales table:

select Sum(SalesAmount) from FactSales
inner join DimCustomer On FactSales.CustomerKey=DimCustomer.CustomerKey
where DimCustomer.Region='EMEA'

No matter on which key the FactSales table is distributed, having the DimCustomer table replicated means that each compute node can individually compute the sum of sales for the customers in the EMEA region. There still has to be a final aggregation of the results coming from each compute node, but in this case, this is just one row per compute node.
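To make the idea more tangible, here is a small Python simulation of the query above. It is only a sketch with made-up rows and two imaginary compute nodes; the variable and function names are my own and have nothing to do with how PDW works internally.

```python
# DimCustomer is replicated: every node holds the full mapping
# CustomerKey -> Region (made-up sample data).
dim_customer = {1: "EMEA", 2: "APAC", 3: "EMEA"}

# FactSales is distributed: each node holds only its own rows,
# given here as (CustomerKey, SalesAmount).
fact_sales_by_node = [
    [(1, 100.0), (2, 50.0)],           # node 0
    [(3, 25.0), (1, 10.0), (2, 5.0)],  # node 1
]


def node_partial_sum(fact_rows, dim, region):
    """Join, filter and sum using only data that is local to one node."""
    return sum(amount for key, amount in fact_rows if dim[key] == region)


# Each node computes its partial result independently ...
partials = [
    node_partial_sum(rows, dim_customer, "EMEA") for rows in fact_sales_by_node
]
# ... and the final aggregation only sees one value per node.
total = sum(partials)

print(partials)  # [100.0, 35.0]
print(total)     # 135.0
```

Note that node_partial_sum never looks at rows from another node. This is exactly what the replicated dimension buys us.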

Also consider that reads and writes are much faster with distributed tables (parallel processing) than with replicated tables. This is one reason why replicated tables should be used for smaller amounts of data.

 

Choosing a good distribution key

The following aspects are important when choosing a distribution key:

  • What kind of workload do we have?
    (do we usually see lots of “atomic” reads, returning only a few rows, or do we more likely expect large scans and aggregates on the table)
  • What are the typically performed joins among the tables?
  • What are the typically performed aggregations (group by) used on the tables?
  • How is the distribution key itself distributed?
    Choosing a distribution key which is unequally distributed will result in skew. The different distributions of the table should contain almost the same number of rows in order to parallelize queries well over the compute nodes. If all the data sits on one compute node because of a bad distribution, this node becomes the bottleneck and you cannot expect good performance.
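The effect of skew is easy to simulate. The Python sketch below hashes two candidate keys into 8 distributions and compares the largest distribution with the average. Please note the assumptions: SHA-256 is only a stand-in for the (different) hash function PDW uses, the data is made up, and skew_ratio is my own ad-hoc measure.

```python
import hashlib
from collections import Counter


def distribution_of(key, distributions):
    """Map a key value to a distribution number (stand-in hash function)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % distributions


def rows_per_distribution(keys, distributions):
    """Count how many rows end up in each distribution."""
    counts = Counter(distribution_of(k, distributions) for k in keys)
    return [counts.get(d, 0) for d in range(distributions)]


def skew_ratio(counts):
    """Largest distribution relative to the average (1.0 = perfectly even)."""
    return max(counts) / (sum(counts) / len(counts))


# Candidate 1: a unique order id. Candidate 2: a country code where
# 90% of all rows share the same value.
order_ids = range(10_000)
countries = ["DE"] * 9_000 + ["FR"] * 500 + ["UK"] * 500

even = rows_per_distribution(order_ids, 8)
skewed = rows_per_distribution(countries, 8)

print("order id:", even, round(skew_ratio(even), 2))
print("country: ", skewed, round(skew_ratio(skewed), 2))
```

With the country code as key, at least 9,000 of the 10,000 rows land in a single distribution, so one distribution does most of the work while the others are nearly idle.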

I’ll get back to the challenge of finding a good distribution key in later posts.

Please keep in mind that for both decisions (distributed vs. replicated, and the choice of the distribution key) you don’t have to make a decision that lasts forever. In fact, it’s quite unlikely that you will come up with the best solution at the very beginning. It’s quite easy to redistribute a table based on another distribution key or to turn a distributed table into a replicated one. For both scenarios, CTAS can be used. CTAS stands for Create Table As Select. This is quite similar to the Select into syntax on the SMP SQL Server. For example, if you want to change the distribution key of a table Sales, you could follow these steps:

  • CTAS Sales to SalesNEW having the new distribution key
  • rename Sales to SalesBAK
  • rename SalesNEW to Sales
  • drop SalesBAK
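The four steps can be sketched as a small Python helper that only assembles the statements and does not execute anything. The statement text reflects my understanding of the CTAS and RENAME OBJECT syntax and should be checked against the product documentation; the function name and the new key CustomerKey are hypothetical.

```python
def redistribute_statements(table, new_key):
    """Build the four statements for moving `table` to a new distribution key."""
    new_name = f"{table}NEW"
    backup_name = f"{table}BAK"
    return [
        # 1. CTAS into a new table that is hashed on the new key
        f"CREATE TABLE {new_name} WITH (DISTRIBUTION = HASH({new_key})) "
        f"AS SELECT * FROM {table}",
        # 2. move the old table out of the way
        f"RENAME OBJECT {table} TO {backup_name}",
        # 3. give the new table the original name
        f"RENAME OBJECT {new_name} TO {table}",
        # 4. drop the backup once the result has been verified
        f"DROP TABLE {backup_name}",
    ]


for statement in redistribute_statements("Sales", "CustomerKey"):
    print(statement + ";")
```

Keeping the backup table until the new one has been checked makes it possible to undo the change with two renames.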

Monday, April 22, 2013

Big Data and Analytics

PDW v2 | Big Data

In my previous post about Big Data, I used a “definition” which can be abbreviated as

“data that is too big for analysis within the required time”

The key aspects of this phrase are:

  1. size of data
  2. time frame for the analysis
  3. complexity of analysis

 

The time frame can be real time, near time, a few hours or maybe even days. This depends on the business requirement. The size of data may get bigger than expected because you need additional data sources (for example external data from market places) for your analysis. But today I’d like to focus on the third bullet point: the complexity of analysis.

If you don’t have complex analysis requirements and if you have plenty of time, you can process terabytes of data without any big data issues. Remember that storing a huge amount of data is not the big problem. But retrieving the data and doing analysis on this data is much more challenging.

But what are complex analytical computations? In SQL we can do a lot of computations by aggregating detailed values (sum, average, min, max etc.). And for many of the typical business performance indicators, this works quite well. But what about the following tasks:

  • Frequency analysis and decompositions (Fourier-/Cosine-/Wavelet transformation) for example for forecasting or decomposition of time series
  • Machine learning and data mining, for example k-means clustering, decision trees, classification, feature selection
  • Multivariate analysis, correlation
  • Projections, prediction, future prospects
  • Statistical tests (for example chi-squared or binomial)
  • Trend calculations, predictions and probability for certain trends or scenarios
  • Complex models involving simulations (for example Monte Carlo simulation for risk analysis)
  • binomial, normal or other types of distributions and density functions

 

For example, a decomposition of a time series into its main components may look like this:

image

(Source: R, temperature in Nottingham taken from the datasets library)

Decomposing time series can be helpful to analyze periodicity and trends of sales data, for example. This could be important for calculating the effect of promotions or to understand seasonal effects.
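To give an idea of what happens behind such a chart, here is a minimal Python sketch of a classical additive decomposition (moving-average trend plus an average seasonal figure per position in the cycle). This is one of several possible methods and not necessarily the one used for the chart above. The function name is my own, and the series is synthetic, not the Nottingham data.

```python
import math


def decompose_additive(series, period):
    """Split a series into trend + seasonal + remainder (classical method)."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # centred moving average: the two end points count half
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period

    # average detrended value for each position within the cycle
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    seasonal = [means[t % period] - centre for t in range(n)]

    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic monthly data: linear growth plus a yearly cycle (4 years).
sales = [10 + 0.5 * t + 3 * math.sin(2 * math.pi * t / 12) for t in range(48)]
trend, seasonal, remainder = decompose_additive(sales, 12)

print(round(trend[20], 3))     # trend value in month 20
print(round(seasonal[3], 3))   # seasonal effect at the peak of the cycle
```

The series needs to cover at least two full periods. The first and last half period have no trend value because the moving average needs data on both sides.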

And this is just one example. As long as you can only slice and dice your existing data, you’re always looking at the past. But in order to derive ideas and guidance for future decisions, more sophisticated methods are required than just sum/group by. Some people even say that this is where Business Intelligence starts. Everything else is just an analysis of the past, which is also important, but there is so much more to find. The current discussion about data scientists clearly shows the rising demand for getting more out of your data. And to be honest, having a data scientist work with just a tool like Excel is like having Dr. House use just a folding rule as a medical instrument instead of all the sophisticated laboratory instruments and equipment… it doesn’t work.

So, there are a lot of calculations that go far beyond the capabilities of traditional SQL. Therefore, we usually need to load the data from our data warehouse into some kind of analytical or statistical tool which is specialized in such calculations. The results can then be transferred back into the relational database tables. As the focus of such tools differs from the focus of databases, these tools are usually separated from the database but offer interfaces (for example ODBC or flat file) to load data. Common tools are R, SAS and MATLAB, just to name a few. R (http://cran.r-project.org), for example, is a toolset for doing advanced calculations and research level statistical computing. R is open source and can easily be extended using packages (libraries). Today, a huge number of such packages exist for all kinds of different tasks.

However, when it comes to Big Data, the process of unloading all the required data can be very time consuming. So for Big Data analytics it’s important to bring both worlds together. This would be the perfect match. For doing so, the following two options are most promising:

  1. Using Hadoop (Map/Reduce)
  2. Using In-Database Analytics

 

Hadoop

PDW v2 offers a seamless integration with Hadoop using Polybase. This makes it easy and fast to export data to a Hadoop infrastructure. Research level analytics can then be performed on the Hadoop cluster. For this purpose, R supports distributed analysis and Map/Reduce jobs using the HadoopStreaming library. But we’re still copying the data out to the analytical environment, right? Yes, but in this scenario, each infrastructure is used in an optimal way:

    • PDW for high-performance SQL queries to combine and arrange the data in the format needed for analytical engines (more like a NoSQL format, for example to prepare variables for data mining).
    • Hadoop for distributed parallel computing tasks using Map/Reduce jobs
    • High performance (massive parallel) data transfer between the MPP (PDW) and Hadoop.
    • Transparent access of the analytical results using SQL (seamless integration of relational and non-relational data with Polybase)

Preparing the data for analytics can be a complex and challenging process. Usually data from multiple tables needs to be joined and filtered. Using SQL is the best choice for this task. For example, for preparing call center data for a mining model, it may be necessary to create variables (a single row of data) that contain the number of complaints per week over the last weeks. This can then be used to build a decision tree. In SQL, this task is easy, and in an MPP environment we get the best performance for it. For the decision tree we need to perform a feature selection at each node of the tree. This involves statistical functions and correlation which reach far beyond SQL. Using the analytical environment is the best choice for such advanced calculations. The resulting decision tree (rules, lift chart, support probabilities etc.) can then be stored as a file on the Hadoop cluster and from there be queried or imported back into relational database tables using Polybase.

 

In-Database Analytics

Another approach is to operate the analytical engine on the same platform and on the same data as the MPP database system. This approach ties both worlds together in a very consistent way, but it’s currently not available on the PDW (although it is on my personal wish list). However, in other MPP environments, this approach is not uncommon. For example, in SAP HANA you can write stored procedures in R just like this:

CREATE PROCEDURE myCalc (…)
LANGUAGE RLANG AS
BEGIN
END;

The procedure body is then standard R code using R syntax, not SQL.

Typical features for In-Database Analytics include:

  • Analytical stored procedures
  • In-database analytics: direct access to database tables and views from the analytical engine without needing to load/unload the data
  • Tables/Views as parameters for the analytical functions (for example R data frames)
  • Full utilization of in-memory capabilities
  • Full utilization of the parallel query engine

 

Conclusion: In order to perform sophisticated analysis based on your BI data, SQL is not sufficient. Specialized toolsets like R are the best solution. However, when it comes to Big Data, loading/unloading the data into these toolsets may not be efficient anymore. A closer integration is necessary. Using Hadoop or In-Database Analytics are promising approaches for this scenario.

Saturday, April 6, 2013

What’s the buzz about MPP Data Warehouses (part 2)?

PDW v1/v2

In my first post I wrote about the need for a consistently tuned and aligned database server system in order to handle a high data warehouse workload in an efficient way. A commonly chosen implementation for this is a massively parallel shared nothing architecture. In this architecture your data is distributed among several nodes, each with its own storage. A central node processes incoming queries, calculates the parallel query plan and sends the resulting queries to the compute nodes. In a simplified form, this architecture looks as shown below:

image

Since different vendors choose different detail strategies, from now on I’ll focus on the Microsoft Parallel Data Warehouse, or in short, the Microsoft PDW. The PDW is Microsoft’s solution for MPP data warehouse systems. The PDW ships as an appliance, i.e. as a pre-configured system (hardware and software) of perfectly compatible and aligned components, currently available from HP and DELL.

What happens if data is loaded into such a system? Let’s assume we have a table with 6 rows of sales and, for simplicity, let’s assume we only have two compute nodes. In order to distribute the data among the compute nodes, a distribution key needs to be chosen. This key will be used to determine the compute node for each row. Why don’t we just do a round robin distribution? I’ll get back to this point later in this post. The distribution key (table column) is used in a hash function to find a proper node. The hash function takes into account the data type of the distribution key, as well as the number of distributions. Actually, in PDW the table is also distributed on the compute node itself (8 different tables on different files/file groups) to get the optimal usage of the compute node’s cores and the optimal throughput to the underlying storage. For our example, let’s assume that the date values hash to the nodes as shown in this illustration:

image
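The routing of rows can also be written down as a few lines of Python. This is only an illustration under clear assumptions: the hash function (SHA-256) is a stand-in and not the one PDW uses, so the actual placement will differ from the illustration, and the six sales rows are made up.

```python
import hashlib

NODES = 2
DISTRIBUTIONS_PER_NODE = 8  # as described above


def place_row(key):
    """Map a distribution key value to (compute node, local distribution)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    slot = int.from_bytes(digest[:4], "big") % (NODES * DISTRIBUTIONS_PER_NODE)
    return slot // DISTRIBUTIONS_PER_NODE, slot % DISTRIBUTIONS_PER_NODE


# Six made-up sales rows: (Date, Sales Amount); Date is the distribution key.
sales = [
    ("2013-01-01", 10), ("2013-01-02", 20), ("2013-01-01", 5),
    ("2013-01-03", 7), ("2013-01-02", 1), ("2013-01-03", 2),
]

for date, amount in sales:
    node, distribution = place_row(date)
    print(date, amount, "-> node", node, "distribution", distribution)
```

The important property is that the placement depends only on the key value: all rows with the same date end up on the same node, which a round robin distribution cannot guarantee.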

As you see, each row of data is routed to a specific compute node (no redundancy). Doesn’t this make the compute node a single point of failure in the system? Actually no, because of the physical layout of the PDW. In PDW v2, two compute nodes share one JBOD storage system, one of them communicating actively with the JBOD, the other using the InfiniBand network connection. The compute nodes themselves are “normal” SQL Server 2012 machines running on Hyper-V. If a compute node fails, the data is still reachable using the second compute node that is attached to this JBOD. The compute nodes form an active/passive cluster, therefore the spare node can take over if a node fails. The damaged node can easily be repaired or replaced. A Hyper-V image for a compute node sits on the management node (which I omitted in the illustration above). And again, this is just a very broad overview of the architecture. You can find very detailed information on the technology here: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx

With the example above, what happens if we query a single date? Since we distributed the table on the date, a single compute node contains the data for this query. The control node can pass the query to the nodes and has no more action to take. The compute node containing the data can directly stream the data to the client. The same would happen if we run a query that groups by Date (and potentially filters by some other columns). Now both compute nodes can separately compute the result and stream the result to the client. What you see from this example is:

  • In this case, two machines work in parallel and fully independently from each other
  • Since we distributed on Date and the query uses Date in the grouping, no post processing is necessary (we call this an aggregation compatible query)
    (if the data had been distributed using a round robin approach, there would never be an aggregation compatible query)
  • In order to get the best performance in this case, it’s important that the data is equally distributed between the compute nodes. In the worst case of all the data being queried sitting on only one of the compute nodes, this one node would have to do the full work. Choosing a proper distribution key can be challenging. I’ll get back to this in a subsequent post.

What happens if we run a query like the following?

select Sum([Sales Amount]) from Sales

Again, each compute node can compute its individual partial result, but now these results need to be sent to the control node to calculate the final result (a so-called partition move). However, in this example, the control node gets far fewer rows to process compared to the total number of rows. Imagine millions or billions of rows being distributed to the compute nodes. The control node in this example would only get two rows with partial results (as we have two compute nodes in our example). So, this operation still fully benefits from the parallel architecture.

image

And this works for most kinds of aggregations. For example, if you replace sum([Sales Amount]) with avg([Sales Amount]), the query optimizer would ask the compute nodes for the sum and count and compute the average in the final step.
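Why sum and count, and not simply the average per node? The following Python sketch with made-up numbers shows the difference. The function names are my own and only mimic the two steps described above.

```python
def node_partial(amounts):
    """What each compute node returns: (sum, row count) of its local rows."""
    return sum(amounts), len(amounts)


def final_average(partials):
    """Final step: combine the partial results into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Two compute nodes holding a different number of rows (made-up data).
node_rows = [[10.0, 20.0, 30.0], [40.0]]
partials = [node_partial(rows) for rows in node_rows]

print(final_average(partials))                          # 25.0
print(sum(s / c for s, c in partials) / len(partials))  # 30.0 (wrong)
```

Averaging the per-node averages gives 30.0 instead of the correct 25.0, because the nodes hold different numbers of rows. With sum and count as partial results, the final step is always exact.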

Ok, usually data models are more complicated (even in a data warehouse) than a single table. In a data warehouse we usually find a star (or snowflake) relationship between facts and dimensions. For the sales table above, this could look like this:

image

What happens now, if the query above is filtered by the product group, which is an attribute of the product dimension?

select sum(S.[Sales Amount]) from Sales S
inner join Product P on S.ProductKey=P.ProductKey
where P.ProductGroup='X'

How should we distribute the product table on the compute nodes in order to get good query performance? One option would be to distribute on the same key as the Sales table. If we can do so, each compute node would see all products that are related to the sales that sit on this compute node and could therefore answer the query without needing any data from other nodes (we would call this a distribution compatible query). However, the date is not a column in the product table (this wouldn’t make sense), so we cannot distribute the products in this way. It would work if we had a SalesOrderHeader and a SalesOrderDetail table, both joined and distributed on a SalesOrderID. But for the product dimension (as for most other dimensions too) we can go for a more straightforward approach. Fortunately, in a data warehouse, dimensions usually contain very few rows compared to the fact tables. It’s not uncommon to see over 98% of the data in fact tables. So for the PDW, it would make no difference if we put this table on all compute nodes. We call this a replicated table, while the Sales table itself is a distributed table. By making the Sales table distributed and the dimensions replicated, each compute node can answer queries that filter or group on dimension table attributes (columns) autonomously without needing data from other compute nodes.

 

image

Of course, this is just a very brief overview to show the basic concept. If we need to scale the machine, we could add more compute nodes and (after a redistribution of the data) easily benefit from the higher computing power. This means we can start with a small machine and add more nodes as required, which gives great scalability. For the PDW v2 you can scale from about 50TB to about 6PB with a linear performance gain.

Also, PDW v2 offers a lot more features. In particular, I’d like to mention the clustered column store index (CCI), which is a highly compressed, updatable in-memory storage of tabular data. Together with the parallel processing of the compute nodes, this gives awesome performance when querying data from the PDW. Also, the seamless integration with Hadoop (via PolyBase) allows us to store unstructured data in a Hadoop file system and query both sources transparently from the PDW in the well-known SQL syntax, without IT needing to transfer the data into the relational database or to write map-reduce jobs.

Again, there is much more to read about the PDW. A good starting point is the PDW website: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx

 

Conclusion

Data warehouses with large amounts of data have challenges that go far beyond just storing the data. Being able to query and analyze the data with good performance requires special considerations about the system architecture. When dealing with billions of rows, classical SMP machines can easily reach their limit. The MPP approach, which distributes data on multiple, independent compute nodes, can provide a robust and scalable solution here. The Microsoft Parallel Data Warehouse is a good example of this approach that also includes features like in-memory processing (clustered column store index in v2) and a transparent layer on both structured and unstructured data (Hadoop).

Saturday, March 30, 2013

What’s the buzz about MPP Data Warehouses (part 1)?

PDW v1/v2

In the context of more and more data and the need to be able to analyze this data, you might also have stumbled over the MPP approaches for large data warehouses. MPP stands for massively parallel processing, in contrast to SMP, which means symmetric multiprocessing. A good definition of both worlds can be found here: In an SMP machine, you usually have multiple CPUs sharing the same resources (RAM, disks); such machines are therefore well suited to boost performance on CPU critical tasks. In an MPP machine you also have multiple CPUs, but this time, each CPU has its own memory. Therefore MPP systems are better suited for a workload where you need a very high throughput of data. And this is what we typically see in data warehouses. Here we need to load large amounts of data and we need to efficiently query large portions of this data.

But wait, comparing DWH and OLTP, I’m thinking of the following situation:

image

The main difference between an OLTP and a DWH solution is the data model and not the underlying hardware or database server software. Or, in other words, a modern database server should be suitable for both workloads, OLTP and DWH. The data model (database layout), however, differs a lot: In the case of an OLTP database, we want to reduce redundancy and therefore build the data model by normalizing the data into transactional and reference data, with potentially complex relationships between tables. On the other hand, in a DWH model we want to be able to read large amounts of data with simple queries and therefore prefer a de-normalized model (star schema).

And as long as we do not have too much data, this point of view works fine. It’s surprising that it does, as both use cases have different requirements to the underlying infrastructure.

OLTP-System

  • Usually small batches of data (transactions), usually structured in a complicated way (covering many tables)
  • Needs to be able to roll back changes spanning multiple tables (complex transactions)
  • Ensure data integrity (foreign keys, other constraints)
  • Ensure simultaneous read/write access of users to the same data (isolation level)
  • Support rich programming features (triggers, user defined functions etc.)

DWH-System

  • Load large amounts of data at specific loading times
  • Query large amounts of data (often in “full scan”), create aggregates

But still, for small amounts of data, you don’t have to consider the infrastructure too much. However, as the amount of data and its complexity grow, you have to think about ways of optimizing your data warehouse architecture:

image

The first step is to apply best practices for your data warehouse model. For example, loading large amounts of data is not a good idea if you have active foreign key constraints or, even worse, triggers on your tables. Having a feature in a database software does not necessarily mean that you have to use it. So, here are some of these best practices:

  • Avoid active foreign key constraints when loading a large amount of data
  • Use table partitioning and partition switching for updates rather than individual row insert/update processes (for example: late arriving facts)
  • Avoid granular transaction logging (simple recovery model)

At this step, you didn’t really touch the data warehouse system infrastructure at all. So your database server is still “universal”. At the next level of complexity, we usually start tuning the machine itself, for example:

  • Choose a specific layout for your IO (SAN, RAID)
  • Choose a specific distribution of database files and file groups (log, temp etc.)
  • Use specifically tuned machines for the different tasks, like staging, ODS, data warehouse, data marts
  • Use server clusters to balance workload and provide high availability

At this step, the SQL Server becomes more and more optimized for data warehouse workload. It will still be possible to run OLTP workload too, but maybe less efficiently, as we started to optimize for DWH workload.

However, as the amount of data grows, one question comes to mind: Wouldn’t it be better to really optimize the database server for DWH workload? And, consequently, to ignore OLTP requirements while doing this optimization? This opens up different ways of storing and handling the data. If we follow this path, we get an infrastructure that might not be suited for OLTP traffic at all, but perfectly supports large loads and fast reads of very large data.

image

MPP data warehouse solutions like Teradata, Oracle Exadata, IBM Netezza, Microsoft Parallel Data Warehouse or Greenplum are examples of these approaches. Usually, the approach is a shared nothing MPP architecture of nodes which have their own segment of data on their own disks (no shared memory or disk). Taken to its logical conclusion, all components (including the hardware) are perfectly tuned and aligned for this purpose. To achieve this, pre-installed and configured appliances are commonly used, so instead of buying hardware and software individually and trying to make it run well and fast, you get a “black box” (i.e. one or more racks) of components and software that are selected and configured in the best possible way.

In part 2 of this post, I’ll show the basic ideas of this shared nothing architecture and how query performance can benefit from the distribution of data on several compute nodes.

Sunday, March 24, 2013

When data gets big

BI

So this post is about big data. When looking around on the internet, you can find amazing examples of big data. For example, the Large Hadron Collider at CERN generates about 15 Petabytes of data every year (see http://www.lhc-facts.ch/index.php?page=datenverarbeitung for details). This is about 41 Terabytes each day. Impressive. However, you might argue that you don’t have such a collider in your company. In fact, most companies will only have to deal with a very small fraction of this amount of data. So where does big data start for common business applications? And what does it mean for the IT strategy? Does it have an influence, or is it just a matter of scaling and improving systems, a process that we always have to go through in IT to keep up with business requirements?

Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (http://en.wikipedia.org/wiki/Big_data). And I think that this definition is a good starting point because it focuses not only on the amount of data but puts that amount in relation to what we do with the data (process). Let me paraphrase this definition:

I’m talking of big data if my analytical capabilities are no longer sufficient because of the amount or complexity of the data and if I’m not able to scale these capabilities using traditional approaches (more RAM, more servers in my clusters, more disks etc.). It’s not difficult (and not expensive) to store Petabytes of data. It’s difficult to process that data, to do analytics on it and to gain insights.

So, to be honest, even a few hundred million rows of data may be big if I’m not able to perform the important analytics that I need to supply my core business processes in a timely manner. And there are two things to keep in mind:

  • Modern ideas of modeling markets and complex statistical models and methods are available.
  • While it may be difficult to apply these methods, maybe our competitors already do.

Another aspect of analytical capabilities is the question “Do I have the right data or do I need other data sources?”. Limits in analytical capabilities may also exist because I don’t have the information I would need. In today’s world with lots of data markets (like Microsoft Azure Datamarket, http://datamarket.azure.com/), it’s reality that you can get information/data that you might not even have dared to dream about a few years ago. Now you get data about consumer trends, your competitors or global trends, and you get this data in a reliable, accurate and up-to-date way. Again, this increases the amount of data you need to process and, by that, may worsen the analytical restrictions that result from the sheer amount of data.

But then, this still is nothing new. As I mentioned before, IT has had to follow these requirements in each of the past years. We added more cores and bought newer machines. The database algorithms improved, and we used OLAP and other technologies to speed up analysis. But let me get back to the second half of the definition from above: “it becomes difficult to process using on-hand database management tools or traditional data processing applications”. If you like, you may replace “difficult” with “makes it more expensive” or “costs more effort” or, in some cases, “makes it impossible, at least for the required period of time”. For example, if you want to calculate a complex price elasticity model in retail and it takes you a month to do so, the result will not be useful anymore as the situation in your market might have already changed significantly (for example because of your competitors’ campaigns).

Again, this is not really new. During the past you may have added other components in your IT strategy, for example OLAP. Or you have replaced a slow database solution with a faster one. And you focused on scalability in order to cope with these challenges. So, you might look at some typical components of a big data environment in just the same way. Here are some of them (be careful, buzzword mode is switched on now):

  • Hadoop, Hive, Pig etc.
  • In-Memory Computing
  • Massive parallel computing (MPP).
  • Cloud
  • Complex event processing (CEP)
  • Etc.

[buzzword mode off] However, if you think of these components in the traditional way of enhancing the IT infrastructure, you might be thinking in the wrong direction. The main thing about big data is that when you get to the limits of your analytical capabilities (as in the definition from above), there are almost always tools and methods to get beyond those limits. However, these tools may require some fundamental changes in the IT ecosystem. As for MPP databases, for example, it’s not enough to get one and put all the data on it; it is about re-shaping the BI architecture in order to match the new paradigm of those systems.

image

In the next posts, I’ll go a little deeper into this topic: the fundamental changes in the Big Data architecture, and especially MPP databases.