
Sunday, March 23, 2014

Reading recommendation: “3 Ways to Test the Accuracy of Your Predictive Models”

Data Mining

Imagine being in a data mining project where everyone is enthusiastic about finding hidden patterns and doing some data magic. You set up and train your model, and it produces some results (for example, scores for the test dataset). At that point you might feel a little lost. Are the results sensible and reliable? Did I really choose a good approach and the right model and algorithm for the problem? The fact is that most mining tools will give you “results” even if your modeling approach has severe flaws and the results simply have no meaning. So how can you be sure that your model works fine?

The situation reminds me a little of Douglas Adams’ Hitchhiker’s Guide to the Galaxy: you finally got an answer, but you are no longer sure what it means. What options do you have to assure yourself (and your stakeholders) that business decisions can be based upon the results of that mining model?

Victoria Garment from Software Advice, a website that researches business intelligence systems, gathered methods used by data mining professionals Karl Rexer, Dean Abbott and John Elder to test and validate the accuracy of a mining model. You can find the full report here:

http://plotting-success.softwareadvice.com/3-predictive-model-accuracy-tests-0114/

What makes this article a must-read for me is that it not only covers methods for accuracy testing (lift charts, decile tables, target shuffling, bootstrap sampling, cross validation) but also contains many practical examples and touches on various side aspects that are important for data mining: for example, John Elder talking about recognizing false patterns like the Redskins Rule, or Dean Abbott showing how easily models can be overfit and what methods you have to correct for it. I especially like one quotation by John Elder: “Statistics is not persuasive to most people—it’s just too complex”. This matches my own experience: it is important not only to design a good prediction model but also to assure the business decision makers that they can trust the results and base their decisions on them. Target shuffling, as described in the article, may be one promising approach to make business people trust the results of a predictive model without requiring them to be masters of statistics.

Again, the full article can be found here:

http://plotting-success.softwareadvice.com/3-predictive-model-accuracy-tests-0114/

Sunday, March 16, 2014

Converting events to hourly based aggregations

PDW 2012 | SQL Server 2012

For today’s post I didn’t find a really good title. Here’s what it is about: sometimes you’ll find event-based data in your source system (something happens at a specific point in time), but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions.

There are several approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let’s assume you’re working for a retailer who is interested in the number of cash desks that are open at any given time. A subset of the event data (single day, single store) may look like this:

[Table t1: sample of the cash desk sign-on/sign-off events (single day, single store)]

In this case cash desk 2 opened at 06:35, then desk 1 opened at 8:27, then desk 2 closed at 11:58, and so on. The question is: how many cash desks are open, for example, from 08:00 to 09:00? If a cash desk is only open for half an hour in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks were open (desk 2 for the full hour and desk 1 for about half an hour).
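The post doesn’t show the table definition, so if you want to try the queries yourself, here is a minimal sketch of what POSData could look like, with sample rows that match the narrative above (the date, the exact seconds and the closing time of desk 1 are made up for illustration). On a PDW 2012 the CREATE TABLE would additionally need a distribution option; this sketch targets a plain SQL Server test environment.

-- Hypothetical minimal POSData table matching the column names used in the queries below.
-- On PDW 2012 a CREATE TABLE additionally needs a WITH (DISTRIBUTION = ...) clause;
-- this sketch is meant for a plain SQL Server test environment.
create table POSData (
    TransactionDate  date        not null,
    TransactionTime  time(0)     not null,
    StoreID          int         not null,
    CashdeskID       int         not null,
    TransactionType  varchar(20) not null    -- 'signon', 'signoff', 'sale', ...
);

-- Sample rows corresponding to the narrative above (date and exact seconds are made up).
insert into POSData (TransactionDate, TransactionTime, StoreID, CashdeskID, TransactionType) values
    ('2014-03-01', '06:35:00', 1, 2, 'signon'),
    ('2014-03-01', '08:27:38', 1, 1, 'signon'),
    ('2014-03-01', '11:58:00', 1, 2, 'signoff'),
    ('2014-03-01', '17:10:00', 1, 1, 'signoff');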

In order to get the number of open cash desks, we first convert the transaction type to a delta: +1 means a desk opens, –1 means a desk closes. Here is the query together with the result:

-- convert sign-on/sign-off events into a +1/-1 delta per cash desk
select
  TransactionDate
  , StoreID
  , TransactionTime
  , CashdeskID
  , case TransactionType
      when 'signon'  then 1
      when 'signoff' then -1
      else 0
    end CashdeskDelta
from (
  select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType
  from POSData where TransactionType in ('signon','signoff')) P
order by 1,2,3

The result (again a subset) may look like this:

[Table t2: query result with the CashdeskDelta column]

After transforming the transaction type to a numeric value, we can aggregate it using a window function. To do so, I’m using the query from above as a subquery:

select *,
sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,
datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta
from
(
-- query from above --
) CheckoutDetails
order by 1,2,3

Again, this shows the power of window functions. The query gives us the number of open cash desks together with the number of seconds until the next event.

[Table t3: query result with the OpenCashdesks and TimeDelta columns]

For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution, but we still have no hour-based time frame in the data. However, this can easily be created by cross joining the dimensions for store and time. For my example, I have no store or time dimension (as you usually would have), so I’m using the table sys.all_objects here to generate a sufficient number of rows:

with
Timeslot AS (
  select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
    (select top 24 convert(time, dateadd(hour, row_number() over (order by [name]) - 1, '00:00:00')) TimeGrid
     from sys.all_objects) T1
  cross join
    (select distinct TransactionDate, StoreID from POSData) T2
)
select * from Timeslot
order by 1,2,3

[Table t4: generated hourly time slots per date and store]

The query creates one row per hour for each store and each date. Again, you would usually use your existing dimension tables here instead of sys.all_objects.
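Just to illustrate what that could look like, here is a hedged sketch of the same Timeslot CTE built from dimension tables. DimDate, DimStore and DimHour (and their columns CalendarDate, StoreID, HourStart) are assumed names, not tables from the original example.

-- Hypothetical variant of the Timeslot CTE using dimension tables instead of sys.all_objects.
-- DimDate (one row per calendar day), DimStore (one row per store) and DimHour
-- (24 rows with a time(0) column HourStart) are assumed tables.
with Timeslot as (
    select d.CalendarDate as TransactionDate,
           s.StoreID,
           h.HourStart    as TimeGrid
    from DimDate d
    cross join DimStore s
    cross join DimHour h
    where d.CalendarDate in (select distinct TransactionDate from POSData)  -- only days with data
)
select * from Timeslot
order by 1, 2, 3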

Now, let’s bring both queries together:

with
Timeslot AS
(
  -- timeslot query from above --
),
CashDeskTimesDetails as
(
  select *,
    sum(CashdeskDelta) over (partition by TransactionDate, StoreID
                             order by [TransactionTime]) CashdesksOpen
    , sum(CashdeskDelta) over (partition by TransactionDate, StoreID
                               order by [TransactionTime]) * TimeDeltaSeconds CashdeskOpenSeconds
    , convert(time, dateadd(hour, datediff(hour, 0, TransactionTime), 0)) TransactionHour
  from
  (
    select
      TransactionDate
      , StoreID
      , TransactionTime
      , coalesce(
          datediff(s, [TransactionTime],
                   lead([TransactionTime], 1) over (partition by TransactionDate, StoreID
                                                    order by [TransactionTime]))
          , datediff(s, [TransactionTime], dateadd(day, 1, 0)) -- fill seconds to end of day
        ) TimeDeltaSeconds
      , CashdeskID
      , case TransactionType
          when 'signon'  then 1
          when 'signoff' then -1
          else 0
        end CashdeskDelta
    from (
      select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType
      from POSData where TransactionType in ('signon','signoff')
      union all
      select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot
    ) P
  ) CheckoutDetails
)
select * from CashDeskTimesDetails
order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).

[Table t5: combined result of the original events and the hourly time slots]

Some things to pay special attention to:

  • The inserted timeslots are created with a CashdeskDelta value of 0, so they do not modify the number of open/closed desks (column CashdesksOpen).
  • In blue in the screenshot: the last time slot of a day has no subsequent timeslot, so the window function returns null. Here we override the null with the number of seconds until the day's end.
  • In red in the screenshot: we add the base hour (TransactionHour) to each row. This will be used for the group-by operation in the following step.

Finally, we simply need to aggregate the last query result:

select 
TransactionDate,TransactionHour,StoreID,
Sum(convert(float,CashdeskOpenSeconds)) / Sum(convert(float,TimeDeltaSeconds)) CashdeskCount
from CashDeskTimesDetails
where TimeDeltaSeconds<>0
group by TransactionDate,TransactionHour,StoreID
order by 1,2,3

 

Here is the final result for the sample data subset from above:

[Table t6: cash desk count per hour for the sample data]
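As a quick plausibility check for the 08:00 hour of the sample data (assuming desk 1 signed on at 08:27:38, which is consistent with the 12622 seconds mentioned above): one desk was open for the 1658 seconds from 08:00 to 08:27:38 and two desks for the remaining 1942 seconds until 09:00, so the query returns (1 × 1658 + 2 × 1942) / 3600 ≈ 1.54 cash desks, matching the “approximately 1.5” estimate from the beginning of the post.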

Removing all filters (all dates, all stores) may result in a graph like this:

[Chart: cash desk count per hour over all dates and stores]

So this post showed how to transform event-based data to a fixed time scale (hours in this case) so that it can be joined to the time dimension of a star schema. Since we only used SQL, this process can easily be used in an ELT loading scenario.
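To illustrate the star-schema join mentioned in the conclusion, here is a hedged sketch. It assumes the hourly result has been materialized as a table CashdeskCountPerHour and that DimDate and DimHour dimension tables with the listed columns exist; all of these names are assumptions, not part of the original solution.

-- Hypothetical star-schema lookup of the surrogate keys for the hourly result.
-- CashdeskCountPerHour, DimDate and DimHour (and their columns) are assumed names.
select d.DateKey,
       h.HourKey,
       f.StoreID,
       f.CashdeskCount
from CashdeskCountPerHour f
join DimDate d on d.CalendarDate = f.TransactionDate
join DimHour h on h.HourStart    = f.TransactionHour;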

Sunday, March 2, 2014

ETL or ELT… or both??

PDW 2012 | SQL Server 2005-2014

With database servers getting more and more powerful, some traditional concepts of Business Intelligence solutions may be reconsidered. One of those concepts, for which you can find a lot of lively debate recently, is the question of whether to use ETL or ELT.

Here are just a few of the blog posts and discussions you can find on this topic:

In short, the main differences are summarized below:

ETL (Extract-Transform-Load)

Characteristics
  • The transformation is done in the ETL tool (data flow pipeline); only the finally prepared data is loaded into the data warehouse database.

Pros
  • Well-established ETL tools are available, with debugging, built-in logging, configuration, error handling, process reporting and statistics.
  • A comprehensive and easy-to-maintain data flow makes it easy to merge data from different sources, use data quality tools and deal with error rows individually.

Cons
  • ETL pipeline tools support multiple cores, but parallel IO has to be solved programmatically: you have to do something yourself (for example use the Balanced Data Distributor SSIS component or merge multiple inputs from the same source table).
  • ETL tools are built for row-based operations. Operations that need to be performed on a set of rows (like sort, aggregate or calculations covering multiple rows) are harder to solve. I wrote some posts recently about ELT calculations that are relatively difficult to solve in ETL.

ELT (Extract-Load-Transform)

Characteristics
  • The original data is loaded into the database first; then SQL is used to transform the data into the dimensional model.

Pros
  • ELT fully benefits from the power of the database, its query optimizer and so on.
  • Especially for MPP environments (like the PDW): scaling the database also scales ELT process performance.
  • SQL code is easier to maintain in source control systems (like TFS) than ETL packages (complex XML).

Cons
  • SQL is harder to read, to structure and to document compared to ETL packages.
  • You need discipline, as minor mistakes may lead to errors that are hard to track down, e.g. too many resulting rows from a join operation if a key column is missing in the join condition (see the small example below).
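To illustrate that last point, here is a small hypothetical example of the “missing join key” pitfall. StageSales and DimProduct are made-up tables, and the product business key is assumed to be (SourceSystem, ProductNo).

-- Hypothetical example of the "missing join key" pitfall (all names are made up).
-- The product business key is assumed to be (SourceSystem, ProductNo).
select s.ReceiptNo,
       s.SalesAmount,
       p.ProductKey
from StageSales s
join DimProduct p
  on p.ProductNo = s.ProductNo;
  -- BUG: the condition "and p.SourceSystem = s.SourceSystem" is missing, so every product
  -- that exists in more than one source system multiplies the fact rows in the result.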

This comparison is far from complete, and if you read the links above (and many others that target this topic) you can find a lot more pros, cons and opinions. In fact, I don’t want to say that one is better than the other. But here is what we recently found to work well in a large project using the Parallel Data Warehouse (PDW) for an initially 30 TB (and growing) database. The following illustration, which I recently used at the SQL Conference 2014 in Germany, shows the key concepts:

[Illustration: key concepts of the combined ETL/ELT load approach on the PDW]

 

We’re using ETL (Microsoft SQL Server Integration Services) to

  • Orchestrate the load process
    • workflow management (making sure the right things happen in the right order)
    • dealing with technical challenges (e.g. temporary tables, partition switching on the PDW)
    • implementing configuration management (for example server and database names)
    • logging and process monitoring (reports)
  • Load the dimensions (small amounts of data)
    • collecting master data from the source systems, merging and preparing this data
    • generating surrogate keys for the dimensions
    • keeping track of historical changes (modeled as SCD2 or intermediate bridge tables)
    • building up all dimensions and transferring the data to the PDW (using a reload operation)
    • handling early-arriving facts (creating missing dimension rows; distinct counts run on the PDW)

Why?

  • Integration Services (SSIS) is well suited for these tasks
  • The SMP SQL Server offers good support for dimension processing tasks (identity columns, the T-SQL MERGE statement etc.; a minimal MERGE sketch follows below)
  • Additional services like SQL Server Data Quality Services (DQS) and SQL Server Master Data Services (MDS) are currently not supported on the PDW.
    The same is true for more sophisticated tasks and for the use of web services, for example to find duplicate customers, to correct misspelled street names or to guess the gender from the first name. Also, if you need to use custom assemblies, for example to access special source systems or to include specific calculations, ETL tools are the better choice.
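As a small illustration of the dimension processing support mentioned above, here is a minimal sketch of an SCD1-style upsert with the T-SQL MERGE statement on the SMP SQL Server. DimCustomer and StageCustomer (and their columns) are made-up names, and the real project would handle SCD2 history differently.

-- Minimal sketch of a MERGE-based dimension load on the SMP SQL Server
-- (DimCustomer / StageCustomer and their columns are made-up names; SCD1-style update shown).
merge DimCustomer as tgt
using StageCustomer as src
    on tgt.CustomerNo = src.CustomerNo              -- business key
when matched and (tgt.CustomerName <> src.CustomerName or tgt.City <> src.City) then
    update set tgt.CustomerName = src.CustomerName,
               tgt.City         = src.City
when not matched by target then
    insert (CustomerNo, CustomerName, City)         -- the surrogate key is an identity column
    values (src.CustomerNo, src.CustomerName, src.City);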

 

Then, we’re using ELT (distributed SQL, DSQL) on the Parallel Data Warehouse to

  • Process the fact data (large amounts of data) after it is bulk loaded, without modifications, into a staging database on the PDW
    • data preparation (for example removing duplicate rows)
    • linking the fact table data to the dimensions
    • performing calculations (using SQL window functions intensively)
  • Merge new data into the archive
    • storing the data in the persisted stage area without creating duplicates if the data was already there (a small sketch of the deduplication and merge pattern follows below)
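To give an idea of what two of these ELT steps can look like in plain SQL, here is a hedged sketch: duplicate rows are removed with a window function, and only rows not yet present are appended to the persisted stage area. StageSales, PersistedSales and the key columns are made-up names; on the PDW the same pattern would typically be implemented with CTAS statements rather than a plain INSERT.

-- Hypothetical sketch of two ELT steps on the staging data
-- (StageSales / PersistedSales and the key columns are made-up names).

-- 1) Remove duplicate rows: keep one row per business key using a window function
with Deduplicated as (
    select *,
           row_number() over (partition by ReceiptNo, ReceiptLineNo
                              order by LoadTimestamp desc) as rn
    from StageSales
)
insert into PersistedSales (ReceiptNo, ReceiptLineNo, SalesAmount, LoadTimestamp)
select d.ReceiptNo, d.ReceiptLineNo, d.SalesAmount, d.LoadTimestamp
from Deduplicated d
where d.rn = 1
  -- 2) ... and only append rows that are not already in the persisted stage area
  and not exists (select 1
                  from PersistedSales p
                  where p.ReceiptNo     = d.ReceiptNo
                    and p.ReceiptLineNo = d.ReceiptLineNo);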

Why?

  • Much better performance observed compared to the SMP SQL Server/SSIS
    • in our case usually about 10-20 times faster, depending on the source data and the transformations
    • in some cases (for example removing duplicate rows in the source data) even 100 times faster
  • Faster loads allow us to fully reload many terabytes if needed (this gives more options for backup strategies and for the dimensional modeling)
  • The solution will directly benefit from future MPP scaling without any coding changes

 

CONCLUSION

ETL and ELT may work well together. In this scenario we did the dimension processing as well as the full workflow management using ETL tools (SSIS on an SMP SQL Server) and the processing of the large transactional tables using ELT (distributed SQL on PDW).