Search This Blog

Loading...

Sunday, March 16, 2014

Converting events to hourly based aggregations

PDW 2012 | SQL Server 2012

For today’s post I didn’t find a really good title. Here’s what this post is about: Sometimes you’ll find event based data in your source system (something happens at a specific point in time) but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions.

There are some approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let’s assume you’re working for a retailer who is interested in the number of cash desks being open at a time. A subset of the event data (single day, single store) may look like this:

t1

In this case cash desk 2 opened at 06:35, then desk 1 opens at 8:27, then desk 2 closes at 11:58 and so on. The question is, how many cash desks are open for example from 08:00 to 09:00? If a cash desk is only open for half an our in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks where open (desk 2 for the full hour and desk 1 for half of an hour).

In order to get the number of cash desk being open, we first convert the transaction type to a delta: +1 means a desk opens, –1 means a desk closes. Here is the query together with the result:

select
  TransactionDate
  , StoreID
  , TransactionTime
  , CashdeskID
  , case TransactionType
    when 'signon' then 1
    when 'signoff' then -1
  else 0
  end CashdeskDelta
from (
  select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType
  from POSData where TransactionType in ('signon','signoff')) P
order by 1,2,3

The result (again a subset) may look like this:

t2

After transforming the transaction type to a numeric value, we can aggregate it using a window function. Therefore I’m using the query from above as a sub query:

select *,
sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,
datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta
from
(
-- query from above --
) CheckoutDetails
order by 1,2,3

Again, this shows the power of the window functions. The query gives us the number of open cash desks together with the number of seconds to the next event.

t3

For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution but we still have no hour based time frame in the data. However, this can easily be created by cross joining the dimensions for store and time. For my example, I have no store or time dimension (as you should usually have), so I’m using the table sys.all_objects here to generate a sufficient number of data rows:

with
Timeslot AS (
select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
(select top 24 convert(time,dateadd(hour,row_number () over (order by [name])-1,'00:00:00')) TimeGrid from sys.all_objects) T1
cross join
(select distinct TransactionDate, StoreID from POSData) T2
)

t4

The query creates one row per hour for each store and each date. Again, usually you would use your existing dimension tables instead of the sys.all_objects table here.

Now, let’s bring both queries together:

with

Timeslot AS
(
.. – timeslot query from above
),

CashDeskTimesDetails as
(

select *,
sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])
CashdesksOpen
,sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])*
TimeDeltaSeconds CashdeskOpenSeconds
,convert(time, dateadd(hour, datediff(hour, 0, TransactionTime),0)) TransactionHour
from
(
select
TransactionDate
, StoreID
, TransactionTime
, coalesce(
datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by
TransactionDate, StoreID order by [TransactionTime]))
,
datediff(s,[TransactionTime],dateadd(day,1,0)) -- fill seconds to end of day
)
TimeDeltaSeconds
, CashdeskID
, case TransactionType
when 'signon' then 1
when 'signoff' then -1
else 0
end CashdeskDelta
from (
select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType from
POSData where TransactionType in ('signon','signoff')
union all
select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot
) P
) CheckoutDetails
)
select * from CashDeskTimesDetails
order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).

t5

Some things to pay special attention to:

  • The inserted timeslots are created with a Cashdesk delta value 0, so they do not modify the number of open/closed desks (column CashdeskOpen)
  • In blue: the last time slot has no subsequent timeslot. Therefore the window function returns null. Here we override this with the number of seconds until day’s end.
  • In red: We add the base hour to each row. This will be used for a group-by operation in the following step

Finally, we simply need to aggregate the last query result:

select 
TransactionDate,TransactionHour,StoreID,
Sum(convert(float,CashdeskOpenSeconds)) / Sum(convert(float,TimeDeltaSeconds)) CashdeskCount
from CashDeskTimesDetails
where TimeDeltaSeconds<>0
group by TransactionDate,TransactionHour,StoreID
order by 1,2,3

 

Here is the final result for the sample data subset from above:

t6

Removing all filters (all dates, all stores) may result in a graph like this:

image

So this post showed how to transform event based data to a fixed time scale (hours in this case) to match a star schema join to the time dimension. Since we only used SQL this process can be easily used in an ELT loading scenario.

1 comment: