Search This Blog

Saturday, January 4, 2014

T-SQL 2012 Window Functions in ELT (part 2)

SQL Server 2012

In my last post I explained some of the capabilities of the SQL Server 2012 window functions. Window function allow us to do computations involving other rows in a result set. Much like in my last post, I’m using the same sensor source for some more demonstrations of window functions and how they can help us in data transformation. Here is a sample of my sensor data:


As you can see, the sensors deliver their data approximately every 15 minutes. But imagine you need the data exactly in a 15 minutes interval with data being interpolated based on their distance to the 15 minutes slot. Again, ETL (a script task for example) or cursor come into ones mind but let’s assume we want to do the computation in T-SQL. How could we do that.

Here comes a simple approach for such kinds of problems:

  • Perform a union all of the existing data with a fixed interval time stamp data.
  • Sort by timestamp and compute time difference to the row before/after
  • Perform the interpolation

For the time stamp data I use a table like the following one, which contains the true 15 minutes intervals:


The union-operation brings both tables together:

select SensorID
       , [DateTime]
       , Temperature
       , Throughput
       , FailCount
       , 'Orig' as [Type]
from [dbo].[SensorData]
union all
  select distinct SD.SensorID, T.[Time], null,null,null, 'Gen' from [dbo].[SensorData] SD
  cross join [dbo].[Time] T
order by SensorID, [DateTime]

I included an information if the data is from the original data set (‘Orig’) or from the generated time stamps (‘Gen’). The result looks like this:


For a more readable query, I load this into a temporary table (#tmp) by wrapping the statement from above with a

select * into #tmp
from ( << query from above >>) Mix

Next step is to compute the previous and next values for the generated rows of data. I load this in a temporary table (#tmp2) too and also compute the time difference to the previous and next row in seconds:

    , LAG([DateTime],1) over (partition by sensorid order by [DateTime]) TimeLag
    , LEAD([DateTime],1) over (partition by sensorid order by [DateTime]) TimeLead

    , coalesce(DATEDIFF(second, LAG([DateTime],1) over (partition by sensorid order by [DateTime]), [DateTime]),0) SecondsLag
    , coalesce(DATEDIFF(second, [DateTime], LEAD([DateTime],1) over (partition by sensorid order by [DateTime])),0) SecondsLead

    , LAG([Temperature],1) over (partition by sensorid order by [DateTime]) TemperatureLag
    , LEAD([Temperature],1) over (partition by sensorid order by [DateTime]) TemperatureLead
    , LAG([Throughput],1) over (partition by sensorid order by [DateTime]) ThroughputLag
    , LEAD([Throughput],1) over (partition by sensorid order by [DateTime]) ThroughputLead
    , LAG([FailCount],1) over (partition by sensorid order by [DateTime]) FailCountLag
    , LEAD([FailCount],1) over (partition by sensorid order by [DateTime]) FailCountLead
into #Tmp2
from #Tmp

The most important columns are shown here (only for the temperature):



Next task is to do the interpolation. I’m using a simple linear blending based on the time difference by updating the records of Type ‘Gen’ appropriately. On a PDW, this would not be a recommended operation. You would rather CTAS (create table as select, similar to select into) the result to a result table:

update #Tmp2
    Temperature =
            when TemperatureLag IS NULL then TemperatureLead
            when TemperatureLead IS NULL then TemperatureLag
            else (TemperatureLag*SecondsLead+TemperatureLead*SecondsLag) / (SecondsLag+SecondsLead)

    , ThroughPut =
            when ThroughputLag IS NULL then ThroughputLead
            when ThroughputLead IS NULL then ThroughputLag
            else (ThroughputLag*SecondsLead+ThroughputLead*SecondsLag) / (SecondsLag+SecondsLead)
    , FailCount =
        iif (coalesce(FailCountLag,0)>coalesce(FailCountLead,0),coalesce(FailCountLag,0), coalesce(FailCountLead,0))

    from #Tmp2
where [Type]='Gen'

The result now shows the sensor data perfectly aligned and interpolated on a 15 minutes interval:


No comments:

Post a Comment