PDW 2012 | SQL Server 2012 | SQL Server 2014

If your source data contains a subsequent number without gaps it’s relatively easy to find out if data rows are missing. The approach I’m showing here uses window functions that are available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.

In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.

SELECT

SalesOrderNumber,

convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber

FROM [FactInternetSales]

WHERE [SalesOrderLineNumber]=1

ORDER BY SalesOrderNumber

Usually the order number is sequentially but we find some gaps here. For example, the order following order number SO43842 is SO43918, so there are 43918 – 43842 – 1 = 75 rows missing.

Using window functions and a sub query, we can add the next number as a column to the query together with the distance:

select *, NextSalesOrderIntNumber-SalesOrderIntNumber-1 MissingRows

from

(

select

SalesOrderIntNumber,

lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber

from

(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))

SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1

) TransactionData

) TransactionDataSequence

As you can see, the 75 missing rows are now being reported correctly by the query. The only task left to do now is to aggregate the amount of missing rows by replacing the outer query like this:

select**Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows**

from

(

select

SalesOrderIntNumber,

lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber

from

(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))

SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1

) TransactionData

) TransactionDataSequence

As a quality measure you could show the ratio of the missing rows to the total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. And it’s also useful to get the result on more granular level, for example per month. Here is the full query:

select

orderdatekey/100 [Month],

Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows,

convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*)

MissingRowsRatio,

convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*)

* Sum([ExtendedAmount]) MissingRowsEstimatedValue

from

(

select

SalesOrderIntNumber,

lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,

[ExtendedAmount], OrderDateKey

from

(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))

SalesOrderIntNumber, [ExtendedAmount], OrderDateKey

FROM [FactInternetSales] where [SalesOrderLineNumber]=1

) TransactionData

) TransactionDataSequence

group by orderdatekey/100

order by orderdatekey/100

Plotting the result over the time gives a good overview. For my example data, quality improved a lot since August 2007.

Conclusion: This is another example how window functions provide an elegant solution for solving analytical data tasks. And since this works perfectly on a PDW, the approach works well even with billions of rows of data.

## No comments:

## Post a Comment