Rows and Range, Preceding and Following

SQL Server 2012 adds many new features to Transact-SQL (T-SQL).  One of my favorites is the ROWS/RANGE enhancements to the OVER clause.  These enhancements are often referred to as the windowing functions.

Overview:
ROWS PRECEDING, FOLLOWING, UNBOUNDED refers to the current row plus a number of physical rows before or after it, depending on whether PRECEDING or FOLLOWING is specified.
RANGE PRECEDING, FOLLOWING, UNBOUNDED means the current value and all values before or after it, including any ties on the ordering value.

An example of a need for ROWS PRECEDING would be calculating a trailing average that looks at the current year and the three previous years.

Terminology:
ROWS or RANGE – whether the window frame is measured in physical rows or in a range of values.
PRECEDING – include rows before the current one.
FOLLOWING – include rows after the current one.
UNBOUNDED – when used with PRECEDING or FOLLOWING, includes all rows before or after the current one.
CURRENT ROW – the row currently being evaluated.
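Before we create the demo database, here is a minimal sketch of the frame syntax; the table and column names in it are placeholders of my own, not objects used later in this post.

-- A trailing average: the current row plus the three rows before it,
-- computed separately within each partition (placeholder names).
SELECT SomeKey, SomeDate, SomeValue,
 avg(SomeValue) OVER (PARTITION by SomeKey
 ORDER BY SomeDate
 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as TrailingAvg
FROM SomeTable;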

To start out we need a database to work on, so we will create a tsql2012 database.  You can use your own database if you wish.

CREATE DATABASE [tsql2012];
GO
USE [tsql2012];

Next we create a table to use for the OVER clause enhancements of ROWS and RANGE, PRECEDING and FOLLOWING.

-- Table to be used by Over Clause Rows/Range

CREATE TABLE REVENUE
(
[DepartmentID] int,
[Revenue] int,
[Year] int
);

insert into REVENUE
values (1,10030,1998),(2,20000,1998),(3,40000,1998),
 (1,20000,1999),(2,60000,1999),(3,50000,1999),
 (1,40000,2000),(2,40000,2000),(3,60000,2000),
 (1,30000,2001),(2,30000,2001),(3,70000,2001),
 (1,90000,2002),(2,20000,2002),(3,80000,2002),
 (1,10300,2003),(2,1000,2003), (3,90000,2003),
 (1,10000,2004),(2,10000,2004),(3,10000,2004),
 (1,20000,2005),(2,20000,2005),(3,20000,2005),
 (1,40000,2006),(2,30000,2006),(3,30000,2006),
 (1,70000,2007),(2,40000,2007),(3,40000,2007),
 (1,50000,2008),(2,50000,2008),(3,50000,2008),
 (1,20000,2009),(2,60000,2009),(3,60000,2009),
 (1,30000,2010),(2,70000,2010),(3,70000,2010),
 (1,80000,2011),(2,80000,2011),(3,80000,2011),
 (1,10000,2012),(2,90000,2012),(3,90000,2012);

Take a look at the REVENUE table to see what's there before we start the demo.


USE [tsql2012];

-- first lets look at the REVENUE table

SELECT *
 FROM Revenue;

(Screenshot: contents of the REVENUE table.)

Then a quick review of sum and avg.

-- first simple sum and avg aggregates
SELECT sum(Revenue) as TotalRevenue,
 avg(Revenue) as AverageRevenue,
 count(*) as NumRows
 FROM Revenue;


OVER Clause Pre SQL Server 2012

Even before SQL Server 2012, the OVER clause was really handy for performing aggregates over a different range than your standard grouping.

--First OVER Clause pre SQL 2012
SELECT *,
 avg(Revenue) OVER (PARTITION by DepartmentID) as AverageDeptRevenue,
 sum(Revenue) OVER (PARTITION by DepartmentID) as TotalDeptRevenue
FROM REVENUE
ORDER BY departmentID, year;


ROWS PRECEDING

Now on to the new features.  ROWS PRECEDING specifies that the aggregate function in the OVER clause will consider the current row and a specific number of rows before it within the current partition.


--ROWS PRECEDING
-- look at the sum of revenue over a trailing 3 year period
SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as CurrentAndPrev3
FROM REVENUE
ORDER BY departmentID, year;


ROWS FOLLOWING

Now rows following… The ROWS FOLLOWING option specifies a number of rows in the current partition to use after the current row.

-- ROWS FOLLOWING
SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) as CurrentAndNext3
FROM REVENUE
ORDER BY departmentID, year;


Both PRECEDING and FOLLOWING

Using both ROWS PRECEDING and ROWS FOLLOWING allows you to do things like calculate a moving aggregate that includes the current year and the years immediately before and after it.


--ROWS PRECEDING AND FOLLOWING

SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as BeforeAndAfter
FROM REVENUE
ORDER BY departmentID, year;


UNBOUNDED

UNBOUNDED PRECEDING tells the windowing function and aggregates to use the current row and all rows in the partition before it.

-- ROWS UNBOUNDED PRECEDING
SELECT Year, DepartmentID, Revenue,
 min(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS UNBOUNDED PRECEDING) as MinRevenueToDate
FROM REVENUE
ORDER BY departmentID, year;


UNBOUNDED FOLLOWING, used as follows, means use the current row and consider all rows after it in the current partition.

-- ROWS UNBOUNDED FOLLOWING
-- http://stevestedman.com/?p=1485
SELECT Year, DepartmentID, Revenue,
 min(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as MinRevenueBeyond
FROM REVENUE
ORDER BY departmentID, year;


ROWS vs RANGE

ROWS vs RANGE appears to cause much confusion.

ROWS means exactly the rows specified, and RANGE refers to those same rows plus any other rows that tie with them on the ORDER BY value.

Here is an example of ROWS and RANGE producing exactly the same results, because no department has duplicate years.  In this case ROWS and RANGE are identical.

-- ROWS vs RANGE UNBOUNDED PRECEDING

SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS UNBOUNDED PRECEDING) as RowsCumulative,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 RANGE UNBOUNDED PRECEDING) as RangeCumulative
FROM REVENUE
WHERE year between 2003 and 2008
ORDER BY departmentID, year;


Now consider the same ROWS vs RANGE query with duplicate values.  Here we insert a duplicate set of values for 2005, and the results below show that for the year 2005 there are two rows in the result set, and the RowsCumulative column is different from the RangeCumulative column.  ROWS counts only the specific rows in the aggregate calculation; RANGE counts the current row plus the other 2005 rows for that department ID.

-- INSERT A DUPLICATE VALUE FOR RANGE UNBOUNDED PRECEDING

INSERT INTO REVENUE
VALUES (1,10000,2005),(2,20000,2005),(3,30000,2005);

-- same query as above
SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 ROWS UNBOUNDED PRECEDING) as RowsCumulative,
 sum(Revenue) OVER (PARTITION by DepartmentID
 ORDER BY [YEAR]
 RANGE UNBOUNDED PRECEDING) as RangeCumulative
FROM REVENUE
WHERE year between 2003 and 2008
ORDER BY departmentID, year;


Summary

Among all the great new features in SQL Server 2012, the windowing enhancements to the OVER clause, ROWS and RANGE with PRECEDING and FOLLOWING, are a great addition.
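As a parting sketch, the frames from this post can be combined in a single query against the same REVENUE table; the UNBOUNDED-in-both-directions frame here is a variation I'm adding for illustration, not one demoed above.

-- Several frame variations side by side within each department.
SELECT Year, DepartmentID, Revenue,
 sum(Revenue) OVER (PARTITION by DepartmentID ORDER BY [YEAR]
 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as TrailingSum,
 sum(Revenue) OVER (PARTITION by DepartmentID ORDER BY [YEAR]
 ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) as LeadingSum,
 sum(Revenue) OVER (PARTITION by DepartmentID ORDER BY [YEAR]
 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as WholeDeptTotal
FROM REVENUE
ORDER BY departmentID, year;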

Home after Seattle (Redmond) Code Camp 2012

What a great day at  Code Camp today.

Click here to download the slides and sample code from the presentations.

The first presentation that I attended was the Accelerating your Development Workflow presentation by Ian Davis.  What a great presentation; it touched on lean and agile development processes and was very well done.

The next one was a PowerShell presentation that was okay, then I presented the following three sessions.

What’s New in TSQL 2012

Unleashing Common Table Expressions in SQL Server

The CTE presentation is currently my most polished presentation; it was first presented at SQLSaturday in Redmond a few months back.

SQL Server Performance for Developers

I hope that everyone who attended had a great time, and learned something new!

Code Camp in Redmond Today

I am headed to Code Camp in Redmond WA today.  I ended up with 3 sessions that I will be presenting, and I am really looking forward to the day.

I am looking forward to attending the Accelerating your Development Workflow presentation by Ian Davis.

Here is what I am presenting…

What’s New in TSQL 2012

This is the first time that I will be presenting this one.  It's going to be a lot of fun working with the new features in TSQL 2012.

Here is the summary of what I will be covering:

Unleashing Common Table Expressions in SQL Server

This will be about the 5th time I have used this presentation.  Just some quick polish and it will be ready to go.  I recently presented this session at SQL Saturday in Redmond and SQL Saturday in Vancouver BC.  One new addition to this session is the performance comparison between the SQL Server 2012 OFFSET/FETCH functionality and the data paging capability of a CTE.

Here is the outline of the presentation:

SQL Server Performance for Developers

Probably one of the most valuable database sessions for developers, although until you have experienced bad performance with SQL Server it's a hard sell.

  • Is Tuning Necessary for the Developer
  • Understanding Tables and Indexes
  • Understanding Execution Plans
  • Statistics IO and TIME
  • Understanding Table Size
  • Procedure Cache and Parameterization
  • Seven Deadly Sins
    • Poor or No Database Design
    • Index Design Issues
    • RBAR instead of Sets
    • Not using explicit column lists
    • Calculations in the WHERE Clause
    • Dirty Reads
    • Believing Moore's Law Will Save You
  • Tips For Writing Queries
Code Camp is today in Redmond WA; I hope to see you there.

Sessions for Code Camp in 2 weeks

Here are the sessions that I will be presenting at Code Camp in two weeks.

What’s New in TSQL 2012

This is the first time that I will be presenting this one.  It's going to be a lot of fun working with the new features in TSQL 2012.

Unleashing Common Table Expressions in SQL Server

This will be about the 5th time I have used this presentation.  Just some quick polish and it will be ready to go.  I recently presented this session at SQL Saturday in Redmond and SQL Saturday in Vancouver BC.

SQL Server Performance for Developers

Probably one of the most valuable database sessions for developers, although it is hard to convince developers of that.

Code Camp will be on Saturday June 16th 2012 in Redmond WA.

Cumulative Distribution Function (CDF) – Analyzing the Roll of Dice with TSQL

After the last post on the Cumulative Distribution Function (CDF), known in TSQL as CUME_DIST(), I realized that although I showed how to use it, I didn't really explain what it means or when to use it.  That is where this example comes in.

First let's take an example that generates simulated dice rolls.  What are the odds that when you roll two six-sided dice the total will come up a 12 or a 2, compared to a 6, 7 or 8?  Let's look at an example.

There are two ways to do this.  First, I could use the Rand() function to generate numbers between 2 and 12.  Since Rand() is uniformly random, that would make the odds exactly the same for every number between 2 and 12, but that is not the way dice behave.
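To make that flawed first approach concrete, here is a quick sketch of it; this snippet is my own illustration, not part of the demo below.

-- Flawed: a single Rand() per roll, uniformly distributed from 2 to 12.
-- Every total would be equally likely, which is not how two dice behave.
DECLARE @maxTotal int = 12;
DECLARE @minTotal int = 2;
SELECT Cast(((@maxTotal + 1) - @minTotal) * Rand() + @minTotal As int) as TotalRoll;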

Instead, the right way to do it is to use the Rand() function twice to simulate two dice each rolling numbers between 1 and 6, then add the two together.  Here is how we do that in TSQL.


GO
-- assumes the analytics_demo database used earlier in this series exists
USE [analytics_demo];
GO
CREATE TABLE DiceRolls
(
 [FirstDie] int,
 [SecondDie] int
);

GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));

-- GO followed by a number repeats the batch above that many times
GO 1000

set nocount off;
SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls;

When you run this you will see that the results of the final SELECT statement are numbers between 2 and 12.

The first 30 or so results visible in the output are not a statistically significant number of rolls for determining the probability of each roll combination.  But the table contains 1,000 rows, so let's take a look at it with CUME_DIST().


SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t
order by TotalRoll;

This produces output telling you that the odds of rolling a 2 are 0.018, or just short of 2%.  The odds of rolling a 3 or less are 7%, but the odds of it being exactly a 3 are 7% - 2% = 5%.  So if you were betting on the roll of the dice, you could see that it is more than twice as likely to roll a 3 as a 2.

Next we take the results and drop them into Excel to create a chart, where the steeper the line, the more likely that roll is, and the flatter the line, the less likely.

From this you can see that the numbers in the middle of the range (6 to 8) are more likely to be rolled than those at the ends of the range (2 and 12).  The reason is that when rolling two six-sided dice there is only one combination that produces a 2, two combinations that produce a 3, and even more combinations that produce a 7, as the list below shows.

2:   1+1

3:  1+2 or 2+1

4:  1+3 or 2+2 or 3+1

5: 1+4 or 2+3 or 3+2 or 4+1

6: 1+5 or 2+4 or 3+3 or 4+2 or 5+1

7: 1+6 or 2+5 or 3+4 or 4+3 or 5+2 or 6+1

8: 2+6 or 3+5 or 4+4 or 5+3 or 6+2

9: 3+6 or 4+5 or 5+4 or 6+3

10: 6+4 or 5+5 or 4+6

11: 5+6 or 6+5

12: 6+6
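Rather than eyeballing the list, we can also have T-SQL count the combinations.  This query is a sketch of my own, not part of the original demo; it cross joins the six faces against themselves to compute the exact theoretical odds.

-- Enumerate all 36 ordered combinations of two dice and count each total.
WITH Faces AS
(
 SELECT v as Face FROM (VALUES (1),(2),(3),(4),(5),(6)) as f(v)
)
SELECT d1.Face + d2.Face as TotalRoll,
 COUNT(*) as Combinations,
 ROUND(COUNT(*) * 100.0 / 36, 1) as TheoreticalOddsPct
FROM Faces d1
CROSS JOIN Faces d2
GROUP BY d1.Face + d2.Face
ORDER BY TotalRoll;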

This shows that 7 is the most probable roll.  Now let's change the query a bit to see what we can come up with that looks more like the list above:

SELECT CD.TotalRoll, CD.CumulativeDist, isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0) as LastCumulativeDist
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

Using LAG with the OVER clause, we are able to look at both the probability of the roll being less than or equal to the current value and the probability of it being less than or equal to the previous value.  That isn't very useful until we do some math…

SELECT CD.TotalRoll, CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

Now we can see that the odds of rolling a 2 are 1.8%, and the odds of a 7 are 16.5%.  Where it gets interesting is comparing 6 and 8, which should have the same number of combinations: the 6 has odds of 14.5% and the 8 has odds of 12.9%, which doesn't sound right.  This is probably because we don't have a large enough set to be statistically significant.  So now let's run the code for 10,000 rolls and see if the odds change.

-- to 10,000 rolls
delete from DiceRolls;
GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));
GO 10000
set nocount off;

SELECT CD.TotalRoll,
 round((CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0)) * 100, 1) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

This brings the odds closer together, but they still aren't the same for 6 and 8; they are the same for 5 and 9, but not for 4 and 10, 3 and 11, or 2 and 12.  So let's push it out to 100,000 simulated dice rolls (the same code with GO 100000), which brings the odds for 6/8, 5/9, 4/10, 3/11 and 2/12 closer together, but still not exactly the same.  Let's move on to 1,000,000 simulated dice rolls.

-- to 1,000,000 rolls
delete from DiceRolls;
GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));
GO 1000000
set nocount off;

SELECT CD.TotalRoll,
 round((CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0)) * 100, 1) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

You can see that the more times we roll the simulated dice, the closer we get to matching numbers for 6/8 and the other totals that should have the same odds.

Next let's take the output and drop it into Excel to see what the curve looks like.

Not exactly a bell curve, but you can see that the odds of rolling a seven are just over 16%, making it the most likely roll.

That's it for statistics today with CUME_DIST in TSQL on SQL Server 2012.  I hope this helps you understand what you can do with CUME_DIST on SQL Server.  This is one of my favorite new analytic functions.

More TSQL 2012 Analytics CUME_DIST – Cumulative Distribution

Continuing the TSQL 2012 analytic series, we now move on to the CUME_DIST function.

SQL Server 2012 introduces another new analytic function.  The cumulative distribution function, CUME_DIST(), refers to the probability that the value of a random variable is less than or equal to a given value.

CUME_DIST maps each value to its relative position in a distribution.  For each row, it returns the fraction of rows in the group whose value is less than or equal to the current row's value.
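Put another way, CUME_DIST() for a row is the count of rows with a value less than or equal to the current value, divided by the total row count.  Here is a quick sketch of my own, not from the original post, showing that equivalence against the REVENUE table.

-- CUME_DIST() compared with the same ratio computed by hand.
-- COUNT(*) OVER with ORDER BY defaults to a RANGE frame, so it counts
-- ties as well, which matches the CUME_DIST numerator.
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev,
 count(*) OVER (ORDER BY Revenue) * 1.0
 / count(*) OVER () as ManualCumeDist
 from REVENUE
 where DepartmentID = 1
 order by Revenue;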

Let's jump into the examples:


-- CUME_DIST for Year
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Year) as CumulativeDistributionYear
 from REVENUE
 where DepartmentID = 1
 order by Year;

-- CUME_DIST for Revenue
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev
 from REVENUE
 where DepartmentID = 1
 order by Revenue;

This produces the following output.  In the first example, with the years 1999 to 2012 as the range, the difference between each cumulative distribution value is 7.1%; with evenly distributed values that can be determined by dividing 100 by the number of values, which produces 7.1% (.0714285…).  That's the easy one.  You don't need a TSQL function to calculate 100 divided by the number of rows.

Now on to the second query, which calculates the distribution over the revenue numbers.  I have added another column to the example above that rounds the cumulative distribution to a percentage with one decimal place.

select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev,
 ROUND(CUME_DIST() OVER (ORDER BY Revenue) * 100, 1) as PercentLessThanOrEqual
 from REVENUE
 where DepartmentID = 1
 order by Revenue;

From here you can see that the lowest Revenue value of 10,000 was hit in both 2004 and 2012.  Keeping in mind the 7.14% from the previous example, you can see that 14.3% is double 7.14%, since there are two years with $10,000 in revenue.

This can be useful to determine the percentage of values that are less than or equal to the current value.

That’s it for today.  I hope you find CUME_DIST to be useful.