Working on my Presentations for SQL Connections Fall 2012

Over the last week I have pitched 8 presentations for SQL Connections in Vegas at the end of October 2012.  I'm hoping that 3 of them get accepted.  Here is the list so far.

  • Using the New Analytic Functions in SQL Server 2012
  • Exploring the TSQL Enhancements in SQL Server 2012
  • Unleashing Common Table Expressions in SQL Server
  • Using SSRS 2012 Reports to Analyze SQL Server Health
  • Reusable BI Components in SSRS Reports
  • New DMV’s in SQL Server 2012 and 2008R2
  • Mitigating Risk and Minimizing Downtime with SQL Server Upgrades
  • Using the Under-appreciated OUTPUT Clause

 

It should be a good time; Vegas is always interesting at Halloween.

 

Risk Assessment Matrix for SQL Server Upgrades

A couple of years ago, in my Incident Safety Officer class with the fire department, we studied risk assessment for any incident that the fire department may respond to.  Since that time I have used a similar risk assessment matrix for SQL Server upgrades and other big SQL Server maintenance tasks.

The risk assessment matrix, as it is used by the Incident Safety Officer with the fire department, may make the difference in saving lives or preventing accidents.  For the database administrator, the same matrix can be used to reduce risk overall, and to reduce the risk of being fired if something does go wrong.

Here is how it works: start with the grid shown here:

The left axis has probability from low to high, and the top axis has impact from low to high.

Then run through the possible things that could go wrong with the task that you are working on, and determine where they fit in the matrix.

Then focus on mitigating the risk for the things that fall into the high risk block.  Keep in mind that the High Probability, High Impact = High Risk items are the items that will get you fired, and the Medium Risk items are the things that might get you fired.
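
If it helps to make the exercise concrete, here is a rough sketch of tracking the list in a table and bucketing each item by probability and impact. The UpgradeRisks table and the example risks are purely hypothetical, and the CASE expression is a simplified stand-in for the full matrix.

-- hypothetical worksheet for tracking risks on an upgrade
CREATE TABLE UpgradeRisks
(
 [RiskDescription] varchar(200),
 [Probability] varchar(10), -- Low, Medium or High
 [Impact] varchar(10)       -- Low, Medium or High
);

-- example risks, just to illustrate; fill in your own
insert into UpgradeRisks
values ('Restore to the new version fails', 'Low', 'High'),
 ('Application uses deprecated TSQL', 'High', 'High'),
 ('Index rebuilds run past the maintenance window', 'Medium', 'Medium');

-- simplified matrix: High/High is high risk, Low/Low is low risk, everything else is medium
SELECT RiskDescription, Probability, Impact,
 CASE WHEN Probability = 'High' AND Impact = 'High' THEN 'High Risk'
      WHEN Probability = 'Low' AND Impact = 'Low' THEN 'Low Risk'
      ELSE 'Medium Risk'
 END as RiskLevel
FROM UpgradeRisks
ORDER BY CASE WHEN Probability = 'High' AND Impact = 'High' THEN 1
              WHEN Probability = 'Low' AND Impact = 'Low' THEN 3
              ELSE 2
         END;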

Give it a try next time you are doing a database upgrade, or some type of system maintenance.

 

Cumulative Distribution Function (CDF) – Analyzing the Roll of Dice with TSQL

After the last post on the Cumulative Distribution Function (CDF), or CUME_DIST() as it is known in TSQL, I realized that although I showed how to use it, I didn't really explain what it means or when to use it.  That is where this example comes in.

First, let's take an example that generates simulated dice rolls.  What are the odds, when you roll two six-sided dice, that the total will come up a 12 or a 2 compared to a 6, 7, or 8?  Let's look at an example.

First off, there are two ways to do this.  I could use the Rand() function to generate a single number between 2 and 12; given that Rand() is truly random, that would give exactly the same odds to every number between 2 and 12, but that is not the way two dice behave.
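
Just to illustrate the approach I am not taking, a single Rand() call like this one produces that uniform spread, where every total from 2 through 12 is equally likely (the expression follows the same min/max pattern used in the simulation below):

-- NOT the right model: one uniform random number between 2 and 12
SELECT Cast(((12 + 1) - 2) * Rand() + 2 As int) as UniformRoll;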

Instead, the right way to do it is to use the Rand() function twice to simulate two dice each rolling a number between 1 and 6, then add the two together.  Here is how we do that in TSQL.


GO
USE [analytics_demo];
GO
-- table to hold the simulated rolls of two dice
CREATE TABLE DiceRolls
(
 [FirstDie] int,
 [SecondDie] int
);

GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

-- each execution inserts one row with two random values between 1 and 6
insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));

-- GO followed by a number repeats the batch above, here 1,000 times
GO 1000

set nocount off;
SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls;

When you run this you will see that your results of the final SELECT statement are numbers between 2 and 12.

Looking at the 30 results that you see above, that is not a statistically significant number of rolls to determine the probability of each roll combination.  But the table contains 1,000 rows, so let's take a look at it with CUME_DIST().


SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t
order by TotalRoll;

Which produces the following output:

Which tells you that the odds of rolling a 2 are 0.018, or just short of 2%.  The odds of rolling a 3 or less are 7%, but the odds of it being exactly a 3 are 7% - 2% = 5%.  So if you were betting on the roll of the dice, you could see that a 3 is more than twice as likely as a 2.

Next we take the results and drop them into Excel to create the chart below, where the steeper the line, the more likely that roll is, and the flatter the line, the less likely it is.

From this you can see that the numbers in the middle of the range (6 to 8) are more likely to be rolled than those at the ends of the range (2 and 12).  The reason is that when rolling two six-sided dice there is only one combination that will produce a 2, two combinations that will produce a 3, and even more combinations that will produce a 7, as the list and the query after it count out.

2: 1+1

3: 1+2 or 2+1

4: 1+3 or 2+2 or 3+1

5: 1+4 or 2+3 or 3+2 or 4+1

6: 1+5 or 2+4 or 3+3 or 4+2 or 5+1

7: 1+6 or 2+5 or 3+4 or 4+3 or 5+2 or 6+1

8: 2+6 or 3+5 or 4+4 or 5+3 or 6+2

9: 3+6 or 4+5 or 5+4 or 6+3

10: 4+6 or 5+5 or 6+4

11: 5+6 or 6+5

12: 6+6
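
For an exact check on that list, here is a quick sketch that enumerates all 36 combinations of two dice with a CROSS JOIN and counts the ways to reach each total, along with the exact odds (the Die CTE is just an inline list of the six faces):

-- enumerate all 36 combinations of two six sided dice
WITH Die (Face) AS
(
 SELECT v.n FROM (VALUES (1),(2),(3),(4),(5),(6)) as v(n)
)
SELECT d1.Face + d2.Face as TotalRoll,
 COUNT(*) as Combinations,
 round(COUNT(*) * 100.0 / 36, 1) as ExactOdds
FROM Die as d1
CROSS JOIN Die as d2
GROUP BY d1.Face + d2.Face
ORDER BY TotalRoll;

That puts the exact odds of a 7 at 16.7%, and a 2 or a 12 at 2.8% each, which is what the simulated rolls should converge toward.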

Which shows that 7 is the most probable roll.  Now let's change the query a bit to see what we can come up with that would look more like our list above:

SELECT CD.TotalRoll, CD.CumulativeDist, isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0) as LastCumulativeDist
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

Using LAG and the OVER clause, we are able to look at both the probability of the roll being less than or equal to the current value and the probability of it being less than or equal to the previous value.  That isn't all that useful until we do some math:

SELECT CD.TotalRoll, CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

Now we can see that the odds of rolling a 2 are 1.8%, and the odds of a 7 are 16.5%. Where it gets interesting is when you compare 6 and 8, which should have the same number of combinations: the 6 has odds of 14.5% and the 8 has odds of 12.9%, which doesn't sound right. That is most likely because we don't have a large enough set to be statistically significant. So now let's run the code for 10,000 rolls and see if the odds change.

-- to 10,000 rolls
delete from DiceRolls;
GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));
GO 10000
set nocount off;

SELECT CD.TotalRoll,
 round((CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0)) * 100, 1) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

Which brings the odds closer together, but they still aren't the same for 6 and 8; they are the same for 5 and 9, but not for 4 and 10, 3 and 11, or 2 and 12. So let's push it out to 100,000 simulated dice rolls, which brings the odds for 6/8, 5/9, 4/10, 3/11, and 2/12 closer together, but still not exactly the same. Let's move on to 1,000,000 simulated dice rolls.

-- to 1,000,000 rolls
delete from DiceRolls;
GO
set nocount on;
DECLARE @maxRandomValue int = 6;
DECLARE @minRandomValue int = 1;

insert into DiceRolls
values(Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int) ,
 Cast(((@maxRandomValue + 1) - @minRandomValue) * Rand() + @minRandomValue As int));
GO 1000000
set nocount off;

SELECT CD.TotalRoll,
 round((CD.CumulativeDist - isnull(LAG(CD.CumulativeDist) OVER(ORDER BY TotalRoll), 0)) * 100, 1) as OddsOfThisRoll
from ( SELECT distinct TotalRoll,
 CUME_DIST() OVER (ORDER BY TotalRoll) as CumulativeDist
 FROM (SELECT FirstDie + SecondDie as TotalRoll FROM DiceRolls) as t) CD
order by TotalRoll;

You can see that the more times we roll the simulated dice, the closer we get to matching numbers for 6/8 and the other totals that should have the same odds.

Next, let's take the output and drop it into Excel to see what the curve looks like.

Not exactly a bell curve, but you can see that the odds of rolling a seven are just over 16%, making it the most likely roll.

That's it for statistics today with CUME_DIST in TSQL on SQL Server 2012. I hope this helps you understand what you can do with CUME_DIST on SQL Server. It is one of my favorite new analytic functions.

More TSQL 2012 Analytics CUME_DIST – Cumulative Distribution

Continuing the TSQL 2012 analytic series, now on to the CUME_DIST function.

SQL Server 2012 introduces another new analytic function.  The cumulative distribution function, CUME_DIST(), gives the probability that the value of a random variable is less than or equal to a specified value.

CUME_DIST maps each value to its percentile rank in a distribution: for the current value, it calculates the fraction of values in the group that are less than or equal to it.
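
As a sketch of that arithmetic (assuming the same REVENUE table used in the examples below), the same number can be computed by hand with windowed counts; the default RANGE frame on COUNT(*) OVER (ORDER BY ...) includes ties, which is what makes it line up with CUME_DIST:

-- CUME_DIST computed manually for comparison
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev,
 Cast(COUNT(*) OVER (ORDER BY Revenue) as float)
  / COUNT(*) OVER () as ManualCumeDist
 from REVENUE
 where DepartmentID = 1
 order by Revenue;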

Let's jump into the examples:


-- CUME_DIST for Year
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Year) as CumulativeDistributionYear
 from REVENUE
 where DepartmentID = 1
 order by Year;

-- CUME_DIST for Revenue
select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev
 from REVENUE
 where DepartmentID = 1
 order by Revenue;

Which produces the following output: In the first example, with the years 1999 through 2012 as the range, the difference between each cumulative distribution value is 7.1%, which could just as easily be determined by dividing 100 by the number of values (14), producing 7.1% (0.0714285…). That's the easy one. You don't need a TSQL function to calculate 100 divided by the number of rows.

Now on to the second distribution, which is calculated over the revenue numbers. I have added another column to the example above which rounds the cumulative distribution to a percentage with one decimal place.

select DepartmentID, Revenue, Year,
 CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev,
 ROUND(CUME_DIST() OVER (ORDER BY Revenue) * 100, 1) as PercentLessThanOrEqual
 from REVENUE
 where DepartmentID = 1
 order by Revenue;

From here you can see that the lowest Revenue value of 10,000 was hit in both 2004 and 2012.  Keeping in mind the number 7.14% from the previous example you can see that 14.3% is double 7.14% since there are two years with a $10,000 revenue.

This can be useful to determine the percentage of values that are less than or equal to the current value.
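
For example, here is a sketch (using the same REVENUE table) that keeps only the years whose revenue falls in the top 10% for department 1; since a window function can't go directly in a WHERE clause, the CUME_DIST lives in a derived table:

-- keep only the years whose revenue is in the top 10% for department 1
select cd.Year, cd.Revenue
from (select Year, Revenue,
       CUME_DIST() OVER (ORDER BY Revenue) as CumulativeDistributionRev
      from REVENUE
      where DepartmentID = 1) as cd
where cd.CumulativeDistributionRev >= 0.9
order by cd.Revenue desc;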

That’s it for today.  I hope you find CUME_DIST to be useful.

More TSQL 2012 Analytics PERCENTILE_DISC and PERCENTILE_CONT

Like the other new Analytic functions, PERCENTILE_DISC and PERCENTILE_CONT require the use of the OVER clause.

For this example I will be using almost the same revenue table in the sample database that I set up for the LEAD and LAG posting earlier in the week and the PERCENT_RANK posting yesterday, with just a few values changed to show the differences between these two functions.


CREATE DATABASE [analytics_demo];
GO

USE [analytics_demo];

-- same Revenue Table used in previous examples of the OVER clause

CREATE TABLE REVENUE
(
[DepartmentID] int,
[Revenue] int,
[Year] int
);

insert into REVENUE
values (1,20000,1999),(2,60000,1999),(3,49000,1999),
 (1,40000,2000),(2,40000,2000),(3,60000,2000),
 (1,30000,2001),(2,30000,2001),(3,700,2001),
 (1,90000,2002),(2,20000,2002),(3,80000,2002),
 (1,10300,2003),(2,1000,2003), (3,900,2003),
 (1,10000,2004),(2,10000,2004),(3,10000,2004),
 (1,20000,2005),(2,20000,2005),(3,20000,2005),
 (1,40000,2006),(2,30000,2006),(3,300,2006),
 (1,70000,2007),(2,40000,2007),(3,40000,2007),
 (1,50000,2008),(2,50000,2008),(3,42000,2008),
 (1,20000,2009),(2,60000,2009),(3,600,2009),
 (1,30000,2010),(2,70000,2010),(3,700,2010),
 (1,80000,2011),(2,80000,2011),(3,800,2011),
 (1,10000,2012),(2,90000,2012),(3,900,2012);

The main difference between these two percentile functions is that PERCENTILE_CONT interpolates the appropriate value, even if it is not in the set, while PERCENTILE_DISC picks a percentile value that exists in the set.  Keep in mind that any NULLs in the data set are ignored.  CONT stands for continuous, and DISC stands for discrete distribution, which means the returned percentile will always be a value from the set.

Here is how they work.

First PERCENTILE_CONT to calculate the 90th percentile:


-- now on to PERCENTILE_CONT to calculate the 90th percentile
select DepartmentID, Revenue, Year,
 PERCENTILE_CONT(.9)
 WITHIN GROUP(ORDER BY Revenue)
 OVER(PARTITION BY DepartmentID) as Percentile90
 from REVENUE
order by DepartmentID;

which produces these results:

From here you can see that the 90th percentile for Department 1 is 77,000, but there is no 77,000 in the set; that is the continuous distribution at work.
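
To see where the 77,000 comes from: with the 14 Department 1 revenue values in order, PERCENTILE_CONT(.9) targets row 1 + 0.9 * (14 - 1) = 12.7, so it interpolates 70% of the way from the 12th value (70,000) to the 13th value (80,000), giving 70,000 + 0.7 * 10,000 = 77,000.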

Then PERCENTILE_DISC to calculate the 90th percentile (Discrete distribution):


-- now on to PERCENTILE_DISC to calculate the 90th percentile
select DepartmentID, Revenue, Year,
 PERCENTILE_DISC(.9)
 WITHIN GROUP(ORDER BY Revenue)
 OVER(PARTITION BY DepartmentID) as Percentile90
 from REVENUE
order by DepartmentID;

From the output you can see that the 90th percentile for the Revenue at a DepartmentID of 1 is 80,000, meaning that at least 90% of the values in the set are less than or equal to 80,000.  Since the 90th percentile is calculated using values that exist in the set when using PERCENTILE_DISC, this gives you a different value than the previous example with PERCENTILE_CONT.

Now on to calculating the median and comparing DISC and CONT:


-- calculating the median with PERCENTILE_DISC and PERCENTILE_CONT
 -- http://stevestedman.com/?p=1533
 select DepartmentID, Revenue, Year,
 PERCENTILE_DISC(.5)
 WITHIN GROUP(ORDER BY Revenue)
 OVER(PARTITION BY DepartmentID) as MedianDisc,
 PERCENTILE_CONT(.5)
 WITHIN GROUP(ORDER BY Revenue)
 OVER(PARTITION BY DepartmentID) as MedianCont
 from REVENUE
order by DepartmentID;

With this example you can see below that Department 1 and Department 2 have the same value for PERCENTILE_DISC(.5) and PERCENTILE_CONT(.5), but when you look at Department 3 you can see that they are very different numbers based on the distribution of this set.
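
Running the numbers by hand on the sample data, Department 3 works out to a discrete median of 900, a value that actually exists in the set, while the continuous median interpolates between 900 and 10,000 to land on 5,450, which is why the two columns look so different there.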

So overall the thing to remember here is the difference between CONT and DISC: DISC will return a number that is in your set, while CONT may return an interpolated percentile even if that value doesn't exist in your set.

More blogging to come on the TSQL 2012 Analytic function CUME_DIST tomorrow.

Using the OUTPUT Clause in TSQL for Auditing

The OUTPUT clause is oftentimes underappreciated by the TSQL programmer.  There are many really interesting things that you can do with the OUTPUT clause, from inserting into two tables with one statement to replacing the use of SCOPE_IDENTITY (there is a quick sketch of that second use at the end of this post).  For this posting I am looking into using the OUTPUT clause for auditing.

By auditing in this example, I specifically mean tracking what was changed, and when it was changed.

In the sample below I create a database called output_demo, create a table called Departments, add a few rows, create another table called DeptAudit, then update Departments and insert into DeptAudit in one statement.


USE [Master];

IF EXISTS(SELECT name FROM sys.databases WHERE name = 'output_demo')
BEGIN
 ALTER DATABASE [output_demo] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
 DROP DATABASE [output_demo];
END
GO
CREATE DATABASE [output_demo];
GO

USE [output_demo];
-- create a table to use for the demo
CREATE TABLE [Departments] (
 id int, --would normally be an INT IDENTITY
 department VARCHAR (200),
 parent int
 );

CREATE TABLE [DeptAudit] (
 id int, --would normally be an INT IDENTITY
 old_department VARCHAR (200),
 new_department VARCHAR (200),
 parent int,
 ModificationDate DATETIME
 );
-- insert top level departments
insert into [Departments] (id, department, parent)
values (1, 'Camping', null),
 (2, 'Cycle', null),
 (3, 'Snowsports', null),
 (4, 'Fitness', null);

-- now some sub-departments for camping
insert into [Departments] (id, department, parent)
values (5, 'Tents', 1),
 (6, 'Backpacks', 1),
 (7, 'Sleeping Bags', 1),
 (8, 'Cooking', 1);

select * from Departments;

-- here's the magic to update one table and insert into another in one statement

update dept
 set dept.department = 'Camp Kitchen'
output [deleted].id, [deleted].department, 
       [inserted].department, [deleted].parent, GETUTCDATE()
 INTO DeptAudit
 from Departments dept
 where dept.id = 8;

select * from Departments;
select * from DeptAudit;

Which generates this output.

Here you can see that the update statement changing ‘Cooking’ to ‘Camp Kitchen’ was logged into the DeptAudit table, showing both the old and the new value.

A similar process could be used with INSERT, UPDATE, DELETE, or MERGE statements in order to track anything that is being changed in a table.
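
As a side note on the other use mentioned at the top of this post, here is a minimal sketch of OUTPUT standing in for SCOPE_IDENTITY.  The Products table is hypothetical, not part of the demo above, but it shows how OUTPUT returns the generated identity values directly from the insert, including for multi-row inserts where SCOPE_IDENTITY only gives you the last value.

-- hypothetical table with an identity column
CREATE TABLE [Products] (
 id int IDENTITY(1,1),
 name VARCHAR (200)
 );

-- OUTPUT returns the new identity values as part of the insert itself
insert into [Products] (name)
output [inserted].id, [inserted].name
values ('Canoe'),
 ('Kayak');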