The following is Chapter 3 from my Common Table Expressions book it its entirety.
CTEs Instead of Derived Tables
One time I was handed an eight page print out with an 8pt font showing a single SQL statement. It was handed to me with a question about why it was not returning the right data. After looking the query over for a few hours, I discovered that there was a derived table (subquery) that had been used 16 different times in the query. It was designed to do one thing, but somewhere along the way the developer changed it to do something slightly different. When they made that change, they should have updated the derived table query in 16 locations, but instead only the first 13 locations where updated and the last three derived tables were overlooked. The change from the old to the new was a small change that only added one additional join, but it dramatically changed the output of that derived table.
READER NOTE: Please run the CTEBookSetup.sql script in order to follow along with the examples in this chapter. The script mentioned in this chapter is available at http://SteveStedman.com.
In this example I could have just updated those last three derived tables and fixed the problem, but instead I decided to break that derived table out using a CTE. The CTE solution reduced the overall query size from eight pages to a page and a half. It made the query much easier to understand, and eliminated the duplicate derived table query that was so prone to errors.
If you haven’t had to work on a gigantic query like this at one time or another consider yourself lucky. If you continue your career with SQL Server you will eventually come across one of these huge queries and I hope that having a solid working knowledge of CTEs will help you.
In the previous chapter we looked at a very simple derived table query and then wrote a CTE that did something similar. For this chapter we will learn how do to this in a more complex example.
CTEs will give an option so we do not have to create a gigantic query. Here is an example of a big query converted to a smaller, more understandable query using CTEs. The first query using derived tables takes 20 lines of T-SQL code and even has a typo in the third derived table where “Cooking” is missing the “g”. The two queries are intended to produce the same output and would if it weren’t for that typo. It just wasn’t feasible to include an eight-page query in the book; the upcoming example will be similar, just on a smaller scale.
In the following comparison example think about which query would be better to work on, or make changes to:
SELECT q1.department, q2.department as subDepartment,
q3.department as nextLevel
FROM (SELECT id, department, parent
FROM Departments
WHERE archived = 0
AND department != ‘Snowsports’
AND department != ‘Cooking’) as q1
LEFT JOIN (SELECT id, department, parent
FROM Departments
WHERE archived = 0
AND department != ‘Snowsports’
AND department != ‘Cooking’) as q2
ON q1.id = q2.parent
LEFT JOIN (SELECT id, department, parent
FROM Departments
WHERE archived = 0
AND department != ‘Snowsports’
AND department != ‘Cookin’) as q3
ON q2.id = q3.parent
WHERE q1.parent is null;
Now we convert the same query to a CTE and fix the typo in “Cookin” and it only takes 12 lines of T-SQL code.
;WITH DeptCTE(id, department, parent) AS
(
SELECT id, department, parent
FROM Departments
WHERE archived = 0
AND department != ‘Snowsports’
AND department != ‘Cooking’
)
SELECT q1.department, q2.department as subDepartment,
q3.department as nextLevel
FROM DeptCTE as q1
LEFT JOIN DeptCTE as q2 ON q1.id = q2.parent
LEFT JOIN DeptCTE as q3 ON q2.id = q3.parent
WHERE q1.parent is null;
In the CTE version we know that Cooking is spelled correctly by simply looking in one place. There are only 12 lines of code, and the LEFT JOINs are cleaner and what they are doing is more obvious by not being spread over six lines each.
Overwhelming Derived Tables
Once the art converting derived tables into CTEs has been mastered it will be surprising how many of those overwhelming queries with multiple derived table references will no longer be created.
Cleanup with a CTE
The complexity of cleaning up a derived table query and converting it into a CTE is related to the complexity of the derived table query. If we are working with an eight page query with 16 occurrences of a derived table, this is going to be more work than the example we start with here, but the value in fixing up that big query is going to be much greater.
Here are the steps that can be used to convert a derived table query into a CTE.
- Find the first occurrence of the derived table query to be broken out. Create a name for it and add “CTE” to the name. Copy the derived table definition, including the parentheses, and leave the new name as the placeholder.
- Paste the query, copied earlier, above the SELECT statement.
- At the top of the query add the CTE declaration using the same name from step 1.
- Find all other occurrences of the same derived table query and replace them with the CTE name.
- Clean up the formatting and test the query.
To start with, the following query has a derived table query that is used twice:
SELECT q1.department, q2.department as subDepartment
FROM (SELECT id, department, parent
FROM Departments) as q1
INNER JOIN (SELECT id, department, parent
FROM Departments) as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
Figure 3.1 Results from a query that uses the same derived table twice.
We will now walk through the steps to break out a derived table query into a CTE, keeping in mind that the key changes in each step will be bolded in the examples:
READER NOTE: The next step will not run by itself and is there to show what this query looks like when it is partially done.
Step 1. Find the first occurrence of the derived table query intended to be broken out. Create a name for it and add “CTE” to the name. Copy the derived table, including the parentheses, and leave the name as the placeholder.
The first occurrence is located in the 2nd and 3rd lines of the original query. Copy the derived table query out and replace it with the CTE name. The following query should still be on the clipboard.
(SELECT id, department, parent
FROM Departments)
The following query will be left in the query editor.
SELECT q1.department, q2.department as subDepartment
FROM departmentsCTE as q1
INNER JOIN (SELECT id, department, parent
FROM Departments) as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
Step 2. Paste the query copied earlier into the SELECT statement.
The query should now look like the following:
(SELECT id, department, parent
FROM Departments)
SELECT q1.department, q2.department as subDepartment
FROM DepartmentsCTE as q1
INNER JOIN (SELECT id, department, parent
FROM Departments) as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
Step 3. At the top of the query, add the CTE declaration using the same name from Step 1. The query should now look like the following:
;WITH departmentsCTE(id, department, parent)
AS
(
SELECT id, department, parent
FROM Departments
)
SELECT q1.department, q2.department as subDepartment
FROM departmentsCTE as q1
INNER JOIN (SELECT id, department, parent
FROM Departments) as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
Technically, at this point, it is a CTE, but the query is not taking full advantage of the CTE capabilities. There is still another occurrence of the derived query to fix up in step 4.
Step 4. Find all other occurrences of the same derived table query and replace them with the CTE name.
;WITH departmentsCTE(id, department, parent) AS
(
SELECT id, department, parent
FROM Departments
)
SELECT q1.department, q2.department as subDepartment
FROM departmentsCTE as q1
INNER JOIN departmentsCTE as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
The CTE is now complete from a functionality perspective, but it is not very clearly formatted and it hasn’t been tested.
Step 5. (Optional). Clean up the formatting and indenting of the query (if necessary). In our example there is no change from Step 4 since we used proper formatting as we built the query.
;WITH departmentsCTE(id, department, parent) AS
(
SELECT id, department, parent
FROM Departments
)
SELECT q1.department, q2.department as subDepartment
FROM departmentsCTE as q1
INNER JOIN departmentsCTE as q2
ON q1.id = q2.parent
WHERE q1.parent is null;
When cleaning up the formatting, it should be cleaned up to match the coding standards of an organization if it has them. It doesn’t have to exactly match what is shown in the previous example, as long as it is meeting the local coding standards.
Once formatted, it’s time to move on to testing the CTE. It is common to start testing by running the original query and comparing the results to the CTE just created. Depending on the CTE, more extensive testing may be needed, but generally, if the steps were followed correctly we will get the same results as the original query.
Figure 3.2 Output from the newly converted CTE.
Comparing these results to the original derived table query will show that they are the same.
Now imagine using the same process to reduce an 8 page print out of T-SQL down to a page and a half by replacing 16 instances of the same derived table with one CTE that is referenced 16 times. Steps 1, 2 and 3 would be the same. Step 4 would require finding all remaining instances of the derived table, and step 5 would be the same. Once complete, having to read through a page and a half of T-SQL to find a problem is much easier than trying to understand 8 pages.
When going through the process of converting derived table queries into CTEs, keep in mind the CTE is not simply a replacement for every derived table. There are many places where a derived table is still a better option than a CTE. It is also common to have a CTE query that contains derived tables. When working through conversions of Derived Tables into CTEs it will become clearer where it is appropriate. A good check is to ask the question of which is more complex, the CTE version of the query, or the derived table version of the query.
Performance Considerations
One of the tendencies of developers when they learn a new trick or how to use a new tool is to start using it everywhere without a full understanding of all the implications. I know because I have done that exact same thing. I have been disappointed when I started using something new only to discover that there was some key performance flaw with it.
An example of this was with temp tables. When I first learned about temp tables I was so excited I started using them more than I should have. As I used them more I started to realize some of the performance implications around temp tables in certain scenarios, and later, after a fair amount of foul language, I stopped using temp tables all together. Over time, with experience and maturity, I learned more about how temp tables perform. I finally admitted that there is a right time and a wrong time to use a temp table. Today I am not opposed to temp tables as long as their performance is understood.
When I first learned about CTEs, and having made mistakes in the past (specifically with temp tables), I was cautiously excited. I decided to learn more about the performance implications before completely adopting their use.
One of the common misconceptions around CTEs when I first started working with them was that they are a shared result set that gets run only once, even if it is referenced multiple times. There were some bloggers who raved about the performance increase because they claimed the CTE query is only run once. This is not the case. For a CTE with multiple references, the CTE query is run multiple times.
;WITH departmentsCTE(id, department, parent) AS
(
SELECT id, department, parent
FROM Departments
)
SELECT q1.department, q2.department as subDepartment
FROM departmentsCTE as q1
INNER JOIN departmentsCTE as q2 ON q1.id = q2.parent
WHERE q1.parent is null;
The departmentsCTE Common Table Expression is referenced twice. This is considered a self-join because the query is joining the same table twice. Here, the inside CTE query shown will be run twice.
(SELECT id, department, parent
FROM Departments)
To prove this, just include the actual execution plan when running the query. It will produce the following plan that shows the query against the Departments table is being run twice and is causing a table scan to be run twice.
Figure 3.3 Execution plan showing that the Departments table is scanned twice.
Normally we would also want to look into why it is causing a table scan, instead of an index seek or index scan. In this scenario we are getting the table scan because of the size of the table, the lack of indexes, and the lack of any filtering when selecting from the Departments table.
In this example, using a small table, it is not much of a performance issue. If the CTE query was larger and slower, or more load intensive, then multiple references to the same CTE query could have some very significant performance impacts.
The main thing to consider, relating to performance when converting a derived table query over to a CTE, is that it shouldn’t change the performance at all. What was slow as a derived table will still be slow as a CTE. Likewise, what was fast will still be fast.
To run the original derived table query (including the actual execution plan) it will be exactly the same as the CTE version we just looked at.
Figure 3.4 Original derived table query with similar plan to the CTE query.
With recursive CTEs there are some major performance increases to be seen. For the conversion of derived tables to CTEs the main purpose is maintainability, readability, and reducing mistakes, not performance changes.
Reducing Mistakes
For the computer science student, the experienced senior programmer or anyone in between, the topic of clean code brings up many conversations on various beliefs and methodologies. These may range from original C programming books from 35 years ago to the latest agile methodologies or test driven developments (TDD). The one thing that stands out in any programming methodology is that duplicate code causes problems, and reducing duplicate code leads to fewer problems.
Duplicated code in any programming language is generally considered undesirable and leads to problems with understanding, additional bugs and more complication in the overall maintenance process. Most of these were covered on the eight page print out with the same derived table reference being used 16 times.
Another area where mistakes are introduced is when you are writing a new query and then implement an older query that is copied, modified, and then used as the starting point. Often times we end up with a bunch of left-overs that aren’t really needed. To grab the CTE from one query and to reuse it in another query is much safer because there is a clear definition of where the CTE query begins and where it ends.
One way to help reduce mistakes is to use the CTE as a way to debug part of the query. This can be done quickly by commenting out the main query that uses the CTE and replace it with a SELECT * FROM query to view the results from the CTE. The following example comments out part of the main query to see if the CTE logic alone is working:
;WITH departmentsCTE(id, department, parent) AS
(
SELECT id, department, parent
FROM Departments
)
SELECT * FROM departmentsCTE;
–SELECT q1.department, q2.department as subDepartment
— FROM departmentsCTE as q1
— INNER JOIN departmentsCTE as q2 ON q1.id = q2.parent
— WHERE q1.parent is null;
In addition to aiding in the debugging process, proper use of CTEs will lead to better overall T-SQL code by reducing the amount of copy and paste or duplicate code in queries.
Summary
Without a CTE we could eventually end up writing a huge query that is a copy and paste of derived tables, and over time that copy and paste may lead to confusion, errors, and problems in the query. A CTE query reduces the confusion, errors, and problems by consolidating those queries into a Common Table Expression rather than using derived tables that are a copy and paste nightmare.
It is easy to convert a derived table query into a CTE query using the following 5 easy steps.
- Find the first occurrence of the derived table query to break out. Create a name for it and add “CTE” to the name. Copy the derived table, including the parentheses, and leave the name as the placeholder.
- Paste the query, previously copied, to the SELECT statement.
- At the top of the query, add the CTE declaration using the same name from step 1.
- Find all other occurrences of the same derived table query and replace them with the CTE name.
- Clean up the formatting and test the query.
The purpose of replacing a derived table query with a CTE query is to improve the readability and maintainability of the T-SQL code. By following the 5 steps to convert a derived table query into a CTE query their performance should be identical, no better or worse than the original derived table query.
Once the derived table has been converted to a CTE, the CTE can be referenced multiple times in the same query. Keep in mind that each time it is referenced it gets run again.
Learning how to convert a derived table to a CTE does not imply that every derived table should be converted to a CTE. In fact there are many cases where a CTE query will include derived tables. After converting a few Derived Tables to CTEs the understanding of what queries should convert to CTEs, and which are just fine as a derived table will be easier to spot. A good general rule to consider is that if a CTE query is more complex and harder to follow than the derived table query then perhaps it should not be converted to a CTE.
Converting overwhelming derived table queries to CTEs reduces mistakes, improves the debugging process, and creates queries that are much easier to understand.
Points to Ponder – CTEs Instead of Derived Tables
- Most derived tables can be easily converted to CTEs in just a few simple steps.
- A CTE is similar to a derived table but the syntax is different. The code for the CTE is generally more readable than that for a derived table.
- Copy and paste errors can be reduced by using a CTE instead of duplicate derived tables in a large query.
- It is easier to debug a complex query that is broken up with a CTE compared to a similar query using derived tables. The CTEs can easily be run one by one outside of the full query.
- Using a CTE (even with multiple references) doesn’t improve the performance over a similar query written with derived tables, it does improve the readability.
- For a CTE that is referenced multiple times (like a self-join) the CTE query is not reused, it is executed multiple times.
- CTEs should not be considered a replacement for every derived table. There are many occasions where the derived table is just fine, and doesn’t need to be converted to a CTE.
Review Quiz – Chapter Three
- Using a CTE rather than a derived table will increase query performance.
- TRUE
- FALSE
- CTEs can be a better option than a derived table query for these two reasons? (Choose the two correct answers.)
- A CTE can simplify a complex query compared to multiple instances of a derived table.
- Every derived table reference should be replaced with a CTE.
- You get increased performance with a CTE.
- Using a CTE rather than multiple copy and paste versions of the same derived table leads to fewer mistakes.
Answer Key
- The correct answer is FALSE (b). Using a CTE rather than a derived table query performs the same. There is no performance benefit either way. However there is a sustainability and readability improvement.
- A CTE can simplify a complex query over derived table usage, and it can reduce mistakes by not having duplicate code. A CTE does not improve performance, and not every derived table should be converted to a CTE. Answers (a) and (d) are the correct answers.
More from Stedman Solutions:
Steve and the team at Stedman Solutions are here for all your SQL Server needs.
Contact us today for your free 30 minute consultation..
We are ready to help!