The best Postgres feature you're not using – CTEs aka WITH clauses
SQL by default isn’t typically friendly to dive into, and especially so if you’re reading someone else’s already created queries. For some reason most people throw out principles we follow in other languages such as commenting and composability just for SQL. I was recently reminded of a key feature in Postgres that most don’t use by @timonk highlighting it in his AWS Re:Invent Redshift talk. The simple feature actually makes SQL both readable and composable, and even for my own queries capable of coming back to them months later and understanding them, where previously they would not be.
The feature itself is known as CTEs or common table expressions, you may also here it referred to as WITH
clauses. The general idea is that it allows you to create something somewhat equivilant to a view that only exists during that transaction. You can create multiple of these which then allow for clear building blocks and make it simple to follow what you’re doing.
Lets take a look at a nice simple one:
WITH users_tasks AS (
SELECT
users.email,
array_agg(tasks.name) as task_list,
projects.title
FROM
users,
tasks,
project
WHERE
users.id = tasks.user_id
projects.title = tasks.project_id
GROUP BY
users.email,
projects.title
)
Using this I could now just append some basic other query on to the end that references this CTE users_tasks
. Something akin to:
SELECT *
FROM users_tasks;
But where it becomes more interesting is chaining these together. So while I have all tasks assigned to each user here, perhaps I want to then find which users are responsible for more than 50% of the tasks on a given project, thus being the bottleneck. To oversimplify this we could do it a couple of ways, total up the tasks for each project, and then total up the tasks for each user per project:
total_tasks_per_project AS (
SELECT
project_id,
count(*) as task_count
FROM tasks
GROUP BY project_id
),
tasks_per_project_per_user AS (
SELECT
user_id,
project_id,
count(*) as task_count
FROM tasks
GROUP BY user_id, project_id
),
Then we would want to combine and find the users that are now over that 50%:
overloaded_users AS (
SELECT tasks_per_project_per_user.user_id,
FROM tasks_per_project_per_user,
total_tasks_per_project
WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
)
Now as a final goal I’d want to get a comma separated list of tasks of the overloaded users. So we’re simply giong to join against that overloaded_users
and our initial list of users_tasks
. Putting it all together it looks somewhat long, but becomes much more readable. And as a bonus I layered in some comments.
--- Created by Craig Kerstiens 11/18/2013
--- Query highlights users that have over 50% of tasks on a given project
--- Gives comma separated list of their tasks and the project
--- Initial query to grab project title and tasks per user
WITH users_tasks AS (
SELECT
users.id as user_id,
users.email,
array_agg(tasks.name) as task_list,
projects.title
FROM
users,
tasks,
project
WHERE
users.id = tasks.user_id
projects.title = tasks.project_id
GROUP BY
users.email,
projects.title
),
--- Calculates the total tasks per each project
total_tasks_per_project AS (
SELECT
project_id,
count(*) as task_count
FROM tasks
GROUP BY project_id
),
--- Calculates the projects per each user
tasks_per_project_per_user AS (
SELECT
user_id,
project_id,
count(*) as task_count
FROM tasks
GROUP BY user_id, project_id
),
--- Gets user ids that have over 50% of tasks assigned
overloaded_users AS (
SELECT tasks_per_project_per_user.user_id,
FROM tasks_per_project_per_user,
total_tasks_per_project
WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
)
SELECT
email,
task_list,
title
FROM
users_tasks,
overloaded_users
WHERE
users_tasks.user_id = overloaded_users.user_id
CTEs won’t always be quite as performant as optimizing your SQL to be as concise as possible. In most cases I have seen performance differences smaller than a 2X difference, this tradeoff for readability is a nobrainer as far as I’m concerned. And with time the Postgres optimizer should continue to get better about such performance.
As for the verbosity, yes I could have done this query in probably 10-15 lines of very concise SQL. Yet, most may not be able to understand it quickly if at all. Readability is huge when it comes to SQL to ensure its doing the right thing. SQL will almost always tell you an answer, it just may not be to the question you think you’re asking. Ensuring your queries can be reasoned about is critical to ensuring accuracy and CTEs are one great way of accomplishing that.
If you’re looking for a deeper resource on Postgres I recommend the book The Art of PostgreSQL. It is by a personal friend that has aimed to create the definitive guide to Postgres, from a developer perspective. If you use code CRAIG15 you’ll receive 15% off as well.