The best Postgres feature you're not using – CTEs aka WITH clauses

Mon, 18 Nov 2013 12:55:56 -0800

SQL by default isn’t typically friendly to dive into, and especially so if you’re reading someone else’s already created queries. For some reason most people throw out principles we follow in other languages such as commenting and composability just for SQL. I was recently reminded of a key feature in Postgres that most don’t use by @timonk highlighting it in his AWS Re:Invent Redshift talk. The simple feature actually makes SQL both readable and composable, and even for my own queries capable of coming back to them months later and understanding them, where previously they would not be.

The feature itself is known as CTEs or common table expressions, you may also here it referred to as WITH clauses. The general idea is that it allows you to create something somewhat equivilant to a view that only exists during that transaction. You can create multiple of these which then allow for clear building blocks and make it simple to follow what you’re doing.

Lets take a look at a nice simple one:

WITH users_tasks AS (
SELECT
users.email,
array_agg(tasks.name) as task_list,
projects.title
FROM
users,
tasks,
project
WHERE
users.id = tasks.user_id
projects.title = tasks.project_id
GROUP BY
users.email,
projects.title
)

Using this I could now just append some basic other query on to the end that references this CTE users_tasks. Something akin to:

SELECT *
FROM users_tasks;

But where it becomes more interesting is chaining these together. So while I have all tasks assigned to each user here, perhaps I want to then find which users are responsible for more than 50% of the tasks on a given project, thus being the bottleneck. To oversimplify this we could do it a couple of ways, total up the tasks for each project, and then total up the tasks for each user per project:

total_tasks_per_project AS (
SELECT
project_id,
count(*) as task_count
FROM tasks
GROUP BY project_id
),
tasks_per_project_per_user AS (
SELECT
user_id,
project_id,
count(*) as task_count
FROM tasks
GROUP BY user_id, project_id
),

Then we would want to combine and find the users that are now over that 50%:

overloaded_users AS (
SELECT tasks_per_project_per_user.user_id,
FROM tasks_per_project_per_user,
total_tasks_per_project
WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
)

Now as a final goal I’d want to get a comma separated list of tasks of the overloaded users. So we’re simply giong to join against that overloaded_users and our initial list of users_tasks. Putting it all together it looks somewhat long, but becomes much more readable. And as a bonus I layered in some comments.

--- Created by Craig Kerstiens 11/18/2013
--- Query highlights users that have over 50% of tasks on a given project
--- Gives comma separated list of their tasks and the project
--- Initial query to grab project title and tasks per user
WITH users_tasks AS (
SELECT
users.id as user_id,
users.email,
array_agg(tasks.name) as task_list,
projects.title
FROM
users,
tasks,
project
WHERE
users.id = tasks.user_id
projects.title = tasks.project_id
GROUP BY
users.email,
projects.title
),
--- Calculates the total tasks per each project
total_tasks_per_project AS (
SELECT
project_id,
count(*) as task_count
FROM tasks
GROUP BY project_id
),
--- Calculates the projects per each user
tasks_per_project_per_user AS (
SELECT
user_id,
project_id,
count(*) as task_count
FROM tasks
GROUP BY user_id, project_id
),
--- Gets user ids that have over 50% of tasks assigned
overloaded_users AS (
SELECT tasks_per_project_per_user.user_id,
FROM tasks_per_project_per_user,
total_tasks_per_project
WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2)
)
SELECT
email,
task_list,
title
FROM
users_tasks,
overloaded_users
WHERE
users_tasks.user_id = overloaded_users.user_id

CTEs won’t always be quite as performant as optimizing your SQL to be as concise as possible. In most cases I have seen performance differences smaller than a 2X difference, this tradeoff for readability is a nobrainer as far as I’m concerned. And with time the Postgres optimizer should continue to get better about such performance.

As for the verbosity, yes I could have done this query in probably 10-15 lines of very concise SQL. Yet, most may not be able to understand it quickly if at all. Readability is huge when it comes to SQL to ensure its doing the right thing. SQL will almost always tell you an answer, it just may not be to the question you think you’re asking. Ensuring your queries can be reasoned about is critical to ensuring accuracy and CTEs are one great way of accomplishing that.

If you’re looking for a deeper resource on Postgres I recommend the book The Art of PostgreSQL. It is by a personal friend that has aimed to create the definitive guide to Postgres, from a developer perspective. If you use code CRAIG15 you’ll receive 15% off as well.

Diving into Postgres JSON operators and functions

Wed, 11 Sep 2013 12:55:56 -0800

Just as PostgreSQL 9.3 was coming out I had a need to take advantage of the JSON datatype and some of the operators and functions within it. The use case was pretty simple, run a query across a variety of databases, then take the results and store them. We explored doing something more elaborate with the columns/values, but in the end just opted to save the entire result set as JSON then I could use the operators to explore it as desired.

Here’s the general idea in code (using sequel):

result = r.connection { |c| c.fetch(self.query).all }
mymodel.results = result.to_json

As the entire dataset was stored as some compressed JSON I needed to do a bit of manipulation to get it back into a form that was workable. Fortunately all the steps were fairly straightforward.

First you want to unnest each result from the json array, in my case this looked like:

SELECT json_array_elements(result)

The above will unnest all of the array elements so I have an individual result as JSON. A real world example would look something similar to:

SELECT json_array_elements(result)
FROM query_results
LIMIT 2;
json_array_elements

{“column_name”:“data_in_here”} {“column_name_2”:“other_data_in_here”} (2 rows)

From here based on the query I would want to get some specific value. In this case I’m going to search for the text key column_name_2:

SELECT json_array_elements(result)->'column_name_2'
FROM query_results
LIMIT 1;
json_array_elements
-----------------------
"other_data_in_here"
(1 rows)

One gotcha I encountered was when I wanted to search for some value or exclude some value… Expecting I could just compare the result of the above in a where statement I was sadly mistaken because the equals operator didn’t translate. My first attempt at fixing this was to cast in this form:

SELECT json_array_elements(result)->'column_name_2'::text

The sad part is because of the operator the cast doesn’t get applied as I’d expect. Instead you’ll want to do:

SELECT (json_array_elements(result)->'column_name_2')::text

Of course theres plenty more you can do with the JSON operators in the new Postgres 9.3. If you’ve already got JSON in your application give them a look today. And while slightly worse, if you’ve got JSON stored in a text field simply cast it with ::json to begin using the operators.

Postgres, PostgreSQL on CRAIG KERSTIENS

The best Postgres feature you're not using – CTEs aka WITH clauses

Diving into Postgres JSON operators and functions