Craig Kerstiens

How I work with Postgres – psql, My PostgreSQL Admin

| Comments

On at least a weekly basis and not uncommonly multiple times in a single week I get this question:

I’ve been hunting for a nice PG interface that works within other things. PGAdmin kinda works, except the SQL editor is a piece of shit

@neilmiddleton

Sometimes it leans more to, what is the Sequel Pro equivilant for Postgres. My default answer is I just use psql, though I do have to then go on to explain how I use it. For those just interested you can read more below or just get the highlights here:

  • Set your default EDITOR then use \e
  • On postgres 9.2 and up \x auto is your friend
  • Set history to unlimited
  • \d all the things

Before going into detail on why psql works perfectly fine as an interface I want to rant for a minute about what the problems with current editors are and where I expect them to go in the future. First this is not a knock on the work thats been done on previous ones, for their time PgAdmin, phpPgAdmin, and others were valuable tools, but we’re coming to a point where theres a broader set of users of databases than ever before and empowering them is becoming ever more important.

Empowering developers, DBA’s, product people, marketers and others to be comfortable with their database will lead to more people taking advantage of whats in their data. pg_stat_statements was a great start to this laying a great foundation for valuable information being captured. Even with all of the powerful stats being captured in the statistics of PostgreSQL so many are still terrified when they see something like:

                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=4.25..8.62 rows=100 width=107) (actual time=0.126..0.230 rows=100 loops=1)
   Hash Cond: (purchases.user_id = users.id)
   ->  Seq Scan on purchases  (cost=0.00..3.00 rows=100 width=84) (actual time=0.012..0.035 rows=100 loops=1)
   ->  Hash  (cost=3.00..3.00 rows=100 width=27) (actual time=0.097..0.097 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 6kB
         ->  Seq Scan on users  (cost=0.00..3.00 rows=100 width=27) (actual time=0.007..0.042 rows=100 loops=1)
 Total runtime: 0.799 ms
(7 rows)

Empowering more developers by surfacing this information in a digestable form, such as building on top of pg_stat_statements tools such as datascope by @leinweber and getting this to be part of the default admin we will truly begin empowering a new set of user.

But enough of a detour, those tools aren’t available today. If you’re interested in helping build those to make the community better please reach out. For now I live in a work where I’m quite content with simple ole psql here’s how:

Introducing django-db-tools

| Comments

For any successful web application there is likely to come a time when you need to conduct some large migration on the backend. I dont mean simple add a column here or add an index there, but rather truly sizeable migrations… Going from MySQL to Postgres or migrating from an older version of Postgres such as a 32 bit instance to a newer 64 bit instance. In these cases the default approach is to just schedule downtime often throwing up a splash screen saying so.

For many sites this approach is simply wrong and lazy, with little effort you can improve the experience and there by ease the burden in conducting these types of migrations. By having the ability to turn your site into a read only mode which Simon Wilson talked about in his post on Lanyrd you can still continue to operate just in a limited capacity. Andrew Godwin further talks about some of this as well in regards to the Lanyrd move and even includes the script they used to migrate data from MySQL to Postgres. Though just in talking with Simon about this a week ago it occurred to me they had not released the code for their read-only mode.

Finally onto the announcing, today I’m releasing django-db-tools. This is currently a very lightweight utility that allows you to flip your site into two modes.

More on Postgres Performance

| Comments

If you missed my previous post on Understanding Postgres Performance its a great starting point. On this particular post I’m going to dig in to some real life examples of optimizing queries and indexes.

It all starts with stats

I wrote about some of the great new features in Postgres 9.2 in the recent announcement on support of Postgres 9.2 on Heroku. One of those awesome features, is pg_stat_statements. Its not commonly known how much information Postgres keeps about your database (beyond the data of course), but in reality it keeps a great deal. Ranging from basic stuff like table size to cardinality of joins to distribution of indexes, and with pg_stat_statments it keeps a normalized record of when queries are run.

Sharding your database

| Comments

I’m increasingly encountering users on Heroku that are encountering the need to shard their data. For most users this is something you delay as long as possible as you can generally go for sometime before you have to worry about it. Additionally scaling up your database is often a reasonable approach early on and something I encourage as a starting point as scaling up is easy to do with regards to databases. However, for the 1% of users that do need to shard when the time comes many are left wondering where to start, hence the following guide.

How I Write SQL

| Comments

I recently got asked by a friend and former co-worker how I write SQL. At first this caught me by surprise and I assumed there was nothing different, but after a few additional comments on it, it became clear most people have no concept for creating clean readable SQL. So without further adieu here’s how I write SQL, with a built up example query.

Using Postgres Arrays in Django

| Comments

A few weeks back I did a brief feature highlight on Postgres arrays. Since that time I’ve found myself using them with increasing regularity on small side projects. Much of this time I’m using Django and of course not opting to write raw SQL to be able to use arrays. Django actually makes it quite simple to work with Arrays in Postgres with a package by Andrey Antukh. Lets get started by installing two libraries:

pip install djorm-ext-pgarray
pip install djorm-ext-expressions

The first library is for support for the array field type, the second allows us to more easily mix bits of SQL within the Django ORM.

Redis in my Postgres

| Comments

Yesterday there was a post which hit Hacker News that talked about using SQL to access Mongo. While this is powerful I think much of the true value was entirely missed within the post.

SQL is an expressive language, though people are often okay with accessing Mongo data through its own ORM. The real value is that you could actually query the data from within Postgres then join across your data stores, without having to do some ETL process to move data around. Think… joining sales data from Postgres with user reviews stored in Mongo or searching for visits to a website (retained in redis) against purchases by user in Postgres.

The mechanism pointed out was a MongoDB Foreign Data Wrapper. A Foreign Data Wrapper or FDW essentially lets you connect to an external datastore from within a Postgres database. In addition to the Mongo FDW released the other day there’s many others. For example Postgres 9.0 and up ships with one called db_link, which lets you query and join across two different Postgres databases. Beyond that there’s support for a variety of other data stores including some you may have never expected:

Lets look at actually getting the Redis one running then see what some of the power of it really looks like. First we have to get the code then build it:

Understanding Postgres Performance

| Comments

For many application developers their database is a black box. Data goes in, comes back out and in between there developers hope its a pretty short time span. Without becoming a DBA there’s a few pieces of data that most application developers can easily grok which will help them understand if their database is performing adequately. This post will provide some quick tips that allow you to determine whether your database performance is slowing down your app, and if so what you can do about it.

Understanding your Cache and its Hit Rate

The typical rule for most applications is that only a fraction of its data is regularly accessed. As with many other things data can tend to follow the 80/20 rule with 20% of your data accounting for 80% of the reads and often times its higher than this. Postgres itself actually tracks access patterns of your data and will on its own keep frequently accessed data in cache. Generally you want your database to have a cache hit rate of about 99%. You can find your cache hit rate with:

SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM 
  pg_statio_user_tables;

Arrays in Postgres

| Comments

Postgres out of the box has an abundance of datatypes, from standard numeric datatypes to geometric or even network datatypes. With extensions you can get even more out of it as earlier discussed with hStore. Though with all of the datatypes its easy to miss out on some of them that are there, in fact one of my favorites is often missed entirely. The Array datatype lets you do just as you’d expect, store an array inside Postgres. With this you can often get some of the functionality you’d want in a single table when you might traditionally have expanded to multiple tables.

The broader question may be why you’d actually want to use an array. One good reason may be if you’re an application developer its how you think of your data, so why not model it the same way. As you’ll see below it can be easier than joining and aggregating across a set of rows. Also depending on your case you performance could be improved, though mileage may vary here as it does depend on the data you’re storing.

First a bit of a hacky example… Lets say you have a basic website that sells stuff, and instead of having a purchase ID and a total you want to include the quantity, id, and price of each item in a single row. With a bit of a messy foreign key (using a decimal) you could store all of this within a single row:

CREATE TABLE purchases (
    id integer NOT NULL,
    user_id integer,
    items decimal(10,2) [100][1],
    occurred_at timestamp
);

With this table I could have an array that holds multiple records of:

  • The item purchased
  • The quantity
  • The price

Rapid API Prototyping with Heroku Postgres Dataclips

| Comments

For small and large applications there often comes a time where you’re busy creating an API. The API creation process usually takes the form of something like: Design your API, Implement your API, Test and Evaluate, Rinse and Repeat. Historically with implementing the API fully you can’t see how you truly feel about the result, causing this cycle to take longer than it should. Heroku Postgres has Dataclips, which (among other things) can be used for quickly prototyping APIs. Dataclips allows you to easily share data, but more importantly consume it in a form much like you would a restful API. Lets take a look at how this would work:

Given a schema

schema

We can see from the screen shot of the schema above we can see we have a few tables. These tables are the complete works of Shakespeare thanks to opensourceshakespeare. Lets take a couple of hypothetical endpoints we’ve decided on that we’d like to expose for users and test as an API.

  • The number of works per year
  • Drone factory (this is a fun one courtesy of Richard Morrison - @mozz100 essentially who has the longest paragraphs on average in his works.