Unfinished Business with Postgres

2022-05-18 [Last modified: 2024-10-21]

By Craig Kerstiens

Postgres

Seven years ago I left Heroku. Over the past few weeks there has been a lot of discussion about Heroku: its demise, whether it was a success or failure, and its current state. Much of this was prompted by the recent and ongoing security incident, but as others have pointed out, the product has been frozen in time for some years now.

I’m not here to rehash the many debates about what the next Heroku is, whether it was a success or failure, or how it could have been different. Heroku is still a gold standard of developer experience and is often used in pitches as “Heroku for X”. Many tried to imitate Heroku for years and failed. Heroku generates sizable revenue to this day. Without Heroku we’d all be in a worse place from a developer experience perspective.

But I don’t want to talk about Heroku the PaaS. Instead I want to share a little of my story and some of the story of Heroku Postgres (DoD, or Department of Data, as we were known internally). I was at Heroku in a lot of product roles over the course of 5 years, but most of my time was with that DoD team. When I left Heroku it was a team of about 8 engineers running and managing over 1.5 million Postgres databases. At that scale a one-in-a-million problem happened about once a week, so we engineered a system that allowed us to scale without requiring a 50-person ops team just for databases.

This will be a bit of a personal journey, but I hope it also gives some insight into what the vision was and into what the possibilities are for Postgres services in the future.

I wasn’t originally hired to work on anything related to Postgres. As an early PM I first worked on billing, then later on core languages and launching the Python support for Heroku. A few months in, I found myself having conversations with many of the internal engineers about Postgres. “Why aren’t you using hstore?”, “Transactional DDL to rollback transactions is absolutely huge!”, “Concurrent index creation runs in the background while not holding a lock, this should always be how you add an index.” Now, we had some great engineers, but the typical engineer interacted with the database through ActiveRecord and didn’t want to think about it.

As I found myself evangelizing Postgres, suddenly I was being recruited by the lead of the Postgres team to come and do marketing. I didn’t know anything about marketing and thought they were joking. A couple of months later I found myself doing PM and marketing for DoD.

Why did Heroku pick Postgres?

But I’m getting a little bit ahead of myself. How and why did Heroku even start doing Postgres? Running a PaaS (platform as a service) is a lot of work, and running a database is a lot of work. In some sense doing both is splitting your focus. And I’m increasingly coming to the conclusion that platform companies will do best to focus on their platform and data companies will do best to focus on the data. Well, way back in the day we had all these Rails developers asking for a database and we thought, how hard could it be? (It was more work than we expected.) So we’re gonna run a database, and the question becomes which one? Most folks didn’t have a strong opinion, but one of our security/ops engineers chimed in: “Postgres has always had a good record of being safe and reliable for data, I’d go with it.”

And with that, we were building and launching Heroku Postgres.

The first version of Heroku Postgres had no automation and no self-service provisioning: you opened a ticket and we’d correspond and ask when you wanted us to set it up for you.

Before Heroku Postgres was Postgres

The very first version, before we committed to Heroku Postgres, was internally known as Shen. The model was much more akin to the shared hosting that was common for that time and place. Within a single instance we’d pack in a lot of Postgres databases, simply running createdb, creating a user for you, then giving you access to that DB. This worked fine for those just kicking the tires and building small apps, but despite our telling them not to use it for production, people continued to.

While Heroku had offered Postgres before Heroku Postgres, around 2010 it became its own project, with its own orchestration layer that treated databases as more of a first-class citizen. The initial codename for the project was “bifrost”.

The original design

Heroku Postgres was designed to have a central FSM (finite state machine) that would orchestrate the databases. To my knowledge this design pattern came from @PvH’s work on and appreciation of video game system design. It felt like a novel approach for the software being built at the time. The fact that it is a more common design pattern now shows what a great design it was, and how far ahead of its time that team was building.

It was a basic Ruby application that would go out and check the state of the databases and go through the needed steps when interacting with either Postgres or AWS APIs. AWS APIs back then were not what they are now, and this design allowed us to build in the necessary retries, redundancies, and quality of service.
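
To make the pattern concrete, here is a minimal sketch of a state machine that walks a database through its lifecycle and retries when an external call fails. It is written in Python rather than the original Ruby, and the state names, the Database class, and the make_flaky_action helper are all hypothetical illustrations; none of this reflects the actual Heroku Postgres code.

```python
from dataclasses import dataclass, field

# Hypothetical lifecycle states; the real system's states are not described here.
TRANSITIONS = {
    "requested": "provisioning",
    "provisioning": "configuring",
    "configuring": "available",
}


@dataclass
class Database:
    name: str
    state: str = "requested"
    attempts: int = 0
    history: list = field(default_factory=list)


def step(db, action, max_attempts=3):
    """Advance db by one state if action succeeds; otherwise stay put for a later retry."""
    if db.state not in TRANSITIONS:
        return db.state  # terminal state ("available" or "failed"), nothing to do
    try:
        action(db)
    except Exception:
        db.attempts += 1
        if db.attempts >= max_attempts:
            db.state = "failed"
            db.history.append(db.state)
        return db.state
    db.attempts = 0
    db.state = TRANSITIONS[db.state]
    db.history.append(db.state)
    return db.state


def run(db, action, max_ticks=20):
    """Tick the machine until it reaches a terminal state or runs out of ticks."""
    for _ in range(max_ticks):
        if db.state not in TRANSITIONS:
            break
        step(db, action)
    return db.state


def make_flaky_action(failures):
    """Stand-in for an unreliable external API: fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def action(db):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("transient API error")

    return action
```

Because each tick only looks at the current state and decides the next step, a transient failure simply leaves the database where it was, and the next tick tries again. That property is what makes this shape a reasonable fit for unreliable external APIs.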

Sometimes you’re good, sometimes you’re lucky

Sometimes you’re good, sometimes you’re lucky, sometimes it’s both. Over time we built out more reliable provisioning and monitoring of databases. In early 2011 we felt a need to continue improving this. It was the early days of EC2 and reliability wasn’t its strongest spot; instances could go offline.

Per @danfarina “We were thinking about working on replication but skipping over archiving (by more carefully managing state between servers, e.g. by directly moving things through rsync, which was/is still pretty normed postgres stuff predating pg_basebackup) but then one of the shared databases (shen) had a near miss when a disk was lost that caused a rather horrific amount of effort and some nailbiting in restoring from pgbackups.”

“We then decided to take the previously rejected approach of building everything up on the archives. The first versions were s3cmd based, and were something of a prototype, upon request from PvH to ship something more raw, but more quickly. We had just got early versions into production when the major disk apocalypse hit in April 2011, though it was in something of an evaluation period and it was not yet a well-exercised & monitored program, so we crossed our fingers and, thankfully, it worked on every database, once AWS had capacity to spare.”

At the end of the event, AWS eventually communicated that if we wanted, they could give us all of the EBS disks back, but that all of them might be corrupted. As someone (we can’t recall who) aptly described it: “It’s like getting a gallon of ice cream, but it may have a turd in it”.

More from @danfarina: “Were it up to me at the time, I probably would have moved onto converting it to boto (one of the most mature AWS client bindings at the time, by far) before stopping to deploy the s3cmd near-prototype on every instance, but that would have been a disaster.”

Our applications were resilient because we could leverage multiple dynos and the routing layer available to our system. Databases are a little different, but how could we at scale give the features you most needed for a database:

No data loss
Improved availability

Number one was always a core charter for the team and we would prioritize it over features and over uptime. Uptime mattered, features mattered, but as a data provider if we lost data we’d lose trust. Thus that quick prototype became wal-e, which went on to power many of the future features of Heroku Postgres and was used for many years, though it has since been deemed obsolete in favor of more modern tooling such as pgBackRest. But for its time and place it was some good execution and some luck on timing.

Thinking about the entire experience

As Heroku sat at a central point of app deployment we actively tried to think about the experience end to end. This manifested in some of the small things we actively campaigned for, collaborating with the community members who could make them happen. A few key examples come to mind.

The first was DATABASE_URL. Some of this originated from the 12factor concepts, and some from the fact that having 5 environment variables to define what you were connecting to felt verbose and cumbersome. Why couldn’t Rails just use DATABASE_URL if it were defined? I don’t recall the specifics here, but suspect this was something we nudged @hone02 to help with.
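
As a rough illustration of why a single URL is enough, here is a small Python sketch that splits a DATABASE_URL back into separate settings. The function name and the choice of five fields are assumptions made for this example; it uses only the standard library and is not how Rails or any particular driver implements it.

```python
from urllib.parse import urlparse, unquote


def parse_database_url(url):
    """Split a postgres:// style URL into the separate settings it replaces."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the default Postgres port
        # Credentials may be percent-encoded inside a URL, so decode them.
        "user": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "dbname": parts.path.lstrip("/") or None,
    }
```

In an app you would typically pass it os.environ["DATABASE_URL"], so switching databases means changing one environment variable rather than five.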

The second was around some features of Postgres. At the time Postgres was going well, but most of the core community focused on performance or enterprise-y features. We were coming at it from a different angle with an audience of Rails developers. We were intentional and engaged with some of the Postgres consultancies that employed committers, with a general theme of how can we help contribute, while also advancing Postgres based on the knowledge we have from users. A few highlights here included:

Not just the DATABASE_URL on the Rails side, but also URL-style connection strings natively in libpq. While we didn’t do the work here, we spent notable time advocating and engaging around it.
pg_stat_statements, in my opinion now the most valuable Postgres extension, existed before but was effectively unusable for most applications. Funding this work was foundational to giving Postgres more usable insights natively.
JSON/JSONB collaboration

Of note we later hired several contributors to largely focus their time on upstream Postgres itself.

Dataclips vs. the team

Throughout the history of Heroku Postgres various individuals made a series of bets. First it was @PvH pushing for wal-e to get out the door as an MVP, which was absolutely the right call in retrospect. Perhaps the bet that is most exciting to me, and that people least associate with Heroku, is dataclips. Matt Soldo had this idea of GitHub Gists for your database. But in the early days of Heroku, as at many places, a PM couldn’t simply mandate that engineers try a thing. You had to campaign for it and convince people.

Soldo didn’t build up enough buy-in from the team, but he had done enough awesome things that it was worth letting him run with this. We were all wrong. Dataclips was built by an external consulting company as a separate standalone Heroku app. It was only live for weeks and suddenly it was powering all of our internal dashboards. With a live query you could embed into Google Sheets, suddenly our internal KPI Rails app was replaced by dataclips and a Google Sheet. We didn’t need Looker or Tableau or other fancy BI tools, and this lasted into 100m in revenue for insights into the business. To this day dataclips is one of my favorite features of Heroku, and I look forward to making an experience like that but even better. Thanks, Soldo, for not listening to the rest of us back then.

Names matter

For a long time databases haven’t been known for being user friendly. We wanted to pull from other paradigms as we made key database tenets available to people. We looked heavily to git for inspiration around forking/following. The common database terms were master/slave, but we knew we could do better. We wanted to give the user a sense of what they could do with it. Archiving the WAL every 16 MB or 60 seconds (whichever came first) became continuous protection. A fork was a snapshot as of some point in time. A follower was something that follows a leader node (a read replica). I still recall an hour-long analyst call with RedMonk with @monkchips and @sogrady. It was mostly wind them up and let them go (for the record @monkchips didn’t love fork/follow, though I think he may have come around now).

This started even earlier than my time with being intentional about Postgres vs. PostgreSQL. But I’ve covered that one before, and you can even see it in some of the other lobbying around libpq.

Peacetime vs. Wartime

Things were rolling along well and we were shipping new features. We’d added dataclips, and fork/follow had been in existence for a while. Then we got a note from Amazon that they were launching Postgres support for RDS at the next re:Invent. I was there in person at that re:Invent, and I’d never seen a roaring standing ovation like that at a tech conference before the moment it was announced. In private channels we heard that this was because of us: the excitement and demand for Postgres had become too clear for them to ignore and they had to add support.

We felt vindicated in our choice of Postgres and in what we’d built. But we also knew that now we had competition. When you run a database as a service on top of another infrastructure provider, how can you compete? Well, from some years of experience now I can say there are definitely ways, and I am confident that a sharp, narrow focus allows you to build amazing products, which can be harder to do inside a large corporation. It was at one of our allballs (UTC 00:00:00, when the data team would do happy hour on Fridays) that we were drinking beers and discussing how now we really had to focus. @PvH and @danfarina were discussing me personally as a leader: apparently I’m okay in peacetime when things are good, but in wartime they were willing to bet on me.

Two weeks later I walked into the exec team meeting with a 2-page assessment of what might happen and how we could compete, including 3 potential acquisition targets that could allow us to have a more differentiated offering. Within a few months we made one of those acquisitions, which later became Heroku Connect. It made a lot of sense for many reasons, including that Adam Gross was an angel investor in Heroku, knew it well, and had helped build Salesforce Platform in its early years. That wasn’t the end, but just the beginning of how we could actively compete by offering a more fully managed experience rather than simply being “hosted Postgres”.

Metrics and Monitoring that almost was

The next vision and goal for Heroku Postgres was to continue to give better ease of use and insights into your database. Postgres itself already has a ton of awesome data inside it about how it’s been used. The catalog tables and extensions like pg_stat_statements have a wealth of information, but querying it looks like 200 lines of black magic SQL. @leinweber was perhaps one of the first, and the best on the team, at quickly making something usable for people. The first step on our metrics journey was him making these Postgres insights trivially accessible via pg-extras.

Continuing on the journey, internal foundational systems were built to focus on collecting various metadata from systems and on the ability to notify and communicate. In fact, we just spent 3 months @crunchydata building similar, spiritually aligned systems. I’m blanking on some of the systems, though some were obvious: observatory (I’m not sure if it is still in use or not) would observe databases.

These systems started to house a lot of information that then powered pieces of your Heroku Postgres dashboard. Things like slow queries and high IO or CPU load would give good insights when you logged in. The eventual goal was to connect the dots through proactive notifications. It’s one thing to get an alert from PagerDuty that things are off, log in to a dashboard, and fix it. But what if, at the early signs of things starting to go south, we emailed you that you’ve got an increasing number of sequential scans that are starting to put you at risk of IOPS saturation, and that because we understand Postgres you can add an index and resolve this with a single command? We could even give you an easy button in your email to add the index. All the foundations were in place. If you’ve run on Heroku you’ve gotten the notifications about database maintenance; those were powered entirely by the underlying notification system built with this goal in mind.
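
As a rough sketch of what such a sequential-scan check might look like, the Python below flags large tables where most scans are sequential. The field names match columns in Postgres’s pg_stat_user_tables view (relname, seq_scan, idx_scan, n_live_tup), but the function name and both thresholds are assumptions chosen for illustration, not anything Heroku shipped.

```python
def tables_needing_index(rows, min_rows=10_000, max_seq_ratio=0.5):
    """Return names of large tables where sequential scans make up most scans.

    `rows` are dicts shaped like rows from pg_stat_user_tables.
    """
    flagged = []
    for row in rows:
        # idx_scan may be None when a table has no indexes at all.
        total = row["seq_scan"] + (row["idx_scan"] or 0)
        # Small tables are cheap to scan sequentially, so skip them.
        if total == 0 or row["n_live_tup"] < min_rows:
            continue
        if row["seq_scan"] / total > max_seq_ratio:
            flagged.append(row["relname"])
    return flagged
```

A notification system could run a check like this on a schedule and compare results over time, so that the email goes out when the trend starts rather than after the database is already saturated.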

Postgres can still be better for developers

But we never made it there. Some of us shuffled to different teams, some of us moved on to new challenges, and many folks came on after to continue to run and power a great database service. Some of those original engineers, Daniel Farina and Will Leinweber (along with PvH), understand the design and the why as well as anyone. The goal from early on was that we could do better for developers.

Two years ago when people asked why I came to Crunchy Data I told them I had unfinished business. After the success at Citus tackling scalability problems, where the average customer had 40TB of data, I was attracted to the idea of returning to the vision we had back at the DoD of bringing a better Postgres experience to developers. Despite the rapid growth of successful DBaaS offerings, there was still something missing: that initial idea from DoD that we still wanted to create.

Postgres is an amazing database. It can handle hundreds of thousands of transactions per second, often without batting an eye. It has internal data that you can easily look at to assess and improve performance. It has a rich set of datatypes, indexing, and functionality. The extension ecosystem is vast. But as a developer you don’t have time to become an expert on Postgres.

What if we told you when an N+1 query snuck into your Rails app?
Connections don’t have to be a limitation on Postgres when you have pgBouncer right there.
Have excessive indexes from the early stages of building your app? What if we told you about them and with a button click you could drop them?
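
To illustrate the unused-index idea, here is a small Python sketch that turns index statistics into suggested DROP statements. The schemaname, indexrelname, and idx_scan fields match columns in Postgres’s pg_stat_user_indexes view; the is_unique flag is an assumption for this example (in practice it would come from joining against pg_index), and the function name is hypothetical.

```python
def drop_statements_for_unused(indexes):
    """Build DROP INDEX statements for non-unique indexes that were never scanned.

    `indexes` are dicts shaped like rows from pg_stat_user_indexes, plus an
    assumed `is_unique` flag.
    """
    statements = []
    for ix in indexes:
        # Unique indexes enforce constraints even if no query reads them, so keep them.
        if ix["idx_scan"] == 0 and not ix["is_unique"]:
            name = '"{}"."{}"'.format(ix["schemaname"], ix["indexrelname"])
            # CONCURRENTLY avoids blocking reads and writes on the table during the drop.
            statements.append("DROP INDEX CONCURRENTLY {};".format(name))
    return statements
```

The scan counters only cover the period since statistics were last reset, so a real product would want to watch an index for a while before suggesting the button click.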

One of our @crunchydata customers described it better than I ever could. When working with some of our experts on some deep dives into their database they said “you should take all his Postgres expertise and just bottle it up and send it in email or slack reports to me.” I want that expertise as a product.

My product strategy isn’t to go and change the world of databases. Postgres is a great database with a community that is making it better daily. I want to help make open source Postgres better and give back to it along the way. My product strategy is to distribute deep Postgres expertise in a consumable form to every customer of ours in the coming years. Oh, and we’ll ship some cool things along the way too.