Postgres Database Design Notes

Hold on Cowboy

This blog post is pretty old. Be careful with the information you find in here. The Times They Are A-Changin'

I love projects where I’m learning something new, don’t you? When I get in a project, I want to be learning new things on a regular basis, otherwise I grow restless. Lately, I’ve been on a pretty big project for a client designing a backend system (GraphQL + Node.js + Postgres). I can’t go into too much detail, but we’ve leveraged Postgres as our main data store and in the process we’ve had to level up on our Postgres skills. This post consists of random notes about PG that I didn’t know a few months ago.

String types (varchar versus text)

In other databases, you have to declare varchar(255), if you want a variable string column that has a max of 255 characters. You always have to be careful that you don’t exceed this restriction, otherwise the database will alert you in no uncertain terms (aka, it will blow up). With PG, you can forget all that trash, because under the hood, varchar and text are the same. Yeah, I know, crazy.

Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead. — PG docs

Lesson Learned: Use TEXT, unless you have a specific need to limit the characters

CREATE TABLE people (
    first_name        TEXT NOT NULL,
    last_name         TEXT NOT NULL
);

ENUMs

The debate rages on, “To use ENUMs, or NOT use ENUMs”. When you use ENUMs, data integrity is enforced, this is a good thing. Think about allowing a free form TEXT field like colors. You might get values like this in your database.

Grey
GRAY
gray
grey

If your solution is “the app will lower/upper case the value”, then you are essentially making your app the only consumer of the database. What happens when another process starts using your database? I view this as a very limited database design mindset when you only view the database as being the backend to one type of frontend. You should design your database as a database that exists to serve and protect the data it houses. Too many times I see developers treating their database as dumb data stores, and hold much of the logic in the app that could easily reside in the database. When you go down this route, you may be leveraging your strengths (app development), but losing out on very powerful tools that reside in the database engine. It also requires you to implement logic in multiple places rather than just depend on an intelligent data store.

These are functions your database can easily handle without breaking a sweat

Transforms: lowercase / uppercase values on insert ✅
Atomic Updates: automatically update a timestamp e.g. updated_at ✅
Constraints: For example, if you only want to allow three addresses for a person ✅
Defaults: Values, create them in the database and allow them to be overridden by your app ✅

OK, that was a rabbit trail, back to the main subject ENUMs

Let’s look at a couple of notes about ENUM types. First, we’ll start with this ENUM type.

CREATE TYPE phone_type  as ENUM ('MOBILE', 'WORK', 'HOME')

You can add to an ENUM type, but not remove. Add another enum value with ALTER TYPE item_status ADD VALUE IF NOT EXISTS 'OTHER'. Just know that OTHER is now permanently in that type. It can’t be easily removed.
Altering an ENUM type must be done outside of a transaction. Nuff said.
Get a list of values in an ENUM as rows in a query

SELECT unnest(enum_range(NULL::phone_type));

Use ENUMs to order by. Since ENUMs are really just integers under the hood, you and order by them. This means the order they are declared is important (from least to greatest). So now you can order by ENUM like

SELECT * FROM phones ORDER BY phone_type

SELECT * FROM movies WHERE rating <= PG-13

Setting an `is_default` field

It's a trap

This is part of learning what you really want from your database. We had a many to many table, let’s call it people_addresses for example. We wanted to know the first entry and then set the is_default = true for that record. Sounds simple right?

The first attempt solving this solution was to have the application do a check for any record for a peron’s ID in the people_addresses table, if no record was found, then we set is_default = true when inserting that relation. It worked, but in reality, was subject to race conditions from other applications. It would work fine enough for low volume sites.
One of our talented engineers decided to use a table trigger and automagically set the is_default tag. So this was the trigger that seemed to work during testing

      CREATE OR REPLACE FUNCTION people_address_is_default()
      RETURNS trigger AS $$
      DECLARE
        is_default boolean;
        BEGIN
          is_default := (SELECT NOT EXISTS(SELECT person FROM people_addresses WHERE person = NEW.person));
          NEW.is_default = is_default;
          RETURN NEW;
        END;
      $$ language 'plpgsql';

You’ll notice that the table trigger will check if another record exists and automatically set the is_default value. Pretty slick. The only problem we saw was another race condition (see a pattern). When the client would call the same endpoint with different values. Each insert ran in its own transaction and therefore the is_default was set to true.

Stepping back and looking at the problem, we really didn’t care about an is_default, we mostly cared about detecting the first insert. This can be done with a timestamp. So instead we decided to use a created_at field with a default value of now(). The only drawback now is we have to derive the is_default value by looking at other records and determining if this is the oldest timestamp.
Postgres timestamp precision? During the discussions about using a timestamp for determining the first row inserted, we wondered, “What precision does PG use for storing timestamps? Milliseconds? Nanoseconds?”. The correct answer is a microsecond. Just for reference, 1 millisecond (ms) = 1000 microseconds (μs). We should be pretty safe using microseconds to determine which record was inserted first.
From their own documentation on Date/Time Types
Note: When timestamp values are stored as eight-byte integers (currently the default), microsecond precision is available over the full range of values.

UNION ALL

We found UNION ALL to be a powerful way to combine multiple tables together that had similar types of content into a single view. For argument sake, we didn’t go the route of complete normalization because of some difficultly setting up the data. So we would create a view that looked like this (a poor example, but simple nonetheless).

CREATE VIEW people (
  id
  first_name
  last_name
  date_of_birth
) AS

SELECT id, first_name, last_name, date_of_birth FROM adults
UNION All
SELECT id, first_name, last_name, date_of_birth FROM children
UNION All
SELECT id, first_name, last_name, date_of_birth FROM teachers

This worked well to have a unified table for the app to connect and get the data it needed, but has some drawbacks.

This view is not going to be performant since we just pull several tables together. It would be hard to maintain a good index on this type of structure.
When this view gets large, it’s a pain in the butt to added/remote fields since you need to recreate the entire view again. It’s error prone, and easy to accidentally remove a field.
It’s a short cut to good database design (something that can suffer when moving quickly)

This pattern works well as you are discovering your data requirements, but should moved to more solid structures as your understanding of how data interacts progresses.

Proper handling of IDs and field names

I’m going to briefly touch on IDs, but this subject deserves its own blog post (look for it soon).

Most concrete table types should have an id field. It’s just easier to lookup, select, join, and delete.
NEVER use auto-increment
Did you read that correctly? NEVER use auto-increment. It works fine when you have a small number of tables, and you’re the only one developing on a database. But as soon as you need to create pre-existing seed data; dump and import; or create unique ids across shards; auto-increment will become the bane of your existence. I’m a big fan of UUIDs, a big fan.
Naming your IDs. I think it’s too meta to name your id’s like account_id or person_id. I worked on project once where all the tables had fields named account.account_id and account.account_created_at. Incorporating the table name in the field is super lame.

That’s all for now. Look for new blog posts soon that will dive into performance and that promised field naming post about IDs.