Richard Yen

Disaster Recovery is a Process, Not a Tool (Part 1)

Mon, 15 Jun 2026 08:00:00 +0000

The Landscape Has Changed

When I was at Turnitin, we were still kind of riding the tail end of the dot-com boom. People were rushing to ship things, and brief outages were not exactly good, but they were considered a normal part of running software on the internet. If the site was down for a few minutes, you’d shrug, dig in, and fix it.

That’s not really the world we live in anymore. Uptime is much more sensitive than it used to be. Five nines used to be the stretch goal – now four nines is something a lot of teams just treat as the expectation, and even a few minutes of outage in a month feels like a lot. We don’t really track averages in our metrics anymore, either; we track p99 latencies, because we actually care about that last 1% of users having a good experience.
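To put numbers on those nines, the arithmetic can be run in plain SQL. This is only an illustrative calculation: the 30-day month and the three availability levels are assumptions chosen for the example, not figures from any particular SLA.

```sql
-- Downtime budget per 30-day month for a few availability targets
SELECT label,
       availability,
       round((1 - availability) * 30 * 24 * 60, 2) AS downtime_min_per_30d
FROM (VALUES ('three nines', 0.999),
             ('four nines',  0.9999),
             ('five nines',  0.99999)) AS t(label, availability);
```

Under that assumption, four nines leaves roughly 4.3 minutes of downtime per 30 days, which is why even a few minutes of outage can use up the whole month's budget.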

The other thing that’s changed is how quickly outages get socialized. A noticeable hiccup in your service can end up on social media before your on-call has even finished acknowledging the page. In my experience, the worst situations are the ones where customers find out about an issue before the company does. That has both a financial cost and a reputational cost, and the reputational cost tends to linger long after the incident is resolved. Frequent outages chip away at users’ willingness to keep using your product.

Postgres is, of course, no exception. So that’s the world a Postgres DR plan has to operate in.

What Counts as a Disaster?

When people hear “disaster recovery,” I think the first mental picture is a natural disaster – a flood, an earthquake, a wildfire, or maybe a long utility outage that takes a data center offline. And those are real concerns; we put generators and solar panels and multi-region replication in place partly to deal with exactly that.

But in my experience, most of the disasters that take a Postgres database down don’t look anything like that. They look like:

A performance regression after a failover, where the service is technically “up” but slow enough that customers can’t really use it.
Corruption from a bad migration – something the deployment pipeline didn’t catch, and now half the rows in a table look wrong.
A security incident, where somebody got in and may have tampered with data.
A subtle application bug that writes the wrong values, or reads them back the wrong way, for days before anyone notices.
Replication that quietly broke, or WAL that quietly went missing.
Accidental deletes – the classic missing WHERE clause.

If I had to put a definition on it, I’d probably say something like: a disaster is any sustained event that compromises a system’s availability, correctness, or business trust. Availability is the one that gets the most attention, but the other two are arguably more dangerous, because they tend to be discovered later and resolved with less confidence.

How DR Is Usually Done

If you ask most teams how they do disaster recovery, you’ll usually hear two words – not because they’re wrong, but because they’re the first things that come to mind. Those words are preparation and prevention.

Preparation looks like checklists, backups, monitoring, scenarios to think through, and playbooks of various levels of detail. Prevention looks like alerting, automated remediation, self-healing systems, Patroni, redundancy, load balancing, and so on. Both are good. Benjamin Franklin’s “an ounce of prevention is worth a pound of cure” is on the wall of more than one ops team I’ve worked with, and there’s a reason for that – prevention really is cheaper than recovery on average.

But preparation and prevention only get you so far, and I don’t think they’re really the same thing as recovery. Recovery is what happens after preparation and prevention have already failed to keep the lights on. It’s the act of taking a system that’s already down (or already untrustworthy) and restoring business operations.

That distinction sounds almost too obvious to say out loud, but in my experience it’s the part teams are least ready for. A lot of the customers I worked with at EDB were genuinely well-prepared, with great backups and good monitoring, and they were still unprepared the day they actually had to recover. I’ve been in that seat too, as a DBA – everything was in place on paper, and we still fumbled the first real incident. Recovery is its own skill.

Postgres Already Gives Us Most of the Tools

One nice thing about Postgres is that the toolbox for recovery is already pretty good. Off the top of my head, there’s pg_dump and pg_restore for logical backups, pg_basebackup for physical ones, pg_stat_replication to see what your standbys are doing, pg_stat_activity to see what your sessions are doing, point-in-time recovery for anything more granular than “last night’s backup,” and tools like repmgr and EFM (and pgBackRest, Barman, and others) for orchestration and richer backup workflows.
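As one small example of how much is already built in, a query along these lines shows how far each standby is behind the primary. It is a sketch that assumes PostgreSQL 10 or later and is run on the primary; what counts as too much lag depends on your own recovery targets.

```sql
-- Run on the primary: one row per connected standby
SELECT application_name,
       client_addr,
       state,
       sync_state,
       -- bytes of WAL the standby has not yet replayed
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```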

These tools are not the bottleneck. In nearly every case I worked at EDB, the question wasn’t “do we have the technology to recover?” It was, “do we know when to use it, how to use it, and who gets to make the call?” I had a customer once who had perfectly good backups – they really did – but they opened a P1 ticket asking me to walk them through the keystrokes for the restore. I think they actually knew what to do; they were just afraid, in the moment, of typing the wrong thing. That’s a process gap, not a tool gap, and no amount of additional automation would have fixed it.

I’d add a slightly uncomfortable note here: as a vendor’s support engineer, I was always happy to help, but we probably shouldn’t be the centerpiece of anyone’s DR plan. Support engineers can hand you tools and walk you through documentation, but we don’t know your data the way your team does, and there’s a liability we’re not really supposed to take on. If the first time a team reads the failover documentation is during the outage, a support contract alone isn’t going to close that gap.

RPO and RTO, and Why They’re Negotiations

You can’t really talk about recovery without talking about RPO and RTO, so let me do that briefly.

RPO (Recovery Point Objective) is roughly “how much data are we willing to lose?” Do we restore from last night’s backup and accept losing the day’s writes? Or do we replay WAL and try to get as close as we can to the moment of the outage?

RTO (Recovery Time Objective) is “how long are we allowed to be down before service has to be restored?”

Every choice on either of these axes is a trade-off – against cost, against complexity, against operational burden, against acceptable business loss. And the reality is that during an outage, you really are losing business; transactions don’t happen, shopping carts don’t get checked out, customers get frustrated. At the same time, getting up faster usually means accepting more data loss, or paying more for the infrastructure to avoid it.

It’s helpful to think about RPO in tiers:

A 24-hour RPO is basically “restore last night’s backup.” One or two people can usually handle it, the moving parts are simple, and the data loss can be substantial. That’s fine for some workloads. It’s not really acceptable for high-traffic services where 24 hours of writes is a lot.
A 15-minute RPO generally means WAL archiving or shipping, monitoring to make sure none of that WAL goes missing, regular validation that you can actually restore in 15 minutes, and operational discipline around retention. That’s reasonable for many systems, but probably not acceptable for, say, a financial institution.
A near-zero RPO typically means synchronous replication and tightly managed failover. Now you’re dealing with latency between nodes, distributed-systems complexity, split-brain scenarios, and a much bigger operational footprint.

Lower RPO isn’t just “better.” It’s a design and operational commitment, and that commitment costs money, time, and people’s attention.
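For the 15-minute tier, part of making sure no WAL goes missing can be watching the archiver's own statistics. A minimal sketch, assuming archive_mode is on and the query runs on the primary; the 15-minute threshold simply mirrors the example RPO above.

```sql
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       now() - last_archived_time AS since_last_archive,
       failed_count,
       last_failed_wal,
       last_failed_time,
       -- true when nothing has been archived within the example 15-minute window
       -- (also true if nothing has ever been archived)
       coalesce(now() - last_archived_time > interval '15 minutes', true) AS archive_stale
FROM pg_stat_archiver;
```

On a quiet system without archive_timeout set, long gaps between archived segments can be normal, so a check like this is a starting point for an alert rather than a finished one.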

The same is true of RTO. Driving RTO below five minutes generally requires automation and – this part is important – rehearsal. If you hand someone a document for the first time during an actual outage, they are not going to execute it quickly, no matter how clear the document is.

This is why I think RPO and RTO really need to be negotiated, not just declared. On the surface it’s almost a no-brainer – of course everyone wants an RPO of zero and an RTO of seconds. But when you actually go to leadership and lay out what those numbers cost, you tend to find out pretty quickly where their priorities really sit. In a lot of cases, they’d rather spend that money on something that looks more directly tied to the business – a new feature, a marketing push, another engineer on the product team – and they’re willing to accept a softer RPO or RTO in exchange. That’s a legitimate answer; it just needs to be made explicitly, instead of being assumed one way or the other by the infrastructure team.

Three Layers of DR Planning

When I think about what a DR plan needs to cover, I find it useful to break it into three layers.

The first layer is infrastructure failure. This is the one most teams think of first: a region goes down, storage fails, a corruption bug bites, credentials leak, somebody accidentally deletes a table, replication breaks. Hardware and platform behaving badly.

The second is procedural failure. Even if the infrastructure problem is well-understood, you can still fail recovery because the procedure is wrong. Maybe the sequence values weren’t included in the backup and you didn’t realize. Maybe the runbook references a CNAME nobody can find the host for anymore. We used to have a setup at Turnitin where, on every failover, we had to repoint a CNAME to the new primary, and we eventually realized that nobody had documented which CNAME pointed to which underlying host. Maybe the validation step is vague. Procedural failure tends to be invisible until the moment you actually need the procedure.

The third is human failure. People behave differently under duress. Some panic. Some zone in so hard on one screen that they miss the bigger picture. There are conflicting instructions between managers, between teams, between people trying to be helpful. There’s the 3AM call where the on-call is barely awake and not entirely sure what’s going on. And there’s the person who can’t wait for the process and decides to just do something heroic and fast – which sometimes works, and sometimes makes things significantly worse.

To make the layers concrete: I had a 3AM incident at Turnitin once where we rolled out a change in the evening and got paged a few hours later. The disk had filled, and the filesystem ended up unmounted. That was the infrastructure failure. In the scramble to bring it back, somebody tried to remount it as ext4 instead of xfs – that was strike one, a procedural failure, because the runbook didn’t make the filesystem type explicit. Then we sat for a while waiting on the CTO, because nobody on the bridge had clear authority to call any of the next steps – strike two, no incident commander. And then somebody prematurely brought the web servers back up before the database was really healthy, causing a second round of errors – strike three, the hero move. No single one of those was catastrophic; together they turned a one-hour problem into a much longer night. That’s what the three layers look like in practice.

Recovery Isn’t Always About Failing Over

A lot of DR talks (and a lot of DR vendors) make it sound like “recovery” basically means “fail over to the standby.” That’s one tool in the box, but it’s nowhere near the whole box.

Here’s a story that’s stuck with me. I was on a small team that shipped a release, and the migration looked clean – everything came up, the smoke tests passed, we went home feeling pretty good. Later, somebody noticed that the application code had a small typo in its SQL: an extra apostrophe was getting written into every comment in a comment thread. The data wasn’t lost. The system was up. But the data was wrong, and it kept getting more wrong every minute the application stayed online.

In that particular case, a careful UPDATE across the table was probably the right call, with all the locking and performance impact that implies. But if you change the details a little – say the corruption is medical records, or it isn’t discovered for a few days, or some of those rows have already been read by other systems and propagated outward – a simple UPDATE stops being the answer. Now you’re asking whether you have enough WAL retained to do point-in-time recovery, whether you can safely update some rows in place, when exactly the corruption started, and so on.
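If an in-place repair is the right call, one cautious shape for it is to snapshot the affected rows first and then update only rows that still match the snapshot. Everything named here is hypothetical (the comments table, its id, body, and created_at columns, the release timestamp, and the assumption that the stray apostrophe is the last character), so treat it as a sketch of the pattern rather than the fix that was actually applied.

```sql
BEGIN;

-- 1. Preserve the current state of every suspect row before changing anything.
CREATE TABLE comments_repair_backup AS
SELECT id, body
FROM comments
WHERE created_at >= TIMESTAMP '2026-01-01 00:00:00'  -- hypothetical release time
  AND body LIKE '%''';                               -- hypothetical: trailing apostrophe

-- 2. Repair only rows that are unchanged since the snapshot was taken.
UPDATE comments AS c
SET body = left(c.body, -1)   -- drop the final character
FROM comments_repair_backup AS b
WHERE c.id = b.id
  AND c.body = b.body;

-- 3. Compare the UPDATE row count with the backup table before committing;
--    issue ROLLBACK instead if they disagree.
COMMIT;
```

On a large table this holds row locks for the whole transaction, so batching by id range may be preferable, and the filter would need care to avoid touching comments that legitimately end in an apostrophe.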

I bring it up because that scenario is just as much a disaster, and just as worth planning for, as a disk failing or a region going dark. And it can’t be solved by failing over – the standby would have the bad data too.

While I’m telling stories about quietly-bad situations: another underrated failure mode is “the engineer who knew this part of the system went on vacation,” or quit, or moved teams, and the documentation never quite got updated. Real DR plans have to assume some of that, too.

What Lower Numbers Actually Cost

Negotiating RPO and RTO sounds abstract until you start listing the consequences. Wanting an RPO of zero pushes you toward synchronous replication and forces you to live with the latency that comes with it. Wanting an RTO of under five minutes pushes you toward automation that has to be built, tested, and maintained, and toward rehearsal cadence that has to be on someone’s calendar. Multi-region pushes operational complexity up significantly – you’ve got clusters in different regions talking to each other, you’ve got cross-region replication lag to tolerate, and now your monitoring story has to account for all of it. Even something as innocuous as “we’d like to be able to do point-in-time recovery to any second over the last 30 days” can mean keeping terabytes of WAL around and paying for storage you barely look at.
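To make the storage side of that concrete, a rough projection can be taken from the server's own WAL counters. This sketch assumes PostgreSQL 14 or later (where pg_stat_wal exists) and that the period since the last statistics reset is representative of normal load. Archived WAL is often compressed, and base backups add their own footprint, so the real number will differ.

```sql
SELECT pg_size_pretty(wal_bytes) AS wal_since_reset,
       now() - stats_reset       AS observed_for,
       pg_size_pretty(
         round(wal_bytes
               / extract(epoch FROM now() - stats_reset)  -- average bytes per second
               * 86400 * 30)                              -- seconds per day * 30 days
       ) AS projected_wal_30_days
FROM pg_stat_wal;
```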

None of this is a reason not to do these things. It’s just a reason to have honest conversations about which of them you actually need.

To Be Continued

That covers what I think of as the framing half of the talk: what counts as a disaster, why preparation and prevention aren’t the same as recovery, and how RPO and RTO end up being negotiations rather than declarations.

In two weeks, I’ll get into the part that I think can actually reduce RTO (and that can’t be replaced by AI): runbook engineering, game days, what to measure, and the cultural piece that holds it all together.

PGDay Boston 2026

Wed, 10 Jun 2026 08:00:00 +0000

Introduction

PGDay Boston 2026 was a rewarding reminder of why I value the PostgreSQL community so much. It was delightful to reconnect with familiar faces, meet new people, and finally put some faces to names. One of the best parts of the day was the sense that this community is larger than any one employer or project. It is built on shared curiosity, shared responsibility, and a willingness to help one another learn. I’m honored to have been able to share my own thoughts in my Disaster Recovery talk as well.

The keynote, Michael Stonebraker’s “Where Did Postgres Come From?”, was a standout for me. I especially appreciated the history of Postgres and the years before Postgres, during the Ingres era. It was striking to hear how the project could have ended up as just another academic system, yet instead grew into something enduring because people outside of UC Berkeley took ownership of it and built a broader community around it. That story felt like a good reminder that open source succeeds not only through technical merit, but through stewardship and continuity.

I also enjoyed Brian Brennglass’s talk, “Managing and Observing Locks.” His demos made an intimidating topic much easier to follow, and I found the practical framing especially useful. Shree Vidhya Sampath’s session on leveraging Patroni’s synchronous replication while running PostgreSQL on Kubernetes was another highlight. I appreciated the clear discussion of election behavior, synchronous replication, and failover scenarios, including failure modes I had not experimented with myself.

Robert Haas’ “pg_plan_advice: Plan Stability and User Planner Control for PostgreSQL?” was impressive in its attention to detail, especially the way he tested edge cases that people might not think to check. Bruce Momjian’s “What’s Missing in Postgres?” was also thought-provoking because it framed missing features not as oversights, but often as deliberate choices shaped by the needs of the broader community. Ryan Booz’s “Mastering PostgreSQL Partitioning: Supercharge Performance and Simplify Maintenance” rounded out the day well with a useful refresher on partitioning behavior, tradeoffs, and current workarounds.

Overall, the event benefited me in both personal and professional ways. Professionally, it deepened my understanding of PostgreSQL internals, operational patterns, and ecosystem tooling. Personally, it renewed my appreciation for the community that has grown around Postgres and the care that goes into keeping it healthy. Thank you Tom Kincaid, Ken Rugg, Erik Pohi, Greg Burd, Kanchan Mohitey, Shihao Zhong, and so many others – along with PGUS – who worked hard to make the first PGDay Boston a smashing success! I look forward to staying involved and attending future events in Boston and beyond.

Foreign Tables and Materialized Views: A Dynamic Duo

Mon, 25 May 2026 08:00:00 +0000

Introduction

I recently wrote a post about WAL log shipping and how a standby built on log shipping is a great way to give data analysts production data without putting the primary at risk. Having access to the production data in this way is great, but it’s read-only. How can we create views of this data for better analytics work? I want to make the case today that Foreign Data Wrappers and Materialized Views can make a great solution – not only in accessing production Postgres data, but also working with other data sources.

Moving Beyond FDW Demos

Most people meet foreign data wrappers (FDWs) through a quick demo, and I’ve highlighted some of their features in previous conference talks. There is high novelty in being able to query MySQL from Postgres, but the reality is often that the latency between the local database and the foreign table can be pretty high. Sometimes, predicate push-down isn’t what you’d expect, and indexing may not be very transparent. In the end, setting up and managing FDWs may seem like more work than it’s worth, but writing them off would be a mistake. Used correctly, foreign tables are one of the most practical tools for analytics across heterogeneous data sources – especially when paired with materialized views.

The Real Problem: Heterogeneous Data

Modern data rarely lives in one place:

Legacy systems in MySQL
Operational data in PostgreSQL
Flat files sitting in object storage (I’ve seen people do this with AWS Athena)
Maybe even some CSVs someone refuses to migrate

Foreign tables give you a unified SQL interface, but under the hood, the query performance can be unpredictable as you may be forced to rely on another engine’s query planner (and in the case of that CSV data source, it might not even be indexed).

In other words, FDWs optimize developer experience, not query performance.

The Pattern: FDW + Materialized Views

Instead of querying foreign tables directly in analytics workloads, we can opt to use FDWs as ingestion points, not as the serving layer itself. To achieve this, we can do the following:

Step 1: Define the foreign table

CREATE FOREIGN TABLE ext_orders (
  id bigint,
  customer_id bigint,
  total numeric,
  created_at timestamp
)
SERVER mysql_server
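-- assuming mysql_fdw, which uses dbname and table_name; 'legacy_db' is a placeholder
OPTIONS (dbname 'legacy_db', table_name 'orders');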
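The foreign table definition assumes a server object named mysql_server already exists. A minimal sketch of that prerequisite, assuming the mysql_fdw extension is installed; the host name and credentials are hypothetical placeholders.

```sql
CREATE EXTENSION IF NOT EXISTS mysql_fdw;

-- The remote MySQL instance the foreign tables will read from
CREATE SERVER mysql_server
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host 'mysql.internal.example', port '3306');  -- hypothetical host

-- Credentials used when the current Postgres role queries that server
CREATE USER MAPPING FOR CURRENT_USER
  SERVER mysql_server
  OPTIONS (username 'analytics_ro', password 'change-me');  -- hypothetical credentials
```

A read-only account on the remote side fits the ingestion-point role described here, since the materialized view only ever needs to SELECT.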

Step 2: Build a materialized view

CREATE MATERIALIZED VIEW orders_mv AS
SELECT
  id,
  customer_id,
  total,
  created_at::date AS order_date
FROM ext_orders;

Step 3: Index it like a real table

CREATE INDEX ON orders_mv (order_date);
CREATE INDEX ON orders_mv (customer_id);

Now we’ve turned a slow, remote dataset into a locally optimized analytical structure.

The materialized view lives inside PostgreSQL, supports full indexing, eliminates network latency during queries, and gives predictable performance. We essentially have a read-optimized cache on top of the foreign tables. We can do this with the read-only Postgres replicas as well, to slice up the columns and rows to fit nicely in a view that analysts would want to use.

Refreshing Without Blocking

When it comes to caching, data gets stale, and we’re sort of back at the same problem every ETL pipeline faces. However, Postgres can refresh a materialized view without blocking readers by using REFRESH MATERIALIZED VIEW CONCURRENTLY. The result is production-quality data with a little bit of staleness, and the nice thing is that it’s all built into the Postgres cluster (no separate ETL pipeline to manage, just all the data accessible from one central place). Note, however, that CONCURRENTLY requires at least one UNIQUE index on the materialized view, built on plain columns (no expressions) and covering all rows (no WHERE clause).
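A sketch of that refresh, reusing orders_mv from the steps above and assuming id is unique in the source table; the index name is made up for the example.

```sql
-- One-time setup: the unique index that a concurrent refresh depends on
CREATE UNIQUE INDEX orders_mv_id_key ON orders_mv (id);

-- Re-run on whatever schedule matches the staleness you can tolerate;
-- readers can keep querying orders_mv while this runs
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_mv;
```

The concurrent form generally does more work per refresh than a plain REFRESH, so it is a trade of refresh cost for reader availability.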

Good Applications for the Pairing

The pairing of FDWs and indexed Materialized Views could be very beneficial in a handful of use cases:

1. Poorly Indexed Remote Systems

If your upstream system:

Lacks proper indexes
Is shared with OLTP workloads
Is not under your control

This approach isolates analytics from those constraints.

2. High-Latency Data Sources

Examples:

Cross-region databases
Cloud object storage via FDWs
Athena-backed datasets

Instead of paying the latency cost on every query, you pay it once per refresh.

3. Flat Files and Large Data

Yes, people do this:

Querying CSVs via FDWs
Treating object storage as a “database”
Large JSONB sets that are hard to index well

Final Thoughts

Foreign tables aren’t just novelty – they’re a powerful bridge across messy, real-world data systems.

It helps to keep the two roles distinct: FDWs are the data access layer, while Materialized Views are the analytics engine. If you 1) layer materialized views on top of FDWs, 2) add proper indexing, and 3) refresh intelligently (preferably concurrently), you can get the best of both worlds: flexibility of federated queries and performance of local analytics.

XID Wraparound's Equally-Evil Twin

Mon, 18 May 2026 06:00:00 +0000

Introduction

If you’ve been running PostgreSQL for any length of time, you’ve probably heard about transaction ID (XID) wraparound. It’s one of the most well-known maintenance concerns in Postgres, and there’s no shortage of blog posts, conference talks, and war stories about it. But there’s a quieter, less-discussed cousin that can cause the exact same kind of outage: MultiXact ID wraparound.

I’ve seen this surprise more than a few experienced DBAs. They’ve got their autovacuum tuned, they’re monitoring age(datfrozenxid), and they’re feeling good – and then out of nowhere, Postgres starts refusing certain writes because it’s approaching MultiXact ID wraparound.

The fix is the same as regular XID wraparound – a simple vacuum. But the reason is different, and understanding it can help you keep your monitoring complete.

What’s a MultiXact ID?

In Postgres, every row has a system column called xmax. In the simplest case, xmax holds the transaction ID of the transaction that deleted or updated the row. But what happens when multiple transactions hold locks on the same row at the same time?

Consider SELECT ... FOR SHARE. Multiple transactions can hold a shared lock on the same row concurrently. Postgres needs to record all of those transactions somewhere, but xmax is only wide enough to store a single transaction ID. The solution is the MultiXact mechanism.

A MultiXact ID is essentially a pointer into a separate structure (stored in files under the pg_multixact/ directory) that maps to a list of transaction IDs and their lock modes. When multiple transactions need to lock a row, Postgres:

Allocates a new MultiXact ID
Records the set of transaction IDs (and their lock types) in the MultiXact member data
Stores the MultiXact ID in the row’s xmax field, with a flag (specifically, the HEAP_XMAX_IS_MULTI infomask bit in the tuple header) indicating it’s a multi-xact reference rather than a plain XID

This lets the xmax field stay a fixed 32-bit value while still representing an arbitrary number of concurrent row-level lockers.
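One way to watch this happen is with the pgrowlocks contrib extension. The following is a sketch to run by hand across three sessions; the accounts table is hypothetical and exists only for the demo.

```sql
-- Setup (any session): hypothetical table used only for this demo
CREATE EXTENSION IF NOT EXISTS pgrowlocks;
CREATE TABLE accounts (id int PRIMARY KEY, name text);
INSERT INTO accounts VALUES (1, 'demo');

-- Session 1
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;

-- Session 2 (while session 1 is still open)
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;

-- Session 3: inspect row locks; with two lockers, multi should be true
-- and xids should list both transactions
SELECT locked_row, locker, multi, xids, modes
FROM pgrowlocks('accounts');

-- Finally, COMMIT or ROLLBACK in sessions 1 and 2.
```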

When Are MultiXact IDs Created?

MultiXact IDs come into play in several scenarios:

SELECT ... FOR SHARE – The classic case. Multiple transactions can hold shared row locks simultaneously.
SELECT ... FOR KEY SHARE – Used implicitly by foreign key checks. If you have a parent table with foreign key references, every insert or update on the child table takes a FOR KEY SHARE lock on the referenced parent row. On a busy system with many concurrent inserts referencing the same parent rows, this generates MultiXact IDs rapidly.
Combination locks – If one transaction holds a FOR KEY SHARE lock and another holds a FOR NO KEY UPDATE lock on the same row, the two locks don’t conflict, and the resulting multi-lock is stored as a MultiXact.

The foreign key scenario is particularly noteworthy because it’s invisible to most application developers. You won’t see any queries explicitly calling out FOR SHARE in application code, but Postgres is silently creating MultiXact IDs behind the scenes to manage the implicit locks.

MultiXact IDs Need Freezing Too!

Just like transaction IDs, MultiXact IDs are 32-bit counters. And just like XIDs, they wrap around. Postgres can only “see” about 2 billion MultiXact IDs into the past. If a row still references a MultiXact ID that’s about to fall off the visible horizon, Postgres has a problem: it can no longer determine whether the locks represented by that MultiXact are still relevant.

To prevent this, Postgres needs to freeze MultiXact IDs, just as it freezes regular XIDs. Freezing a MultiXact means replacing the MultiXact reference in the row’s xmax with either the zero value, a single transaction ID, or a newer multixact ID, depending on whether the lock information is still meaningful.

The relevant settings mirror those for XID freezing:

| XID Setting | MultiXact Equivalent |
| --- | --- |
| `vacuum_freeze_min_age` | `vacuum_multixact_freeze_min_age` |
| `vacuum_freeze_table_age` | `vacuum_multixact_freeze_table_age` |
| `autovacuum_freeze_max_age` | `autovacuum_multixact_freeze_max_age` |

When the MultiXact age of a table exceeds autovacuum_multixact_freeze_max_age, autovacuum will trigger an aggressive (whole-table) vacuum specifically to freeze old MultiXact IDs – even if the table has no dead tuples and wouldn’t otherwise qualify for autovacuum.

Don’t Let MultiXact Fly Under the Radar

The query is straightforward:

SELECT datname,
       age(datfrozenxid) AS xid_age,
       mxid_age(datminmxid) AS mxid_age
  FROM pg_database
 ORDER BY mxid_age DESC;

For per-table granularity:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       mxid_age(c.relminmxid) AS mxid_age
  FROM pg_class c
 WHERE c.relkind IN ('r', 't', 'm')
 ORDER BY mxid_age DESC
 LIMIT 20;

Keep an eye on any table where mxid_age is approaching autovacuum_multixact_freeze_max_age (default: 400 million). If it gets close, autovacuum should kick in, but on large tables or systems with constrained autovacuum workers, it may not complete in time.
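To make "approaching" easier to alert on, the per-table query can express MultiXact age as a percentage of the configured threshold. A sketch of that idea; any alerting cutoff applied to the pct_of_threshold column is an assumption you would tune yourself.

```sql
SELECT c.oid::regclass AS table_name,
       mxid_age(c.relminmxid) AS mxid_age,
       -- how far along the table is toward the forced-vacuum threshold
       round(100.0 * mxid_age(c.relminmxid)
             / current_setting('autovacuum_multixact_freeze_max_age')::bigint, 1)
         AS pct_of_threshold
  FROM pg_class c
 WHERE c.relkind IN ('r', 't', 'm')
 ORDER BY mxid_age DESC
 LIMIT 20;
```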

Practical Recommendations

Add MultiXact monitoring alongside XID monitoring. If your alerting triggers at, say, 500 million XID age, add a similar alert for MultiXact age.
Watch your foreign key parent tables. If you have a users or accounts table that’s referenced by every other table in the schema, it’s likely accumulating MultiXact IDs faster than you’d expect.
Consider autovacuum_multixact_freeze_max_age tuning. The default of 400 million is higher than the XID autovacuum_freeze_max_age default of 200 million. But in workloads with heavy foreign key activity, you may want to lower it – or configure per-table autovacuum settings on hot parent tables.
Don’t ignore “unnecessary” vacuums. If you see autovacuum running on a table that has zero dead tuples, don’t assume it’s wasting resources. It may be performing MultiXact freezing work that’s critical for preventing wraparound.
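As a sketch of the per-table tuning mentioned above, assuming a hot parent table named accounts (hypothetical) and an illustrative threshold of 100 million:

```sql
-- Ask autovacuum to start anti-wraparound MultiXact freezing earlier on this table.
-- Per-table values larger than the system-wide setting are ignored.
ALTER TABLE accounts SET (autovacuum_multixact_freeze_max_age = 100000000);

-- Confirm the per-table override.
SELECT relname, reloptions FROM pg_class WHERE relname = 'accounts';

-- Or freeze immediately, for example during a quiet window.
VACUUM (FREEZE, VERBOSE) accounts;
```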

Conclusion

MultiXact ID wraparound is the kind of problem that bites you precisely because you didn’t know to look for it. The mechanism exists for a good reason – efficiently tracking shared row locks is fundamental to Postgres’s concurrency model. But the maintenance burden it creates is real, and it demands the same vigilance as XID wraparound.

If you take one thing away from this post: go check mxid_age(datminmxid) on your databases today. If you’ve never looked at it before, now’s a good time to start.

Making JSONB More Queryable with Generated Columns

Mon, 11 May 2026 06:00:00 +0000

Introduction

Over the past year, I’ve worked in a handful of contexts managing large volumes of data stored as JSONB in PostgreSQL. The scenario is common: users appreciate the flexibility of a document-oriented storage model, avoiding the need to predefine schemas or constantly migrate table structures as their data requirements evolve. JSONB documents can be deeply nested with numerous optional fields, and they scale to hundreds of kilobytes per record without issue. However, when the time comes to query these documents – filtering by user ID, event type, timestamps, or nested action properties – the queries can become slow and/or cumbersome to work with.

The problem I want to address is: “How do we make searching JSONB data more efficient without breaking apart our documents or forcing it into columns in a relational database?” There are several approaches available in Postgres, each with different tradeoffs. I hope to shed some light on those approaches in this article.

The Setup

I created a basic, no-frills table for the sake of this test:

CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    data JSONB NOT NULL
);

Here’s the document shape I used for testing and writing this post – it’s representative of the event logs and audit trails I’ve encountered: a mix of primitive fields, nested objects, and metadata that accumulates over time.

-- Representative JSONB document
{
  "user_id": 5234,
  "event_type": "event_42",
  "timestamp": 1712341200,
  "session_id": "sess_abc123...",
  "ip_address": "192.168.1.42",
  "action": {
    "type": "click",
    "target_id": 87654,
    "coordinates": {"x": 512, "y": 768},
    "duration_ms": 1234
  },
  "device": {
    "type": "mobile",
    "os": "iOS",
    "screen_width": 1920,
    "screen_height": 1080
  },
  "performance": {
    "page_load_time": 1234,
    "dns_lookup": 123,
    "tcp_connection": 234,
    "server_response": 876
  },
  "custom_fields": { ... }
}

The queries that matter are straightforward equality and range filters on known fields: find all events for a given user, filter by event type, narrow to a time window. With this setup, we’ll try to discern which kind of index actually serves the specific access pattern, and what the real cost of each option is.

All tests run on PostgreSQL 18.2 in Docker on an Apple M-series host. Tables contain 50,000 rows with realistic JSONB event documents. Query benchmarks run 20 times on a warm cache and report avg/min/max. Insert benchmarks run 5 trials of 5,000 rows each. Schema and scripts are included throughout so you can reproduce these results.

Three Approaches to Indexing JSONB

There are three realistic options for this access pattern. Let’s look at each in turn – what it costs to build/maintain, what queries it actually helps, and where it falls down.

Option 1: GIN Indexes

The natural candidate for indexing a JSONB column is a GIN (Generalized Inverted Index) index. After all, GIN is designed for composite values such as JSON documents, arrays, and full-text search vectors. With the default operator class, it indexes every key and every value in every document, making the entire structure searchable:

CREATE INDEX idx_gin ON events USING GIN (data);
-- or the path-only variant:
CREATE INDEX idx_gin_path ON events USING GIN (data jsonb_path_ops);

As a refresher, I’ll mention that GIN is designed for containment and key existence operators (@>, ?, ?|, ?&), not for equality on extracted fields:

-- This query uses a GIN index correctly:
SELECT id FROM events WHERE data @> '{"user_id": 5234}';

-- This query does NOT use a GIN index, even if one exists:
SELECT id FROM events WHERE cast(data->>'user_id' AS INT) = 5234;

For the containment form, the GIN index is used and the query is fast – but still slower than a B-tree on the same field, because GIN lookups involve more bookkeeping:

-- GIN jsonb_ops + containment operator
Bitmap Index Scan on idx_gin
  Index Cond: (data @> '{"user_id": 5234}')

Planning Time: 1.173 ms  |  Execution Time: 1.295 ms

-- GIN jsonb_path_ops + containment operator
Bitmap Index Scan on idx_gin_path
  Index Cond: (data @> '{"user_id": 5234}')
Planning Time: 3.342 ms  |  Execution Time: 0.450 ms

The jsonb_path_ops variant is smaller and faster for containment queries, but it trades away support for key-existence operators (?, ?|, ?&). Neither GIN variant can help with range predicates like ts > 1700000000 – those always fall through to a filter step.
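To make the operator limitation concrete, here is a toy inverted index in Python. It is only a mental model, not how GIN is implemented internally; the sample rows and the `build_inverted_index` and `contains` helpers are made up for illustration. Containment turns into an intersection of posting lists, while a range predicate has no single entry to probe:

```python
from collections import defaultdict

# Hypothetical sample rows: row id -> flat JSON-like document
ROWS = {
    1: {"user_id": 5234, "event_type": "event_42", "timestamp": 1712341200},
    2: {"user_id": 5234, "event_type": "event_7", "timestamp": 1690000000},
    3: {"user_id": 99, "event_type": "event_42", "timestamp": 1712349999},
}


def build_inverted_index(rows):
    """Map each (key, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, doc in rows.items():
        for key, value in doc.items():
            index[(key, value)].add(row_id)
    return index


def contains(index, pattern):
    """Toy version of `data @> pattern` for a non-empty, flat pattern."""
    postings = [index.get(item, set()) for item in pattern.items()]
    return set.intersection(*postings)


index = build_inverted_index(ROWS)

# Containment: one posting-list lookup per (key, value) pair, then intersect.
print(sorted(contains(index, {"user_id": 5234})))
print(sorted(contains(index, {"user_id": 5234, "event_type": "event_42"})))

# Range predicate: there is no (key, value) entry for "timestamp > X",
# so every row has to be inspected.
print(sorted(r for r, d in ROWS.items() if d["timestamp"] > 1700000000))
```

The same shape explains why a B-tree on the extracted timestamp helps with ranges: it keeps values in sorted order, which an inverted index over exact pairs does not.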

Option 2: Expression Indexes

Postgres lets you create an index on an expression, including JSONB extraction:

CREATE INDEX idx_user_id ON events (cast(data->>'user_id' AS INT));

This is a B-tree index on the result of evaluating the expression. When the query predicate matches the indexed expression exactly, and after ANALYZE has gathered statistics on it, the planner will use it:

SELECT id FROM events
WHERE cast(data->>'user_id' AS INT) = 5234;

Bitmap Heap Scan on t_expr
  Recheck Cond: ((data ->> 'user_id')::integer = 5234)
  Heap Blocks: exact=3
  ->  Bitmap Index Scan on idx_user_id
        Index Cond: ((data ->> 'user_id')::integer = 5234)
Planning Time: 1.168 ms  |  Execution Time: 0.341 ms

At 0.341 ms, the execution time for this equality lookup is in the same range as the jsonb_path_ops GIN index above (0.450 ms) and well ahead of the default jsonb_ops index (1.295 ms).

Option 3: Generated Columns

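Generated columns (available since PostgreSQL 12) let you extract JSONB values into regular typed columns at write time. With STORED, the values are stored physically alongside the row and kept in sync automatically. PostgreSQL 18 also adds virtual generated columns and makes them the default when neither keyword is given, and virtual columns cannot be indexed in that release, so keep the STORED keyword: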

CREATE TABLE events (
    id         BIGSERIAL PRIMARY KEY,
    data       JSONB NOT NULL,
    user_id    INT    GENERATED ALWAYS AS ((data->>'user_id')::INT)    STORED,
    event_type TEXT   GENERATED ALWAYS AS (data->>'event_type')        STORED,
    ts         BIGINT GENERATED ALWAYS AS ((data->>'timestamp')::BIGINT) STORED,
    action     TEXT   GENERATED ALWAYS AS (data->'action'->>'type')    STORED
);

CREATE INDEX idx_user_id ON events (user_id);
CREATE INDEX idx_event_type ON events (event_type);
CREATE INDEX idx_ts ON events (ts);
CREATE INDEX idx_action ON events (action);

Queries against generated columns are plain typed-column lookups. The planner treats them as regular columns with B-tree indexes and per-column statistics, and produces tight estimates:

SELECT id FROM events WHERE user_id = 5234;

Bitmap Heap Scan on t_gen
  Recheck Cond: (user_id = 5234)
  Heap Blocks: exact=3
  ->  Bitmap Index Scan on idx_user_id
        Index Cond: (user_id = 5234)
Planning Time: 1.159 ms  |  Execution Time: 0.407 ms

You also get native support for range queries and composite indexes at no extra complexity – just combine columns as you normally would:

-- Indexed range query on generated timestamp column
CREATE INDEX ON events (event_type, ts);

SELECT id FROM events
WHERE event_type = 'event_42' AND ts > 1700000000;
-- Execution Time: 0.698 ms (vs 6.6 ms with GIN + post-filter)

Side-by-Side: Query Performance

With all three approaches set up, here are the warm-cache query results averaged over 20 runs for an equality filter on user_id:

Approach	Avg (ms)	Min (ms)	Max (ms)
GIN jsonb_ops + `@>`	0.198	0.101	1.769
GIN jsonb_path_ops + `@>`	0.197	0.032	3.115
Expression index	0.106	0.018	1.705
Generated column B-tree	0.112	0.016	1.839

Expression indexes and generated columns perform very similarly for equality queries, both around 0.1 ms on a warm cache. The real work is done in the B-tree lookup, and both produce the same index structure. GIN with the correct @> operator is close behind in PG 18.2 at about 0.2 ms on average: roughly twice the B-tree figure, but still well under a millisecond. GIN containment lookups also require rechecking candidate rows against the heap, which the B-tree equality lookups avoid, and the tail is widest for jsonb_path_ops: a max of 3.1 ms versus roughly 1.7–1.8 ms for the other three approaches on a warm cache.

The more surprising result is what happens if the GIN index is present but the query is written with extraction-based equality:

-- GIN index exists, but this query gets a seq scan:
SELECT id FROM events WHERE cast(data->>'user_id' AS INT) = 5234;
-- Execution Time: 47.935 ms (same as no index at all)

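The GIN operator classes for jsonb don’t support the = operator on an extracted value, so the planner cannot use the index for this predicate. This is by far the most common confusion teams run into with JSONB indexing.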

The Full Cost Picture: Storage and Writes

Storage

Here’s what the same 50,000 rows cost on disk under each approach:

Approach	Table size	Index size	Total
Expression indexes (4)	18 MB	3.5 MB	21 MB
Generated columns + B-tree (4)	20 MB	3.5 MB	23 MB
GIN jsonb_path_ops	18 MB	13 MB	31 MB
GIN jsonb_ops	18 MB	18 MB	36 MB

Expression indexes and generated column B-tree indexes produce identical index sizes for the same fields – this makes sense, since the index structures are the same; the only extra cost of generated columns is the 2 MB of additional stored column data in the table (~40 bytes per row for four typed columns). GIN indexes are substantially larger: 13–18 MB for a single index vs 3.5 MB for four targeted B-tree indexes. The jsonb_path_ops variant is smaller because it only stores value hashes for the @> operator path, but it still dwarfs the targeted approach.

One caveat: these numbers reflect documents with short keys and compact values. Documents with verbose key names, deeply nested structures, or large string values will inflate GIN indexes proportionally more – because GIN indexes every key path. B-tree and expression indexes are unaffected by document verbosity, since they only store the extracted value.

Write Throughput

Here’s what 5,000 INSERTs per trial, 5 trials each, on a table already containing 50,000 rows looked like:

Approach	Avg (ms)	Min (ms)	Max (ms)
Generated columns + B-tree (4)	157	91	317
Expression indexes (4)	163	93	366
GIN jsonb_path_ops	171	73	408
GIN jsonb_ops	334	225	525

Generated columns and expression indexes are very close in write cost, with generated columns slightly ahead on average (157 ms vs 163 ms). GIN jsonb_path_ops is competitive with both at 171 ms. However, the default GIN jsonb_ops variant is dramatically more expensive: roughly 2× slower than expression indexes and generated columns, because it must decompose the entire document into keys and values and insert entries for each one. The high variance is also worth noting: GIN jsonb_ops max of 525 ms vs 366 ms for expression indexes.

Choosing the Right Approach

The benchmarks above tell a consistent story for workloads dominated by equality and range filters on a known set of fields:

Expression indexes are the lowest-cost migration path. They add no schema structure, require no application changes to insert logic, and impose minimal write overhead. If your team already has a table in production and just needs to speed up a handful of known slow queries, a well-placed expression index is your first move. The catch: every query must exactly match the expression as written in the index definition, which can be fragile to maintain as codebases evolve.
Generated columns take slightly more storage than expression indexes (about 2 MB extra here) and showed comparable write cost in these benchmarks, but they offer something the others can’t: the extracted values become first-class columns. You can build composite indexes across them, reference them in views, expose them via ORMs, and sort or aggregate on them without embedding extraction logic everywhere. For new tables or for tables you’re willing to migrate, they’re the most maintainable long-term solution.
GIN indexes serve a different purpose. They’re the right tool when your query patterns are flexible or unknown – searching for the existence of a key, filtering on any field in an ad-hoc fashion, or supporting containment queries on arbitrarily-shaped documents. For those access patterns, they’re genuinely powerful and there’s no clean B-tree equivalent. But for consistent equality and range filters on known fields, they cost more in storage, impose higher write latency with the default operator class, and only support containment-style operators (@>, not =).

Here’s a rough decision guide:

Situation	Recommended approach
Unknown or ad-hoc field queries	GIN (`@>`, key existence)
Known fields, few queries, no schema change	Expression index
Known fields, high query volume, evolving codebase	Generated columns
Known fields + range queries (e.g., timestamps)	Generated columns + composite B-tree
Mixed: some known fields + some ad-hoc	Generated columns + GIN (both)

Caveats and Considerations

Regardless of which approach you choose, a few things apply broadly:

The real win is making data typed and relational again. Generated columns aren’t magic. The reason they (and expression indexes) outperform GIN for equality filters is that they produce typed scalar values with precise statistics, letting the planner make accurate row-count estimates and choose cheap comparison operations. JSONB is flexible but opaque; once you extract a field into a typed column or expression, Postgres can reason about it properly.

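Expression indexes require matching predicates. The planner only uses the index when the query contains the same expression after parsing. Spelling differences in the cast are fine: cast(data->>'user_id' AS INT) and (data->>'user_id')::int parse to the same expression, which is why the plan above prints the index condition in the :: form. But a different target type, such as (data->>'user_id')::bigint, or a different extraction, such as (data->'user_id')::int, is a different expression, and the index will not be used. Generated columns avoid this fragility – any query that references the column name will benefit.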

Generated column expressions must be immutable. The expression cannot reference functions that depend on time, session state, or anything external. NOW(), CURRENT_USER, and similar functions are off-limits.

Generated columns cannot be directly updated. Their value is always derived from the source column. If you UPDATE the JSONB data, the generated columns recompute automatically.

GIN maintenance overhead compounds on write-heavy tables. GIN indexes build an internal pending list and flush it periodically (controlled by gin_pending_list_limit). Under sustained write load, this flushing can cause the latency spikes visible in the benchmark max values above. B-tree indexes don’t have this mechanism.

These benchmarks cover one dataset shape and one machine. At much larger row counts (hundreds of millions), cache-miss behavior and index bloat will dominate—relative rankings should hold, but absolute numbers will differ. When in doubt, benchmark on your own data before committing to a migration.

Conclusion

For workloads dominated by equality and range filters on a predictable set of JSONB fields, the data is clear: B-tree indexes on typed values – whether via expression indexes or generated columns – match or outperform GIN on read latency, storage, and write throughput. GIN’s strength is flexibility, not speed for known-field access patterns; when you know exactly which fields you’ll filter on, a targeted B-tree is the better fit.

If you’re starting from scratch or are willing to migrate a table, generated columns are the most maintainable path. They make your frequently-queried fields easily accessible, eliminate JSONB extraction logic from your application’s query layer, and support composite indexes and range queries naturally. If you need to add indexing to an existing table without a schema change, expression indexes get you most of the way there, at about half the write cost of a default GIN index in these tests.

GIN still belongs in your toolkit – but for the right job: ad-hoc containment searches, key-existence checks, and cases where the query patterns genuinely vary by document. For everything else, make your JSONB fields relational.

Potential Consequences of Using Postgres as a Job Queue

Mon, 04 May 2026 06:00:00 +0000

This post was originally published on the Microsoft Tech Community Blog.

Introduction

At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call. Fewer moving parts, one less system to manage, ACID guarantees on your jobs. What’s not to love?

The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect. When you’ve got thousands of concurrent workers hammering a jobs table with SELECT ... FOR UPDATE SKIP LOCKED, things start to behave in ways that aren’t obvious from the application layer. CPU usage creeps up, vacuum can’t always keep up, and in the wait event stats you start seeing ominous entries like LWLock:MultiXactOffsetSLRU and LWLock:MultiXactMemberSLRU stacking up across many backends.

This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real. So let’s dig into why this happens, and what the alternatives look like.

The Typical Pattern

When using Postgres as a job queue, the standard approach looks something like this:

CREATE TABLE job_queue (
    id         bigserial PRIMARY KEY,
    status     text NOT NULL DEFAULT 'pending',
    payload    jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    locked_by  text,
    locked_at  timestamptz
);

CREATE INDEX idx_job_queue_status ON job_queue (status) WHERE status = 'pending';

Workers grab jobs with:

UPDATE job_queue
   SET status = 'processing',
       locked_by = 'worker-42',
       locked_at = now()
 WHERE id = (
     SELECT id FROM job_queue
      WHERE status = 'pending'
      ORDER BY created_at
      LIMIT 1
        FOR UPDATE SKIP LOCKED
 )
 RETURNING *;

And then mark them done:

UPDATE job_queue SET status = 'completed' WHERE id = $1;

Some users may DELETE the row entirely. Either way, the lifecycle is: insert, lock-and-update, update-or-delete. Repeated thousands of times per second.

At low concurrency, this works very smoothly. SKIP LOCKED means workers don’t block each other waiting for the same row. Postgres handles the locking, visibility, and ordering. It’s elegant.

So where does it break?

The MultiXact SLRU Problem

When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under pg_multixact/.

With SELECT ... FOR UPDATE SKIP LOCKED, users might think MultiXacts aren’t involved – after all, SKIP LOCKED is supposed to avoid contention. But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip. If you combine this with any FOR SHARE or FOR KEY SHARE locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly.

The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache. When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers. Under high concurrency, this becomes a bottleneck:

 wait_event_type |     wait_event
-----------------+---------------------
 LWLock          | MultiXactMemberSLRU
 LWLock          | MultiXactOffsetSLRU

You’ll see dozens or hundreds of backends piled up on these waits. The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk. Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks.

The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed.
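The "working set exceeds the cache" cliff is easy to reproduce with a toy LRU cache. This is a sketch of the general effect only, not Postgres’s actual SLRU code; the page counts are hypothetical, and a strictly cyclic access pattern is the worst case for LRU:

```python
from collections import OrderedDict


def lru_hit_rate(cache_pages, working_set_pages, lookups):
    """Hit rate of an LRU cache when lookups cycle over a fixed set of pages."""
    cache = OrderedDict()
    hits = 0
    for i in range(lookups):
        page = i % working_set_pages
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / lookups


# Hypothetical sizes, chosen only to show the cliff.
fits = lru_hit_rate(cache_pages=16, working_set_pages=16, lookups=1600)
overflows = lru_hit_rate(cache_pages=16, working_set_pages=17, lookups=1600)
print(f"working set fits:      {fits:.2%}")
print(f"one page too many:     {overflows:.2%}")
```

In this model, going one page over the cache size takes the hit rate from near-perfect to zero, because every lookup evicts the page that will be needed next. Real access patterns are less pathological, but the direction is the same: once the working set no longer fits, most lookups turn into evictions and re-reads.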

Bloat: The Silent Killer

The other side of this coin is table and index bloat. Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap. The old versions stick around until VACUUM cleans them up.

On a busy job queue table:

Dead tuples accumulate faster than autovacuum can clean them. By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared. The table grows and grows.
Index bloat compounds the problem. Every index on the table also accumulates dead entries. The partial index on status = 'pending' gets thrashed especially hard, since rows constantly enter and leave that condition.
Sequential scans get slower. As the table bloats, even index scans start doing more I/O because the heap pages are sparsely populated. Plain VACUUM marks dead space as reusable, but it can only hand space back to the operating system by truncating empty pages at the end of the table; free space in the middle stays allocated.

Job queue tables can grow to tens of gigabytes when the actual “live” data is only a few megabytes. It makes everything slower: scans, vacuum, even pg_dump.

You can mitigate this by running vacuum more aggressively (lower autovacuum_vacuum_scale_factor, higher autovacuum_vacuum_cost_limit), or by partitioning the table and dropping old partitions. But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue.
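To see why the scale factor matters so much on a queue table, it helps to work through the trigger arithmetic. Autovacuum considers a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × (number of tuples); the defaults are 50 and 0.2. The 5-million-row table below is a hypothetical example:

```python
def autovacuum_trigger(reltuples, scale_factor=0.2, base_threshold=50):
    """Dead tuples needed before autovacuum considers vacuuming the table."""
    return base_threshold + scale_factor * reltuples


# Hypothetical queue table holding 5 million rows (live plus not-yet-cleaned).
for scale_factor in (0.2, 0.01):
    trigger = round(autovacuum_trigger(5_000_000, scale_factor))
    print(f"scale_factor={scale_factor}: vacuum after ~{trigger:,} dead tuples")
```

With the default, about a million dead tuples pile up before autovacuum even starts on this table; at 0.01 the trigger drops to roughly fifty thousand. The bigger the bloated table gets, the later the default kicks in, which is part of why queue tables spiral.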

CPU and Lock Overhead

Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation:

Every lock/unlock is a full WAL-logged transaction. Grabbing a job writes WAL. Marking it complete writes WAL. Deleting it writes WAL. On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your wal_writer and checkpoint processes.
SKIP LOCKED still touches rows. The name suggests rows are skipped, but Postgres still has to find them, check their lock status, and move on. With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim. This is wasted CPU.
Snapshot management overhead also becomes an issue. Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself. You might see LWLock:ProcArrayLock waits alongside the MultiXact ones.
Vacuum contention. While vacuum is cleaning up dead tuples, it needs locks too. On a table under constant write pressure, vacuum can interfere with the workers and vice versa. I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term.
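The cost of scanning past locked rows grows faster than the worker count. The sketch below is a deliberately simplified model, not a measurement: it assumes every worker has a job in flight at once and that the locked rows all sit at the head of the scan order.

```python
def rows_examined(workers):
    """Rows touched when `workers` sessions each claim one job in turn.

    Worker k finds k - 1 rows at the head of the queue already locked,
    steps over them, and locks the next one: k rows examined in total.
    """
    return sum(range(1, workers + 1))


for n in (10, 100, 1000):
    total = rows_examined(n)
    print(f"{n:>5} workers: {total:>7} rows examined, {total / n:.1f} per claimed job")
```

Under these assumptions the work per claimed job rises linearly with the number of workers, so total scanning work rises quadratically. That is the "wasted CPU" in the SKIP LOCKED point above.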

Better Alternatives

So what should you use instead? It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table.

Advisory Locks (Staying in Postgres)

If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns. Instead of locking rows, you lock on an abstract numeric key:

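-- Worker picks the oldest pending job, then tries to take an advisory lock on its ID.
-- The subquery applies LIMIT first, so the lock function runs for exactly one row,
-- and the outer query returns the job ID along with whether the lock was acquired.
SELECT id, pg_try_advisory_lock(id) AS acquired
  FROM (
      SELECT id FROM job_queue
       WHERE status = 'pending'
       ORDER BY created_at
       LIMIT 1
  ) AS candidate;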

Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples. They live entirely in shared memory. The trade-off is that you lose the atomicity of FOR UPDATE SKIP LOCKED: you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup).

This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead. But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts.

pgq (Skytools)

pgq is purpose-built for exactly this problem. It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls. Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation.

The key advantages:

No row-level contention. Consumers don’t lock individual rows.
Built-in batch processing. Events are consumed in chunks, reducing transaction overhead.
Efficient cleanup. Old events are rotated out rather than vacuumed row-by-row.

The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.). But for teams already deep in the Postgres ecosystem, it’s a battle-tested option.

PgQue

Coincidentally, while I was writing this post, Nikolay Samokhvalov built PgQue, a derivative of pgq. Like pgq, it sits inside Postgres, but ships as a single SQL file – no C extension and no external daemon – making it deployable on managed services like RDS, Aurora, Cloud SQL, AlloyDB, Supabase, and Neon. Producers INSERT events into rotating event tables (recycled via TRUNCATE instead of row-by-row deletion), and consumers read batches by diffing two pg_snapshot values captured by a periodic ticker – so the hot path contains zero UPDATEs, DELETEs, or SELECT ... FOR UPDATE SKIP LOCKED, and therefore produces no dead tuples on the event tables. For a deeper dive into the algorithm, see Christophe Pettus’s writeup.

Redis

For many teams, Redis is the natural choice for job queues. Using Redis lists (BRPOPLPUSH, or BLMOVE on Redis 6.2 and later) or the Streams API, you get:

Sub-millisecond dispatch latency. No disk I/O, no MVCC, no vacuum.
Atomic pop operations. Workers grab jobs without any locking protocol.
Simple scaling. Redis handles thousands of concurrent consumers trivially.

The trade-off is durability. Redis can persist to disk, but it’s not ACID. If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly). For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well.

Kafka

For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option. Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability. It’s the right tool when:

You need to process thousands of events per second
Multiple consumers need to read the same events
You want event replay or audit trails
Your architecture is already event-driven

The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination. But for teams already running Kafka for other reasons, adding a job queue topic is practically free.

Choosing the Right Tool

Here’s a rough decision guide:

Scenario	Recommendation
Under 100 concurrent workers, simple jobs	Postgres with `SKIP LOCKED` is fine
Moderate concurrency, want to stay in Postgres	Advisory locks or pgq
High throughput, low-latency dispatch	Redis (Lists or Streams)
Massive scale, distributed, event replay	Kafka

Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool. They throw more autovacuum workers at it, increase max_connections, add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency.

Conclusion

Postgres is great, but it can’t be the best tool for every job. Using it as a job queue is a perfectly valid choice when your scale is modest. But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution.

The good news is that you don’t have to rip out everything. Advisory locks can buy you headroom without adding infrastructure. Redis can handle dispatch while Postgres keeps owning the data. And if you’re already using Kafka, a job topic is a natural fit. Take your pick – there are many queueing options out there!

Understanding Bitmap Heap Scans in PostgreSQL

Mon, 27 Apr 2026 08:00:00 +0000

Introduction

When people first start reading PostgreSQL execution plans, they quickly learn a few common scan types: Seq Scan, Index Scan, Index Only Scan. But eventually another one appears that is less obvious: Bitmap Heap Scan, which is almost always accompanied by Bitmap Index Scan.

At first glance, it sounds like two scans on the same table – a very inefficient choice?! But bitmap scans are actually one of the planner’s most practical tools for balancing random I/O vs sequential access. Understanding how they work can make execution plans much easier to interpret, so we’ll dive into that a little bit today.

The Basic Idea

A bitmap scan is a two-step process:

Step 1: Build a bitmap of matching rows using one or more indexes.

Step 2: Visit the heap pages containing those rows referenced in the bitmap.

In an execution plan this usually appears as:

Bitmap Heap Scan on orders
  ->  Bitmap Index Scan on orders_customer_id_idx

The important part is that the index lookup and heap access are separated – this separation lets Postgres collect every match first and then read the heap in an order of its choosing, rather than fetching each row the moment the index finds it.

Why Not Just Use an Index Scan?

With a normal index scan, the query executor does something like this:

Find a matching entry in the index
Jump to the heap page
Fetch the row
Repeat

If the query returns only a few rows, this works well. But if the query returns thousands of rows scattered across the table, the database ends up doing many random heap fetches, sometimes visiting the same page more than once. That random I/O can become expensive, and it is the problem a bitmap scan is designed to solve.

How the Bitmap Is Built

During the Bitmap Index Scan phase, the executor does not immediately fetch rows. Instead it records which heap pages contain matching rows. Conceptually, the structure looks like this:

Page 101 -> rows 2, 7
Page 205 -> rows 1, 3, 8
Page 410 -> row 5

These page references are stored as a bitmap structure in memory. Once the bitmap is complete, the executor can visit heap pages in physical order rather than jumping around randomly. Visiting heap pages in physical order means less random I/O and therefore less latency.
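The two phases can be sketched in a few lines of Python. This is a conceptual model only: a dictionary of sets stands in for Postgres’s in-memory bitmap, and the `build_bitmap` and `heap_visit_order` names are made up for the example.

```python
from collections import defaultdict


def build_bitmap(index_matches):
    """Phase 1: group matching tuple ids (page, offset) by heap page."""
    bitmap = defaultdict(set)
    for page, offset in index_matches:
        bitmap[page].add(offset)
    return bitmap


def heap_visit_order(bitmap):
    """Phase 2: pages in physical order, each with its matching offsets."""
    return [(page, sorted(bitmap[page])) for page in sorted(bitmap)]


# Matches arrive in index order, which is unrelated to heap order.
index_matches = [(410, 5), (101, 7), (205, 3), (101, 2), (205, 8), (205, 1)]

for page, offsets in heap_visit_order(build_bitmap(index_matches)):
    print(f"Page {page} -> rows {offsets}")
```

Six index matches collapse into three page visits, each page read once and in ascending order, which is the same picture as the Page 101 / 205 / 410 listing above.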

Multiple Indexes Can Be Combined

One particularly powerful feature is that bitmap scans allow the query planner to combine multiple indexes. For example:

WHERE status = 'active'
AND created_at >= '2025-01-01'

The plan might look like:

Bitmap Heap Scan
  ->  BitmapAnd
        ->  Bitmap Index Scan on status_idx
        ->  Bitmap Index Scan on created_at_idx

Each index produces a bitmap, and the executor combines them with logical operations, which show up in the plan as BitmapAnd and BitmapOr nodes. This allows the planner to make efficient use of multiple indexes even when a single composite index does not exist.
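Conceptually, combining bitmaps is plain set algebra over tuple ids. In this sketch Python sets of (page, offset) pairs stand in for the real bitmaps, and the match data is invented:

```python
def bitmap_and(a, b):
    """Tuple ids present in both bitmaps (conditions joined with AND)."""
    return a & b


def bitmap_or(a, b):
    """Tuple ids present in either bitmap (conditions joined with OR)."""
    return a | b


# Hypothetical (page, offset) matches from two separate indexes.
status_matches = {(101, 2), (101, 7), (205, 1), (410, 5)}
created_at_matches = {(101, 7), (205, 1), (205, 3)}

print(sorted(bitmap_and(status_matches, created_at_matches)))
print(sorted(bitmap_or(status_matches, created_at_matches)))
```

Only the combined result is used to visit the heap, so rows that satisfy just one of two ANDed conditions are never fetched.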

When Does the Planner Choose Bitmap Scans?

The planner usually prefers bitmap scans in situations where the query returns more rows than a typical index scan, but not enough rows to justify a full sequential scan. In other words, bitmap scans often appear in the middle selectivity range.

Very roughly:

Rows matched	Likely Plan
Very few	Index Scan
A moderate share	Bitmap Heap Scan
Most of the table	Seq Scan

This is not a strict rule, but it helps explain the planner’s reasoning.

Pros and Cons

As with everything in databases, there’s no free lunch. Here are some advantages and disadvantages of bitmap scans:

Advantages of Bitmap Heap Scans
- Reduced Random I/O: By grouping heap page accesses, bitmap scans avoid excessive random disk reads.
- Ability to Combine Indexes: Bitmap operations allow the query planner to use multiple independent indexes efficiently.
- Better Performance for Medium Selectivity: Queries returning thousands of rows often benefit from bitmap access patterns.
- Predictable Heap Access: Because heap pages are visited in order, caching behavior tends to improve.

Disadvantages of Bitmap Heap Scans
- Memory Usage: The bitmap structure is stored in memory, bounded by work_mem. If it would outgrow that limit, the query executor may switch to a lossy bitmap, where only page-level information is stored. This can cause additional filtering work later.
- Two-Phase Execution: Because the bitmap must be built before heap access begins, the query cannot stream rows immediately. This can increase latency for queries expecting early rows.
- Extra CPU Work: Maintaining and combining bitmap structures adds overhead compared to simple index scans.

Lossy Bitmaps

When the bitmap would outgrow its memory budget (work_mem), the query executor may degrade the bitmap representation. Instead of tracking individual tuple offsets, it only records:

Page 205 -> possible matches

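During the heap scan, the executor must then recheck all rows on that page against the query condition. Note that Recheck Cond appears on every Bitmap Heap Scan node, whether or not the bitmap became lossy; the condition is only evaluated for pages or index matches flagged as needing a recheck. To tell whether the bitmap actually went lossy, look at the Heap Blocks line in EXPLAIN ANALYZE output, which reports exact and lossy page counts separately. A lossy bitmap still returns correct results, but the extra rechecking reduces efficiency.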
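The difference between an exact and a lossy entry comes down to how many rows the executor has to look at on a page. In this toy model the page contents and the `customer_id = 42` condition are hypothetical:

```python
# Hypothetical heap page 205: offset -> customer_id
PAGE_205 = {1: 42, 2: 17, 3: 42, 4: 99, 8: 42}


def fetch_exact(page, offsets):
    """Exact bitmap entry: fetch only the listed offsets."""
    return [off for off in sorted(offsets) if off in page]


def fetch_lossy(page, recheck):
    """Lossy entry: only the page is known, so test every row on it."""
    return [off for off, value in sorted(page.items()) if recheck(value)]


exact = fetch_exact(PAGE_205, {1, 3, 8})
lossy = fetch_lossy(PAGE_205, lambda customer_id: customer_id == 42)

print("exact:", exact, "rows examined:", 3)
print("lossy:", lossy, "rows examined:", len(PAGE_205))
```

Both paths return the same rows, which is why a lossy bitmap is still correct. The lossy path simply pays for evaluating the condition on every row of the page instead of only the known matches.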

Final Thoughts

Bitmap heap scans are one of the planner’s most practical optimization tools, as they allow the database to reduce random I/O, combine multiple indexes, and handle medium-sized result sets efficiently.

While they may look complicated at first, the core idea is simple: Find matching rows first, then fetch heap pages efficiently. What a great concept!

The Postgres Performance Triangle

Mon, 20 Apr 2026 08:00:00 +0000

Everyone who’s gone at least knee-deep in photography knows there’s this idea of the exposure triangle: aperture, shutter speed, and ISO. Depending on what you’re going for artistically, you adjust the three parameters, knowing that there are trade-offs in doing so. After working on a few cases, and presenting solutions to customers, I’ve started to think about Postgres performance tuning in a similar way – there are basic parameters that can be tuned, and there are trade-offs for the choices DBAs make:

Memory Allocation
Disk I/O
Concurrency

Each of these (in broad strokes) affects throughput – how much work your system gets done.

Caveat: I know that in the academic sense, “throughput” doesn’t quite capture the balance of these concepts, but please bear with me!

Let’s talk about how each of these three work together with the whole system, and what the trade-offs look like.

Memory Allocation

When you increase memory allocation in Postgres, whether it’s shared_buffers or work_mem, things tend to feel smoother. Most notably, queries spill to disk less often, sorts and joins stay in memory, cache hit rates improve. But there’s a trade-off that’s easy to miss at first, especially with these two parameters. A single complex query can consume multiple chunks of work_mem (see Laetitia’s excellent post about it). Multiply that across concurrent queries, and you begin to see the OS consuming swap space, churning at checkpoints, and even OOM Killer getting invoked. So while more memory can make things faster, it also quietly reduces how much concurrency your system can safely handle.

I’d relate this to aperture – you can throw money at some fast glass, but you also get shallower depth of field (in an annoying way).
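A quick back-of-the-envelope calculation shows how fast work_mem multiplies. This is a rough upper bound under stated assumptions, not a sizing formula: it assumes each active query runs a fixed number of sort or hash nodes that each use their full work_mem, and it ignores shared_buffers, hash_mem_multiplier, and parallel workers. The connection and node counts are hypothetical.

```python
def worst_case_work_mem_mb(connections, nodes_per_query, work_mem_mb):
    """Rough upper bound on memory used by sort/hash nodes across all sessions."""
    return connections * nodes_per_query * work_mem_mb


# Hypothetical: 200 active connections, 4 sort/hash nodes per query.
for work_mem_mb in (4, 64):
    total_mb = worst_case_work_mem_mb(200, 4, work_mem_mb)
    print(f"work_mem={work_mem_mb}MB -> up to {total_mb / 1024:.1f} GiB")
```

Raising work_mem from 4MB to 64MB in this scenario moves the worst case from about 3 GiB to 50 GiB, which is the concurrency trade-off described above in concrete terms.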

Disk I/O

Disk is where things go when memory isn’t enough, or when an access pattern requires it. We see examples of this in sequential scans, random index lookups, and temporary files from sorts or hashes. Lowering work_mem might increase disk I/O due to sorts spilling to temp files, for example. We can try to minimize disk I/O by adding indexes, increasing work_mem, or simply rewriting queries.

Another way we can try to affect disk I/O is to tinker with the planner cost settings (random_page_cost and seq_page_cost, for example), to encourage the query planner to choose one scan method over the other. In any case, our attempts to balance disk I/O and memory usage can be pretty straightforward at first, but could become complicated at scale. That’s where partitioning and read-only replicas come in, but I’m beginning to digress…

Indexes, in particular, are where things start to get interesting. Adding an index can feel like an easy win, as it leads to fewer rows scanned and less CPU work per query, along with less disk activity, but there are trade-offs:

Every INSERT will update every relevant index
Every UPDATE can potentially rewrite index entries
Every DELETE leaves behind cleanup work (vacuum)

At scale, we also see other effects:

Indexes get large
Cache hit rates drop (because there’s more to cache)
Random I/O increases

So an index that helps one query might quietly make others worse, or make writes more expensive.

It’s like raising ISO to compensate for low light. You get the shot, but the noise shows up somewhere else.
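To see which indexes may be costing more than they give back, a rough starting point is to list indexes by how rarely they are scanned and how large they are. This is only a sketch: idx_scan counts since the last statistics reset, and a primary key or unique index still enforces its constraint even if it is never scanned.

```sql
-- Least-used indexes first, largest first among ties.
SELECT
    schemaname,
    relname      AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC
LIMIT 20;
```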

Concurrency

So far, this has all been somewhat per-query. But things change when you introduce concurrency. In a high-demand service, the instinct is to increase max_connections to allow the service to scale up, but in my experience there’s a price to pay for this kind of concurrency. Some people fail to notice that each connection brings its own memory usage, takes up a spot in Postgres’ internal data structures, and puts the system at risk for increased CPU demand and resource contention.

In the photography analogy, you can turn the ISO down very low on a bright and sunny day, but that won’t be enough. Soon, you’ll be closing the aperture and increasing the shutter speed, and then you lose your ability to create the artistic feel that you’re actually trying to go for. So what do photographers do? They use an ND filter to limit how much light hits the sensor.

In Postgres, that “ND filter” is a connection pooler such as PgBouncer. Instead of letting thousands of connections compete for CPU, you cap active queries, allocate more resources to each actual DB session, and trade a bit of latency for stability. Sometimes, to keep your throughput, you need some additional accessories.
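Before deciding how tight that filter should be, it helps to know how many sessions are actually doing work. The sketch below counts client sessions by state next to the configured ceiling; it assumes PostgreSQL 10 or later, where pg_stat_activity has a backend_type column.

```sql
-- Client sessions by state, alongside the configured connection limit.
SELECT
    state,
    count(*)                                AS sessions,
    current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY sessions DESC;
```

If most sessions are idle at any given moment, a pooler can usually serve the same workload with far fewer server connections.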

The Art of Postgres

As a DBA, you can calculate optimal index usage, memory sizing, and expected I/O patterns, but those calculations tend to assume a steady state. Every DBA knows that real production systems are always changing, due to traffic patterns, scaling, and new features getting rolled out on the application side. As the organization changes, keeping the database performant depends on the DBA being both a Database Administrator and a Database Artist, working with internal teams to know which indexes to add or drop, how much concurrency to allow, and how to allocate memory without running out of it.

Instead of asking, “What’s the optimal configuration?” it might be more useful to ask these questions:

Where is my system currently paying the cost – memory, disk, or CPU?
If I relieve pressure here, where does it move?
How much can we tolerate that new pressure?

Costs don’t disappear – they just shift – and it’s the DBA’s job to help decision-makers decide where to shift them.

Conclusion

There’s more to photography than exposure – there’s composition, color-correction, external lighting, and so much more. In the same way, this discussion has just been one part of database administration. There’s so much more to go over, in terms of creating a robust and scalable database. I wanted to highlight this topic because I do find that some users tend to approach database architecture without considering all the trade-offs. We can definitely get the database to perform well, but there’s no one-size-fits-all solution for every situation. It takes thought, planning, testing, and discussion with stakeholders to come up with a good solution.

Understanding PostgreSQL Wait Events

Mon, 13 Apr 2026 08:00:00 +0000

Introduction

One of the most useful debugging tools in modern PostgreSQL is the wait event system. When a query slows down or a database becomes CPU bound, a natural question is: “What are sessions actually waiting on?” Postgres exposes this information through the pg_stat_activity view via two columns:

wait_event_type
wait_event

These fields reveal what the backend process is blocked on at a given moment. Among the different wait types, one category tends to cause confusion:

LWLock

If you’ve ever seen dashboards full of LWLock waits, you’re not alone in wondering what they mean and whether they’re a problem.

Where Wait Events Appear

The easiest way to see wait events is:

SELECT pid,
       wait_event_type,
       wait_event,
       state,
       query
FROM pg_stat_activity
WHERE state != 'idle';

Example output might look like:

pid	wait_event_type	wait_event	state
1234	Lock	transactionid	active
5678	LWLock	buffer_content	active
9012	IO	DataFileRead	active

Each category represents a different kind of wait. Common types include:

Lock
LWLock
IO
Client
IPC
Activity

Among these, LWLock waits often appear during performance incidents.

What Is an LWLock?

LWLock stands for Lightweight Lock. These are internal Postgres synchronization primitives used to coordinate access to shared memory structures. Note that they are NOT related to lock contention on tables, or deadlocking when performing DML. LWLocks protect important internal structures such as:

shared buffers
WAL buffers
lock tables
SLRU caches

Because these structures are accessed by many processes simultaneously, Postgres must coordinate access carefully.

Why LWLock Waits Appear

In healthy systems, LWLocks are acquired and released very quickly. However, they can become visible when:

contention increases
many sessions access the same internal structure
CPU saturation occurs
shared memory structures become hot spots

Seeing LWLock waits in pg_stat_activity doesn’t automatically mean something is wrong. But persistent LWLock contention usually indicates a scaling issue somewhere in the workload.

Common LWLock Wait Events

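A few LWLock events appear frequently during real-world incidents, and understanding them can help narrow down the root cause. The names below mostly follow the spelling used in PostgreSQL 12 and earlier; version 13 renamed many wait events, so on newer servers you’ll typically see them without the Lock suffix (for example BufferContent, WALWrite, WALInsert, ProcArray, and XactSLRU).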

buffer_content

wait_event_type = LWLock
wait_event = buffer_content

This occurs when Postgres processes compete to access a shared buffer page.

Typical causes include:

many concurrent updates to the same rows
heavy index modifications
hot tables receiving high write volume

If you see these waits, try these troubleshooting steps:

check for write-heavy workloads
inspect tables experiencing frequent updates
look for missing indexes causing excessive page access
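A quick way to start on those steps is to rank tables by write activity. This is a minimal sketch using the cumulative table statistics; the counters run since the last statistics reset, so compare two samples if you need a rate.

```sql
-- Tables with the most updates and deletes since the last stats reset.
SELECT
    schemaname,
    relname,
    n_tup_ins,
    n_tup_upd,
    n_tup_hot_upd,
    n_tup_del,
    n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_tup_upd + n_tup_del DESC
LIMIT 10;
```

A table with a high n_tup_upd and a comparatively low n_tup_hot_upd is also touching its indexes on most updates, which lines up with the heavy index modifications cause above.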

WALWriteLock

wait_event = WALWriteLock

This indicates contention while writing to the Write-Ahead Log (WAL).

Common causes:

high write throughput
large batch inserts or updates
slow storage affecting WAL flushes

Possible diagnostic steps:

examine WAL generation rate
check disk latency
review bulk write workloads

In some systems this appears as commit latency spikes.
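For the WAL generation rate, one option (assuming PostgreSQL 14 or later, where the pg_stat_wal view exists) is to sample the cumulative WAL counters twice and compare them. A rough sketch:

```sql
-- Cumulative WAL activity since stats_reset; sample twice to get a rate.
SELECT
    wal_records,
    wal_fpi,                                 -- full-page images written
    pg_size_pretty(wal_bytes) AS wal_written,
    wal_buffers_full,                        -- writes forced by full WAL buffers
    stats_reset
FROM pg_stat_wal;
```

On older versions, calling pg_current_wal_lsn() at two points in time and feeding both values to pg_wal_lsn_diff() gives a similar byte count.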

WALInsertLock

wait_event = WALInsertLock

This occurs when multiple sessions attempt to insert WAL records simultaneously. It usually appears when:

many concurrent transactions are committing
high insert/update workloads exist
transaction throughput is extremely high

Postgres has reduced contention here over time; since version 9.4, multiple backends can insert WAL records concurrently through a set of WAL insertion locks. Still, very high write concurrency can trigger it.

ProcArrayLock

wait_event = ProcArrayLock

This lock protects Postgres’ internal structure tracking active transactions. It is often associated with:

snapshot creation
visibility checks
large numbers of active connections

Possible causes include:

very high connection counts
long-running transactions
frequent snapshot creation

Connection pooling (and lowering max_connections) often helps reduce this type of contention.
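To check for the long-running transactions mentioned above, a simple sketch is to list the oldest open transactions. The 60-character truncation is an arbitrary choice to keep the output readable.

```sql
-- Oldest open transactions, excluding this session.
SELECT
    pid,
    usename,
    state,
    now() - xact_start AS xact_age,
    left(query, 60)    AS query_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
ORDER BY xact_start
LIMIT 10;
```

Sessions sitting in the idle in transaction state with a large xact_age are usually the first ones worth asking about.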

CLogControlLock / SLRU Locks

wait_event = CLogControlLock

These involve the SLRU (Simple Least Recently Used) caches. The one behind CLogControlLock tracks transaction commit status. Heavy contention here can appear when:

extremely high transaction rates exist
frequent visibility checks occur
many short transactions are executed

Diagnosing LWLock Problems

When investigating LWLock waits, a few steps usually help.

1. Look for dominant wait events

Start by identifying which LWLock appears most frequently:

SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
GROUP BY wait_event
ORDER BY count(*) DESC;

2. Examine workload characteristics

Questions to ask:

Are there many concurrent writers?
Is a single table receiving heavy updates?
Are there extremely high transaction rates?

3. Check connection counts

Large numbers of connections can amplify contention. Connection pooling often reduces LWLock pressure significantly.
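To see where the connections are coming from, one possible breakdown is by user and application name. This sketch assumes PostgreSQL 10 or later for the backend_type column, and it relies on applications setting application_name, which not all of them do.

```sql
-- Connections per user and application, with how many are active right now.
SELECT
    usename,
    application_name,
    count(*)                                 AS connections,
    count(*) FILTER (WHERE state = 'active') AS active
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY usename, application_name
ORDER BY connections DESC;
```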

4. Look at query patterns

High-frequency queries touching the same rows or pages can create hotspots.

Final Thoughts

PostgreSQL’s wait event system provides valuable insight into what the database is doing internally. LWLocks, in particular, reveal contention inside shared memory structures that are otherwise invisible. When investigating performance issues, a good rule of thumb is: If many sessions are waiting on the same LWLock, there is usually a workload hotspot somewhere. Once you know where the contention lives, the path toward fixing it becomes much clearer.

WAL as a Data Distribution Layer

Mon, 06 Apr 2026 08:00:00 +0000

Introduction

Every so often, I talk to someone working in data analytics who wants access to production data, or at least a snapshot of it. Sometimes, they tell me about their ETL setup, which takes hours to refresh and can be brittle, with a lot of monitoring around it. For them, it works, but it sometimes gets me wondering if they need all that plumbing to get a snapshot of their live dataset. Back at Turnitin, I set up a way to get people access to production data without having to snapshot nightly, and I thought maybe I should share it with people here.

Common Implementations and Their Risks

Here are the typical solutions we might encounter when giving people a little bit of access to production data:

1. Query the primary

This is generally a bad idea, since you don’t want users getting access to the production primary, lest they make a mistake or lock up tables and prevent customers from using your apps. Even with a read-only user, large data analytics queries could cause unwanted interference that negatively affects your uptime. This is almost certainly not the way to go.

2. Query a streaming replica

This is better, but doing this is not free. Long-running queries can create replay lag, vacuum conflicts can cancel queries, and I/O contention can affect the primary upstream. It’s safer since users are forced to be read-only, but that still carries risk.

3. Nightly snapshots / rebuilds

Time-based snapshots and rebuilds are the most common way of getting data out to analysts. ETL queries run at night (or at some other regular interval) and provide the information needed to do the necessary work. This works, but it is another piece of software to maintain, and the data it produces is always somewhat stale; whether that is acceptable depends on how much staleness can be tolerated.

Once Upon a Time, Before Streaming Replication

If you’ve spent any time in Postgres, you already understand streaming replication. Primary sends WAL to standby, and standby replays the WAL stream. All the tutorials talk about using pg_basebackup, setting hot_standby and standby.signal and configuring primary_conninfo.

However, many people don’t know that before streaming replication, there was log shipping. Introduced in v. 8.2, it was the predecessor to what eventually became hot standby/streaming replication in v. 9.0. Instead of maintaining a live connection between primary and standby, the two clusters are decoupled. WAL files are shipped (via scp or rsync or some other mechanism – maybe even NFS) to the replica, and then replayed there.

Log Shipping Hits a Different Point on the Tradeoff Curve

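With WAL log shipping, the standby never connects to the primary and the primary never tracks the standby, so there is no backpressure on the primary (no replication slot holding back WAL, no need for hot_standby_feedback). Queries on the standby can still conflict with WAL replay, but that is tuned entirely on the standby side with max_standby_archive_delay, which controls how long replay waits before cancelling a conflicting query.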

While you may not get up-to-the-millisecond replication lag, you get pretty close to real-time data. In some cases, this lag may even be desirable – you could throttle the playback so you are an hour behind, giving yourself a window to look at a table’s state from before someone fat-fingered an UPDATE without a WHERE clause.
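As a sketch of that throttling, the standby-side setting recovery_min_apply_delay holds back replay of each commit until the given time has passed. The one-hour value is just the example from above, and this assumes PostgreSQL 12 or later, where recovery settings are ordinary configuration parameters.

```sql
-- Run on the standby. ASSUMPTION: a one-hour delay suits your use case.
ALTER SYSTEM SET recovery_min_apply_delay = '1h';
SELECT pg_reload_conf();

-- How old is the most recently replayed commit?
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

Keep in mind that replay_delay also grows when the primary is simply idle, since it measures the age of the last replayed commit rather than true lag.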

A Subtle but Important Detail

Postgres doesn’t force you to choose one mechanism over the other. A standby can use both primary_conninfo AND restore_command. The way it works is that it will toggle between the two, depending on availability. If the primary is disconnected for some reason, it will switch over to restore_command until it cannot find the WAL file it wants, and then it flips back to primary_conninfo again.

Log shipping isn’t just a legacy mode; it’s part of the replication continuum. It’s like incremental backup, except that your backup is always fully loaded and can be queried. For these reasons, keeping your WAL files around is a very good practice.

Architecture Pattern: Introduce a WAL Hub

Instead of thinking in terms of replication happening between a primary and a number of standbys, it may be useful to think about a central WAL archive host, even if it’s an S3 bucket, so that many consumers can access data at any point in time.

These consumers can be analytics standbys, QA environments, or ad-hoc data sandboxes – or whatever else you want to give a copy of near-realtime production data to, without risking replication backpressure or compromising network security.

A Hands-On Approach

I created a simple demo that sets this up end-to-end. It sets up 3 containers in Docker – a primary, standby, and a mock WAL archive location. Disclaimer: yes, I used AI to help me generate the scripts, but it’s exactly how I had it set up at Turnitin (yes, we used rsyncd back in 2009 – there might be better stuff out there these days).

Some key configuration params for clarity:

archive_command pushes WAL files to a directory
restore_command pulls WAL files on the standby
standby.signal enables continuous recovery
hot_standby=on allows read-only queries
archive_mode=on enables archiving on the primary (archive_command is ignored without it)
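The parameters above could be set roughly like this. This is a minimal sketch and not the demo’s actual scripts: /mnt/wal_archive is a hypothetical directory that both hosts can reach, and plain cp stands in for whatever transport (rsync, scp, an S3 client) you really use. It assumes PostgreSQL 12 or later; the same settings can equally go into postgresql.conf before the standby is first started.

```sql
-- On the primary. archive_mode only takes effect after a restart.
-- ASSUMPTION: /mnt/wal_archive is a hypothetical shared archive directory.
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command =
    'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';

-- On the standby, which also needs an empty standby.signal file
-- in its data directory to stay in continuous recovery.
ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f %p';
ALTER SYSTEM SET hot_standby = 'on';
```

The test ! -f guard keeps the archive command from overwriting a WAL file that is already in the archive.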

Note some characteristics of the standby in this example:

No primary_conninfo
No replication slots used
No entries in pg_stat_replication show up on the primary.
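Those characteristics can be checked from both sides with a few catalog queries. This is a sketch under the assumption that the standby has only ever been fed from the archive; if it has streamed at any point since it started, the receive LSN will not be NULL.

```sql
-- On the primary: no streaming standbys, but the archiver is working.
SELECT count(*) AS streaming_standbys FROM pg_stat_replication;

SELECT archived_count, last_archived_wal, last_archived_time, failed_count
FROM pg_stat_archiver;

-- On the standby: in recovery and replaying, with nothing received by streaming.
SELECT
    pg_is_in_recovery()             AS in_recovery,
    pg_last_wal_receive_lsn()       AS received_by_streaming,  -- NULL if never streamed
    pg_last_wal_replay_lsn()        AS replayed,
    pg_last_xact_replay_timestamp() AS last_replayed_commit;
```

A rising failed_count in pg_stat_archiver is the thing to alert on, because every consumer downstream of the archive stalls when archiving does.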

If you want, you can set up traditional streaming replication in parallel to this log shipping standby – it doesn’t interfere with the log shipping so long as WAL files get to the archive location.

Why This Pattern Deserves More Attention

Most teams default to streaming replication because it’s the most visible feature.

But Postgres replication isn’t one thing; it’s a set of primitives:

WAL generation
WAL transport
WAL replay

Streaming replication couples all three, and log shipping lets you separate them. And once you do that, new architectures open up!