<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Richard Yen</title>
    <description></description>
    <link>http://richyen.com/</link>
    <atom:link href="http://richyen.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 15 Jun 2026 19:19:18 +0000</pubDate>
    <lastBuildDate>Mon, 15 Jun 2026 19:19:18 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Disaster Recovery is a Process, Not a Tool (Part 1)</title>
        <description>&lt;h2 id=&quot;the-landscape-has-changed&quot;&gt;The Landscape Has Changed&lt;/h2&gt;

&lt;p&gt;When I was at Turnitin, we were still kind of riding the tail end of the dot-com boom.  People were rushing to ship things, and brief outages were not exactly &lt;em&gt;good&lt;/em&gt;, but they were considered a normal part of running software on the internet.  If the site was down for a few minutes, you’d shrug, dig in, and fix it.&lt;/p&gt;

&lt;p&gt;That’s not really the world we live in anymore.  Uptime is much more sensitive than it used to be.  Five nines used to be the stretch goal – now four nines is something a lot of teams just treat as the expectation, and even a few minutes of outage in a month feels like a lot.  We don’t really track averages in our metrics anymore, either; we track p99 latencies, because we actually care about that last 1% of users having a good experience.&lt;/p&gt;
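
&lt;p&gt;To put numbers on the nines, here is a small sketch of the downtime budget each target allows.  It assumes a 30-day month, and the arithmetic is simply the unavailable fraction times the minutes in the period:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Downtime budget per 30-day month for three common availability targets.
SELECT availability,
       round((1 - availability) * 30 * 24 * 60, 2) AS minutes_down_per_30_days
FROM (VALUES (0.999), (0.9999), (0.99999)) AS t(availability);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Four nines works out to roughly 4.3 minutes a month, and five nines to well under a minute, which is why “a few minutes” already feels like a lot.&lt;/p&gt;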

&lt;p&gt;The other thing that’s changed is how quickly outages get socialized.  A noticeable hiccup in your service can end up on social media before your on-call has even finished acknowledging the page.  In my experience, the worst situations are the ones where customers find out about an issue before the company does.  That has both a financial cost and a reputational cost, and the reputational cost tends to linger long after the incident is resolved.  Frequent outages chip away at users’ willingness to keep using your product.&lt;/p&gt;

&lt;p&gt;Postgres is, of course, no exception.  So that’s the world a Postgres DR plan has to operate in.&lt;/p&gt;

&lt;h2 id=&quot;what-counts-as-a-disaster&quot;&gt;What Counts as a Disaster?&lt;/h2&gt;

&lt;p&gt;When people hear “disaster recovery,” I think the first picture that comes to mind is a natural disaster – a flood, an earthquake, a wildfire, or maybe a long utility outage that takes a data center offline.  And those are real concerns; we put generators and solar panels and multi-region replication in place partly to deal with exactly that.&lt;/p&gt;

&lt;p&gt;But in my experience, most of the disasters that take a Postgres database down don’t look anything like that.  They look like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A performance regression after a failover, where the service is technically “up” but slow enough that customers can’t really use it.&lt;/li&gt;
  &lt;li&gt;Corruption from a bad migration – something the deployment pipeline didn’t catch, and now half the rows in a table look wrong.&lt;/li&gt;
  &lt;li&gt;A security incident, where somebody got in and may have tampered with data.&lt;/li&gt;
  &lt;li&gt;A subtle application bug that writes the wrong values, or reads them back the wrong way, for days before anyone notices.&lt;/li&gt;
  &lt;li&gt;Replication that quietly broke, or WAL that quietly went missing.&lt;/li&gt;
  &lt;li&gt;Accidental deletes – the classic missing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I had to put a definition on it, I’d probably say something like: a disaster is any sustained event that compromises a system’s availability, correctness, or business trust.  Availability is the one that gets the most attention, but the other two are arguably more dangerous, because they tend to be discovered later and resolved with less confidence.&lt;/p&gt;

&lt;h2 id=&quot;how-dr-is-usually-done&quot;&gt;How DR Is Usually Done&lt;/h2&gt;

&lt;p&gt;If you ask most teams how they do disaster recovery, you’ll usually hear two words – not because they’re wrong, but because they’re the first things that come to mind.  Those words are &lt;strong&gt;preparation&lt;/strong&gt; and &lt;strong&gt;prevention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Preparation looks like checklists, backups, monitoring, scenarios to think through, and playbooks of various levels of detail.  Prevention looks like alerting, automated remediation, self-healing systems, Patroni, redundancy, load balancing, and so on.  Both are good.  Benjamin Franklin’s “an ounce of prevention is worth a pound of cure” is on the wall of more than one ops team I’ve worked with, and there’s a reason for that – prevention really is cheaper than recovery on average.&lt;/p&gt;

&lt;p&gt;But preparation and prevention only get you so far, and I don’t think they’re really the same thing as recovery.  Recovery is what happens after preparation and prevention have already failed to keep the lights on.  It’s the act of taking a system that’s already down (or already untrustworthy) and restoring business operations.&lt;/p&gt;

&lt;p&gt;That distinction sounds almost too obvious to say out loud, but in my experience it’s the part teams are least ready for.  A lot of the customers I worked with at EDB were genuinely well-prepared, with great backups and good monitoring, and they were &lt;em&gt;still&lt;/em&gt; unprepared the day they actually had to recover.  I’ve been in that seat too, as a DBA – everything was in place on paper, and we still fumbled the first real incident.  Recovery is its own skill.&lt;/p&gt;

&lt;h2 id=&quot;postgres-already-gives-us-most-of-the-tools&quot;&gt;Postgres Already Gives Us Most of the Tools&lt;/h2&gt;

&lt;p&gt;One nice thing about Postgres is that the toolbox for recovery is already pretty good.  Off the top of my head, there’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_restore&lt;/code&gt; for logical backups, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_basebackup&lt;/code&gt; for physical ones, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_replication&lt;/code&gt; to see what your standbys are doing, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; to see what your sessions are doing, point-in-time recovery for anything more granular than “last night’s backup,” and tools like &lt;a href=&quot;https://www.repmgr.org/&quot;&gt;repmgr&lt;/a&gt; and &lt;a href=&quot;https://www.enterprisedb.com/docs/efm/latest/&quot;&gt;EFM&lt;/a&gt; (and pgBackRest, Barman, and others) for orchestration and richer backup workflows.&lt;/p&gt;
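
&lt;p&gt;As a quick illustration of the “see what your standbys are doing” part, here is a minimal sketch of a lag check, assuming PostgreSQL 10 or later and that it is run on the primary:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- One row per connected standby.
SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;An empty result when you expect standbys is itself a finding: it is one way replication “quietly breaks.”&lt;/p&gt;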

&lt;p&gt;These tools are not the bottleneck.  In nearly every case I worked at EDB, the question wasn’t “do we have the technology to recover?”  It was, “do we know &lt;em&gt;when&lt;/em&gt; to use it, &lt;em&gt;how&lt;/em&gt; to use it, and &lt;em&gt;who&lt;/em&gt; gets to make the call?”  I had a customer once who had perfectly good backups – they really did – but they opened a P1 ticket asking me to walk them through the keystrokes for the restore.  I think they actually knew what to do; they were just afraid, in the moment, of typing the wrong thing.  That’s a process gap, not a tool gap, and no amount of additional automation would have fixed it.&lt;/p&gt;

&lt;p&gt;I’d add a slightly uncomfortable note here: as a vendor’s support engineer, I was always happy to help, but we probably shouldn’t be the centerpiece of anyone’s DR plan.  Support engineers can hand you tools and walk you through documentation, but we don’t know your data the way your team does, and there’s a liability we’re not really supposed to take on.  If the first time a team reads the failover documentation is during the outage, a support contract alone isn’t going to close that gap.&lt;/p&gt;

&lt;h2 id=&quot;rpo-and-rto-and-why-theyre-negotiations&quot;&gt;RPO and RTO, and Why They’re Negotiations&lt;/h2&gt;

&lt;p&gt;You can’t really talk about recovery without talking about RPO and RTO, so let me do that briefly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RPO&lt;/strong&gt; (Recovery Point Objective) is roughly “how much data are we willing to lose?”  Do we restore from last night’s backup and accept losing the day’s writes?  Or do we replay WAL and try to get as close as we can to the moment of the outage?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTO&lt;/strong&gt; (Recovery Time Objective) is “how long can we be down before service has to be restored?”&lt;/p&gt;

&lt;p&gt;Every choice on either of these axes is a trade-off – against cost, against complexity, against operational burden, against acceptable business loss.  And the reality is that during an outage, you really are losing business; transactions don’t happen, shopping carts don’t get checked out, customers get frustrated.  At the same time, getting back up faster usually means accepting more data loss, or paying more for the infrastructure to avoid it.&lt;/p&gt;

&lt;p&gt;It’s helpful to think about RPO in tiers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A &lt;strong&gt;24-hour RPO&lt;/strong&gt; is basically “restore last night’s backup.”  One or two people can usually handle it, the moving parts are simple, and the data loss can be substantial.  That’s fine for some workloads.  It’s not really acceptable for high-traffic services where 24 hours of writes is a lot.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;15-minute RPO&lt;/strong&gt; generally means WAL archiving or shipping, monitoring to make sure none of that WAL goes missing, regular validation that you can actually restore in 15 minutes, and operational discipline around retention.  That’s reasonable for many systems, but probably not acceptable for, say, a financial institution.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;near-zero RPO&lt;/strong&gt; typically means synchronous replication and tightly managed failover.  Now you’re dealing with latency between nodes, distributed-systems complexity, split-brain scenarios, and a much bigger operational footprint.&lt;/li&gt;
&lt;/ul&gt;
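
&lt;p&gt;For the 15-minute tier, the “make sure none of that WAL goes missing” part can start with the built-in archiver statistics.  This is a minimal sketch, not a full monitoring setup; what counts as “too stale” depends on your write volume and on whether you set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_timeout&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- A growing failed_count or a stale last_archived_time means the archive,
-- and the RPO that depends on it, is at risk.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       now() - last_archived_time AS time_since_last_archive,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;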

&lt;p&gt;Lower RPO isn’t just “better.”  It’s a design and operational commitment, and that commitment costs money, time, and people’s attention.&lt;/p&gt;

&lt;p&gt;The same is true of RTO.  Driving RTO below five minutes generally requires automation and – this part is important – rehearsal.  If you hand someone a document for the first time during an actual outage, they are not going to execute it quickly, no matter how clear the document is.&lt;/p&gt;

&lt;p&gt;This is why I think RPO and RTO really need to be &lt;em&gt;negotiated&lt;/em&gt;, not just declared.  On the surface it’s almost a no-brainer – of course everyone wants an RPO of zero and an RTO of seconds.  But when you actually go to leadership and lay out what those numbers cost, you tend to find out pretty quickly where their priorities really sit.  In a lot of cases, they’d rather spend that money on something that looks more directly tied to the business – a new feature, a marketing push, another engineer on the product team – and they’re willing to accept a softer RPO or RTO in exchange.  That’s a legitimate answer; it just needs to be made explicitly, instead of being assumed one way or the other by the infrastructure team.&lt;/p&gt;

&lt;h2 id=&quot;three-layers-of-dr-planning&quot;&gt;Three Layers of DR Planning&lt;/h2&gt;

&lt;p&gt;When I think about what a DR plan needs to cover, I find it useful to break it into three layers.&lt;/p&gt;

&lt;p&gt;The first layer is &lt;strong&gt;infrastructure failure&lt;/strong&gt;.  This is the one most teams think of first: a region goes down, storage fails, a corruption bug bites, credentials leak, somebody accidentally deletes a table, replication breaks.  Hardware and platform behaving badly.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;procedural failure&lt;/strong&gt;.  Even if the infrastructure problem is well-understood, you can still fail recovery because the procedure is wrong.  Maybe the sequence values weren’t included in the backup and you didn’t realize.  Maybe the runbook references a CNAME nobody can find the host for anymore.  We used to have a setup at Turnitin where, on every failover, we had to repoint a CNAME to the new primary, and we eventually realized that nobody had documented which CNAME pointed to which underlying host.  Maybe the validation step is vague.  Procedural failure tends to be invisible until the moment you actually need the procedure.&lt;/p&gt;
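
&lt;p&gt;The sequence example is the kind of thing a post-restore validation step can check directly.  Here is a sketch using a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comments&lt;/code&gt; table whose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; column is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigserial&lt;/code&gt; (the table and column names are made up for illustration):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical table: comments(id bigserial primary key, ...).
-- After a restore, compare the sequence position with the data it feeds.
SELECT (SELECT last_value FROM comments_id_seq) AS sequence_last_value,
       (SELECT max(id) FROM comments)           AS max_id_in_table;

-- If the sequence is behind, move it past the highest existing id.
SELECT setval(pg_get_serial_sequence(&apos;comments&apos;, &apos;id&apos;),
              (SELECT coalesce(max(id), 1) FROM comments));
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A runbook step this specific is much harder to get wrong at 3AM than “verify sequences.”&lt;/p&gt;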

&lt;p&gt;The third is &lt;strong&gt;human failure&lt;/strong&gt;.  People behave differently under duress.  Some panic.  Some zero in so hard on one screen that they miss the bigger picture.  There are conflicting instructions between managers, between teams, between people trying to be helpful.  There’s the 3AM call where the on-call is barely awake and not entirely sure what’s going on.  And there’s the person who can’t wait for the process and decides to just do something heroic and fast – which sometimes works, and sometimes makes things significantly worse.&lt;/p&gt;

&lt;p&gt;To make the layers concrete: I had a 3AM incident at Turnitin once where we rolled out a change in the evening and got paged a few hours later.  The disk had filled, and the filesystem ended up unmounted.  That was the infrastructure failure.  In the scramble to bring it back, somebody tried to remount it as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ext4&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xfs&lt;/code&gt; – that was strike one, a procedural failure, because the runbook didn’t make the filesystem type explicit.  Then we sat for a while waiting on the CTO, because nobody on the bridge had clear authority to call any of the next steps – strike two, no incident commander.  And then somebody prematurely brought the web servers back up before the database was really healthy, causing a second round of errors – strike three, the hero move.  No single one of those was catastrophic; together they turned a one-hour problem into a much longer night.  That’s what the three layers look like in practice.&lt;/p&gt;

&lt;h2 id=&quot;recovery-isnt-always-about-failing-over&quot;&gt;Recovery Isn’t Always About Failing Over&lt;/h2&gt;

&lt;p&gt;A lot of DR talks (and a lot of DR vendors) make it sound like “recovery” basically means “fail over to the standby.”  That’s one tool in the box, but it’s nowhere near the whole box.&lt;/p&gt;

&lt;p&gt;Here’s a story that’s stuck with me.  I was on a small team that shipped a release, and the migration looked clean – everything came up, the smoke tests passed, we went home feeling pretty good.  Later, somebody noticed that the application code had a small typo in its SQL: an extra apostrophe was getting written into every comment in a comment thread.  The data wasn’t lost.  The system was up.  But the data was &lt;em&gt;wrong&lt;/em&gt;, and it kept getting more wrong every minute the application stayed online.&lt;/p&gt;

&lt;p&gt;In that particular case, a careful &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; across the table was probably the right call, with all the locking and performance impact that implies.  But if you change the details a little – say the corruption is medical records, or it isn’t discovered for a few days, or some of those rows have already been read by other systems and propagated outward – a simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; stops being the answer.  Now you’re asking whether you have enough WAL retained to do point-in-time recovery, whether you can safely update some rows in place, when exactly the corruption started, and so on.&lt;/p&gt;
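
&lt;p&gt;For the simple case, a “careful” &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; might look something like the sketch below.  The table and columns are hypothetical, and it assumes the bug doubled every apostrophe; the real fix depends on exactly what the typo wrote.  Note that it would also collapse any doubled apostrophes a user typed on purpose:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical table: comments(id, body, created_at).
-- Assumption: the bug doubled every apostrophe; chr(39) is the apostrophe character.
BEGIN;

-- How many rows would change?  Compare this with what you expect.
SELECT count(*)
FROM comments
WHERE strpos(body, chr(39) || chr(39)) != 0;

UPDATE comments
SET body = replace(body, chr(39) || chr(39), chr(39))
WHERE strpos(body, chr(39) || chr(39)) != 0;

-- Spot-check a few rows here, then COMMIT.  ROLLBACK if anything looks off.
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Ending the script with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROLLBACK&lt;/code&gt; by default makes the dry run the safe path; switching to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COMMIT&lt;/code&gt; is then a deliberate decision.&lt;/p&gt;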

&lt;p&gt;I bring it up because that scenario is just as much a disaster, and just as worth planning for, as a disk failing or a region going dark.  And it can’t be solved by failing over – the standby would have the bad data too.&lt;/p&gt;

&lt;p&gt;While I’m telling stories about quietly-bad situations: another underrated failure mode is “the engineer who knew this part of the system went on vacation,” or quit, or moved teams, and the documentation never quite got updated.  Real DR plans have to assume some of that, too.&lt;/p&gt;

&lt;h2 id=&quot;what-lower-numbers-actually-cost&quot;&gt;What Lower Numbers Actually Cost&lt;/h2&gt;

&lt;p&gt;Negotiating RPO and RTO sounds abstract until you start listing the consequences.  Wanting an RPO of zero pushes you toward synchronous replication and forces you to live with the latency that comes with it.  Wanting an RTO of under five minutes pushes you toward automation that has to be built, tested, and maintained, and toward rehearsal cadence that has to be on someone’s calendar.  Multi-region pushes operational complexity up significantly – you’ve got clusters in different regions talking to each other, you’ve got cross-region replication lag to tolerate, and now your monitoring story has to account for all of it.  Even something as innocuous as “we’d like to be able to do point-in-time recovery to any second over the last 30 days” can mean keeping terabytes of WAL around and paying for storage you barely look at.&lt;/p&gt;
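
&lt;p&gt;To make the WAL retention cost concrete, here is a back-of-the-envelope sketch.  The rate of 2 GB of WAL per hour is a made-up number; substitute whatever your own cluster generates:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Assumption: the cluster generates about 2 GB of WAL per hour (a made-up rate).
-- 2 GB per hour, 24 hours a day, for 30 days.
SELECT pg_size_pretty(2::bigint * 1024 * 1024 * 1024 * 24 * 30) AS wal_for_30_days;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That comes to 1,440 GB, or about 1.4 TB, before compression and before counting the base backups you need alongside it.&lt;/p&gt;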

&lt;p&gt;None of this is a reason not to do these things.  It’s just a reason to have honest conversations about which of them you actually need.&lt;/p&gt;

&lt;h2 id=&quot;to-be-continued&quot;&gt;To Be Continued&lt;/h2&gt;

&lt;p&gt;That covers what I think of as the framing half of the talk: what counts as a disaster, why preparation and prevention aren’t the same as recovery, and how RPO and RTO end up being negotiations rather than declarations.&lt;/p&gt;

&lt;p&gt;In two weeks, I’ll get into the part that I think could reduce RTO, and that can’t be replaced by AI: runbook engineering, game days, what to measure, and the cultural piece that holds it all together.&lt;/p&gt;

</description>
        <pubDate>Mon, 15 Jun 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/06/15/disaster_recovery_is_a_process.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/06/15/disaster_recovery_is_a_process.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>disaster-recovery</category>
        
        <category>dr</category>
        
        <category>rto</category>
        
        <category>rpo</category>
        
        <category>high-availability</category>
        
        <category>operations</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>PGDay Boston 2026</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;PGDay Boston 2026 was a rewarding reminder of why I value the PostgreSQL community so much. It was delightful to reconnect with familiar faces, meet new people, and finally put faces to some names. One of the best parts of the day was the sense that this community is larger than any one employer or project. It is built on shared curiosity, shared responsibility, and a willingness to help one another learn.  I’m honored to have been able to share my own thoughts in my Disaster Recovery talk as well.&lt;/p&gt;

&lt;p&gt;The keynote, Michael Stonebraker’s “Where Did Postgres Come From?”, was a standout for me. I especially appreciated the history of Postgres and the years before Postgres, during the Ingres era. It was striking to hear how the project could have ended up as just another academic system, yet instead grew into something enduring because people outside of UC Berkeley took ownership of it and built a broader community around it. That story felt like a good reminder that open source succeeds not only through technical merit, but through stewardship and continuity.&lt;/p&gt;

&lt;p&gt;I also enjoyed Brian Brennglass’s talk, “Managing and Observing Locks.” His demos made an intimidating topic much easier to follow, and I found the practical framing especially useful. Shree Vidhya Sampath’s session on leveraging Patroni’s synchronous replication while running PostgreSQL on Kubernetes was another highlight. I appreciated the clear discussion of election behavior, synchronous replication, and failover scenarios, including failure modes I had not experimented with myself.&lt;/p&gt;

&lt;p&gt;Robert Haas’ “pg_plan_advice: Plan Stability and User Planner Control for PostgreSQL?” was impressive for its attention to detail, especially the way he tested edge cases that people might not think to check. Bruce Momjian’s “What’s Missing in Postgres?” was also thought-provoking because it framed missing features not as oversights, but often as deliberate choices shaped by the needs of the broader community. Ryan Booz’s “Mastering PostgreSQL Partitioning: Supercharge Performance and Simplify Maintenance” rounded out the day well with a useful refresher on partitioning behavior, tradeoffs, and current workarounds.&lt;/p&gt;

&lt;p&gt;Overall, the event benefited me in both personal and professional ways. Professionally, it deepened my understanding of PostgreSQL internals, operational patterns, and ecosystem tooling. Personally, it renewed my appreciation for the community that has grown around Postgres and the care that goes into keeping it healthy. Thank you to Tom Kincaid, Ken Rugg, Erik Pohi, Greg Burd, Kanchan Mohitey, Shihao Zhong, and so many others – along with PGUS – who worked hard to make the first PGDay Boston a smashing success! I look forward to staying involved and attending future events in Boston and beyond.&lt;/p&gt;
</description>
        <pubDate>Wed, 10 Jun 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/06/10/pgday_boston_2026_writeup.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/06/10/pgday_boston_2026_writeup.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>pgday</category>
        
        <category>boston</category>
        
        <category>conference</category>
        
        <category>community</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Foreign Tables and Materialized Views: A Dynamic Duo</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I recently wrote a post about &lt;a href=&quot;/postgres/2026/04/06/wal_archiving.html&quot;&gt;WAL log shipping&lt;/a&gt; and how a standby built on log shipping is a great way to give data analysts production data without putting the primary at risk.  Having access to the production data in this way is great, but it’s read-only.  How can we create views of this data for better analytics work?  I want to make the case today that Foreign Data Wrappers and Materialized Views can make a great solution – not only in accessing production Postgres data, but also working with other data sources.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;moving-beyond-fdw-demos&quot;&gt;Moving Beyond FDW Demos&lt;/h2&gt;

&lt;p&gt;Most people meet foreign data wrappers (FDWs) through a quick demo, and I’ve &lt;a href=&quot;https://speakerdeck.com/richyen/2023-pgday-chicago-fdw&quot;&gt;highlighted some of their features in previous conference talks&lt;/a&gt;.  There is high novelty in being able to query MySQL from Postgres, but the reality is often that the latency between the local database and the foreign table can be pretty high.  Sometimes, predicate push-down isn’t what you’d expect, and indexing may not be very transparent.  In the end, setting up and managing FDWs may seem like more work than it’s worth, but writing them off is a mistake.  Used correctly, foreign tables are one of the most practical tools for &lt;strong&gt;analytics across heterogeneous data sources&lt;/strong&gt; – especially when paired with materialized views.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-real-problem-heterogeneous-data&quot;&gt;The Real Problem: Heterogeneous Data&lt;/h2&gt;

&lt;p&gt;Modern data rarely lives in one place:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Legacy systems in MySQL&lt;/li&gt;
  &lt;li&gt;Operational data in PostgreSQL&lt;/li&gt;
  &lt;li&gt;Flat files sitting in object storage (I’ve seen people do this with AWS Athena)&lt;/li&gt;
  &lt;li&gt;Maybe even some CSVs someone refuses to migrate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Foreign tables give you a unified SQL interface, but under the hood, the query performance can be unpredictable as you may be forced to rely on another engine’s query planner (and in the case of that CSV data source, it might not even be indexed).&lt;/p&gt;

&lt;p&gt;In other words, FDWs optimize developer experience, not query performance.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-pattern-fdw--materialized-views&quot;&gt;The Pattern: FDW + Materialized Views&lt;/h2&gt;

&lt;p&gt;Instead of querying foreign tables directly in analytics workloads, we can opt to use FDWs as ingestion points, not as the serving layer itself.  To achieve this, we can do the following:&lt;/p&gt;

&lt;h3 id=&quot;step-1-define-the-foreign-table&quot;&gt;Step 1: Define the foreign table&lt;/h3&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ext_orders&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bigint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;numeric&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;timestamp&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SERVER&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mysql_server&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;OPTIONS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;table&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;orders&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
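
&lt;p&gt;Step 1 assumes a foreign server named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql_server&lt;/code&gt; already exists.  If it doesn’t, the setup could look roughly like this sketch, which assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mysql_fdw&lt;/code&gt; extension is installed; the host, port, and credentials are placeholders.  Table-level option names vary between wrappers, so check the documentation of the one you use:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Assumes mysql_fdw is installed; host, port, and credentials are placeholders.
CREATE EXTENSION IF NOT EXISTS mysql_fdw;

CREATE SERVER mysql_server
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host &apos;10.0.0.5&apos;, port &apos;3306&apos;);

CREATE USER MAPPING FOR CURRENT_USER
  SERVER mysql_server
  OPTIONS (username &apos;report_reader&apos;, password &apos;change-me&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;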

&lt;h3 id=&quot;step-2-build-a-materialized-view&quot;&gt;Step 2: Build a materialized view&lt;/h3&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MATERIALIZED&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders_mv&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order_date&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ext_orders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;step-3-index-it-like-a-real-table&quot;&gt;Step 3: Index it like a real table&lt;/h3&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders_mv&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders_mv&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customer_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we’ve turned a slow, remote dataset into a locally optimized analytical structure.&lt;/p&gt;

&lt;p&gt;The materialized view lives inside PostgreSQL, supports full indexing, eliminates network latency during queries, and gives predictable performance.  We essentially have a read-optimized cache on top of the foreign tables.  We can do this with the read-only Postgres replicas as well, to slice up the columns and rows to fit nicely in a view that analysts would want to use.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;refreshing-without-blocking&quot;&gt;Refreshing Without Blocking&lt;/h2&gt;

&lt;p&gt;When it comes to caching, data gets stale, and we’re sort of back at the same problem every ETL pipeline faces.  However, Postgres can refresh a materialized view without blocking reads, simply with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CONCURRENTLY&lt;/code&gt; option.  This results in production-quality data with a little bit of staleness, but the nice thing is that it’s all built into the Postgres cluster (no separate ETL pipeline to manage, just all the data accessible from one central place).  Note, however, that in order to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CONCURRENTLY&lt;/code&gt;, &lt;a href=&quot;https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html&quot;&gt;the materialized view needs at least one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNIQUE&lt;/code&gt; index&lt;/a&gt;.&lt;/p&gt;
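
&lt;p&gt;Continuing with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders_mv&lt;/code&gt; example from above, a minimal sketch of a non-blocking refresh could look like this.  It assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; is unique in the upstream table; if it isn’t, the index creation will fail and you’ll need a different key:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Assumption: id is unique in the upstream orders table.
CREATE UNIQUE INDEX ON orders_mv (id);

-- Readers can keep querying orders_mv while this runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_mv;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;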

&lt;hr /&gt;

&lt;h2 id=&quot;good-applications-for-the-pairing&quot;&gt;Good Applications for the Pairing&lt;/h2&gt;

&lt;p&gt;The pairing of FDWs and indexed Materialized Views could be very beneficial in a handful of use cases:&lt;/p&gt;

&lt;h3 id=&quot;1-poorly-indexed-remote-systems&quot;&gt;1. Poorly Indexed Remote Systems&lt;/h3&gt;

&lt;p&gt;If your upstream system:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Lacks proper indexes&lt;/li&gt;
  &lt;li&gt;Is shared with OLTP workloads&lt;/li&gt;
  &lt;li&gt;Is not under your control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach isolates analytics from those constraints.&lt;/p&gt;

&lt;h3 id=&quot;2-high-latency-data-sources&quot;&gt;2. High-Latency Data Sources&lt;/h3&gt;

&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Cross-region databases&lt;/li&gt;
  &lt;li&gt;Cloud object storage via FDWs&lt;/li&gt;
  &lt;li&gt;Athena-backed datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of paying the latency cost on every query, you pay it once per refresh.&lt;/p&gt;

&lt;h3 id=&quot;3-flat-files-and-large-data&quot;&gt;3. Flat Files and Large Data&lt;/h3&gt;

&lt;p&gt;Yes, people do this:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Querying CSVs via FDWs&lt;/li&gt;
  &lt;li&gt;Treating object storage as a “database”&lt;/li&gt;
  &lt;li&gt;Large JSONB sets that are hard to index well&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;Foreign tables aren’t just novelty – they’re a powerful bridge across messy, real-world data systems.&lt;/p&gt;

&lt;p&gt;It is important to distinguish the two roles: FDWs are the &lt;strong&gt;data access layer&lt;/strong&gt;, while Materialized Views are the &lt;strong&gt;analytics engine&lt;/strong&gt;.  If you 1) layer materialized views on top of FDWs, 2) add proper indexing, and 3) refresh intelligently (preferably concurrently), you can get the best of both worlds: the flexibility of federated queries and the performance of local analytics.&lt;/p&gt;
</description>
        <pubDate>Mon, 25 May 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/25/fdw_mv_analytics.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/25/fdw_mv_analytics.html</guid>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>XID Wraparound&apos;s Equally-Evil Twin</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;If you’ve been running PostgreSQL for any length of time, you’ve probably heard about transaction ID (XID) wraparound.  It’s one of the most well-known maintenance concerns in Postgres, and there’s no shortage of blog posts, conference talks, and war stories about it.  But there’s a quieter, less-discussed cousin that can cause the exact same kind of outage: &lt;strong&gt;MultiXact ID wraparound&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’ve seen this surprise more than a few experienced DBAs.  They’ve got their autovacuum tuned, they’re monitoring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;age(datfrozenxid)&lt;/code&gt;, and they’re feeling good – and then out of nowhere, Postgres starts refusing certain writes because it’s approaching MultiXact ID wraparound.&lt;/p&gt;

&lt;p&gt;The fix is the same as regular XID wraparound – a simple vacuum.  But the reason is different, and understanding it can help you keep your monitoring complete.&lt;/p&gt;

&lt;h2 id=&quot;whats-a-multixact-id&quot;&gt;What’s a MultiXact ID?&lt;/h2&gt;

&lt;p&gt;In Postgres, every row has a system column called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt;.  In the simplest case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt; holds the transaction ID of the transaction that deleted or updated the row.  But what happens when &lt;em&gt;multiple&lt;/em&gt; transactions hold locks on the same row at the same time?&lt;/p&gt;

&lt;p&gt;Consider &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR SHARE&lt;/code&gt;.  Multiple transactions can hold a shared lock on the same row concurrently.  Postgres needs to record &lt;em&gt;all&lt;/em&gt; of those transactions somewhere, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt; is only wide enough to store a single transaction ID.  The solution is the &lt;strong&gt;MultiXact&lt;/strong&gt; mechanism.&lt;/p&gt;

&lt;p&gt;A MultiXact ID is essentially a pointer into a separate structure (stored as files under the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_multixact/&lt;/code&gt; directory) that maps to a &lt;em&gt;list&lt;/em&gt; of transaction IDs and their lock modes.  When multiple transactions need to lock a row, Postgres:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Allocates a new MultiXact ID&lt;/li&gt;
  &lt;li&gt;Records the set of transaction IDs (and their lock types) in the MultiXact member data&lt;/li&gt;
  &lt;li&gt;Stores the MultiXact ID in the row’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt; field, with a flag (specifically, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HEAP_XMAX_IS_MULTI&lt;/code&gt; infomask bit in the tuple header) indicating it’s a multi-xact reference rather than a plain XID&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This lets the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt; field stay a fixed 32-bit value while still representing an arbitrary number of concurrent row-level lockers.&lt;/p&gt;
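
&lt;p&gt;One way to watch this happen is with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgrowlocks&lt;/code&gt; contrib extension.  The sketch below assumes a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accounts&lt;/code&gt; table with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; column, three separate sessions, and a role that is allowed to call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pgrowlocks()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Session A: take a shared row lock and leave the transaction open
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;

-- Session B: take a second shared lock on the same row while A is still open
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;

-- Session C: inspect the row locks
CREATE EXTENSION IF NOT EXISTS pgrowlocks;
SELECT locked_row, locker, multi, xids, modes
  FROM pgrowlocks(&apos;accounts&apos;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With both lockers active, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;multi&lt;/code&gt; should be true, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;locker&lt;/code&gt; holds the MultiXact ID, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xids&lt;/code&gt; lists the member transactions.&lt;/p&gt;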

&lt;h2 id=&quot;when-are-multixact-ids-created&quot;&gt;When Are MultiXact IDs Created?&lt;/h2&gt;

&lt;p&gt;MultiXact IDs come into play in several scenarios:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR SHARE&lt;/code&gt;&lt;/strong&gt; – The classic case.  Multiple transactions can hold shared row locks simultaneously.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR KEY SHARE&lt;/code&gt;&lt;/strong&gt; – Used implicitly by foreign key checks.  If you have a parent table with foreign key references, every insert or update on the child table takes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR KEY SHARE&lt;/code&gt; lock on the referenced parent row.  On a busy system with many concurrent inserts referencing the same parent rows, this generates MultiXact IDs rapidly.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Combination locks&lt;/strong&gt; – If one transaction holds a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR KEY SHARE&lt;/code&gt; lock and another holds a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR NO KEY UPDATE&lt;/code&gt; lock on the same row, the two locks don’t conflict, and the resulting multi-lock is stored as a MultiXact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The foreign key scenario is particularly noteworthy because it’s &lt;em&gt;invisible&lt;/em&gt; to most application developers.  You won’t see any queries explicitly calling out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR SHARE&lt;/code&gt; in application code, but Postgres is silently creating MultiXact IDs behind the scenes to manage the implicit locks.&lt;/p&gt;

&lt;h2 id=&quot;multixact-ids-need-freezing-too&quot;&gt;MultiXact IDs Need Freezing Too!&lt;/h2&gt;

&lt;p&gt;Just like transaction IDs, MultiXact IDs are 32-bit counters.  And just like XIDs, they wrap around.  Postgres can only “see” about 2 billion MultiXact IDs into the past.  If a row still references a MultiXact ID that’s about to fall off the visible horizon, Postgres has a problem: it can no longer determine whether the locks represented by that MultiXact are still relevant.&lt;/p&gt;

&lt;p&gt;To prevent this, Postgres needs to &lt;strong&gt;freeze&lt;/strong&gt; MultiXact IDs, just as it freezes regular XIDs.  Freezing a MultiXact means replacing the MultiXact reference in the row’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xmax&lt;/code&gt; with either the zero value, a single transaction ID, or a newer multixact ID, depending on whether the lock information is still meaningful.&lt;/p&gt;

&lt;p&gt;The relevant settings mirror those for XID freezing:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;XID Setting&lt;/th&gt;
      &lt;th&gt;MultiXact Equivalent&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vacuum_freeze_min_age&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vacuum_multixact_freeze_min_age&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vacuum_freeze_table_age&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vacuum_multixact_freeze_table_age&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_freeze_max_age&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_multixact_freeze_max_age&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;When the MultiXact age of a table exceeds &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_multixact_freeze_max_age&lt;/code&gt;, autovacuum will trigger an aggressive (whole-table) vacuum specifically to freeze old MultiXact IDs – even if the table has no dead tuples and wouldn’t otherwise qualify for autovacuum.&lt;/p&gt;
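
&lt;p&gt;To see how these settings are currently configured on a given cluster, a quick look at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_settings&lt;/code&gt; is enough.  This is just a convenience query over the six settings in the table above:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Compare the XID and MultiXact freeze settings side by side
SELECT name, setting, boot_val AS default_value
  FROM pg_settings
 WHERE name IN (&apos;vacuum_freeze_min_age&apos;,
                &apos;vacuum_multixact_freeze_min_age&apos;,
                &apos;vacuum_freeze_table_age&apos;,
                &apos;vacuum_multixact_freeze_table_age&apos;,
                &apos;autovacuum_freeze_max_age&apos;,
                &apos;autovacuum_multixact_freeze_max_age&apos;)
 ORDER BY name;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;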

&lt;h2 id=&quot;dont-let-multixact-fly-under-the-radar&quot;&gt;Don’t Let MultiXact Fly Under the Radar&lt;/h2&gt;

&lt;p&gt;Monitoring MultiXact age takes very little effort.  At the database level, the query is straightforward:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datfrozenxid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xid_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datminmxid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_database&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For per-table granularity:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;regclass&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;table_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relfrozenxid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xid_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relminmxid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_class&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relkind&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;r&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;t&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;m&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mxid_age&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Keep an eye on any table where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mxid_age&lt;/code&gt; is approaching &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_multixact_freeze_max_age&lt;/code&gt; (default: 400 million).  If it gets close, autovacuum &lt;em&gt;should&lt;/em&gt; kick in, but on large tables or systems with constrained autovacuum workers, it may not complete in time.&lt;/p&gt;
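
&lt;p&gt;To make “approaching” easier to eyeball, the per-table query can be extended to report the age as a percentage of the threshold.  This is a sketch that compares against the cluster-wide setting only; it does not account for per-table storage parameter overrides:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- MultiXact age as a percentage of autovacuum_multixact_freeze_max_age
SELECT c.oid::regclass AS table_name,
       mxid_age(c.relminmxid) AS mxid_age,
       round(100.0 * mxid_age(c.relminmxid)
             / current_setting(&apos;autovacuum_multixact_freeze_max_age&apos;)::bigint,
             1) AS pct_of_freeze_max_age
  FROM pg_class c
 WHERE c.relkind IN (&apos;r&apos;, &apos;t&apos;, &apos;m&apos;)
 ORDER BY mxid_age DESC
 LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;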

&lt;h2 id=&quot;practical-recommendations&quot;&gt;Practical Recommendations&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Add MultiXact monitoring alongside XID monitoring.&lt;/strong&gt;  If your alerting triggers at, say, 500 million XID age, add a similar alert for MultiXact age.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Watch your foreign key parent tables.&lt;/strong&gt;  If you have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accounts&lt;/code&gt; table that’s referenced by every other table in the schema, it’s likely accumulating MultiXact IDs faster than you’d expect.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Consider &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_multixact_freeze_max_age&lt;/code&gt; tuning.&lt;/strong&gt;  The default of 400 million is higher than the XID &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_freeze_max_age&lt;/code&gt; default of 200 million.  But in workloads with heavy foreign key activity, you may want to lower it – or configure per-table autovacuum settings on hot parent tables.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Don’t ignore “unnecessary” vacuums.&lt;/strong&gt;  If you see autovacuum running on a table that has zero dead tuples, don’t assume it’s wasting resources.  It may be performing MultiXact freezing work that’s critical for preventing wraparound.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
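
&lt;p&gt;As a rough illustration of the per-table tuning mentioned above, assuming a hot parent table named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accounts&lt;/code&gt; (the table name and the 200 million value are placeholders, not recommendations), it could look something like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Ask autovacuum to start MultiXact freezing earlier on this one table
ALTER TABLE accounts SET (autovacuum_multixact_freeze_max_age = 200000000);

-- Or freeze manually during a quiet period
VACUUM (FREEZE, VERBOSE) accounts;

-- Confirm that the table&apos;s MultiXact age dropped afterwards
SELECT relname, mxid_age(relminmxid) AS mxid_age
  FROM pg_class
 WHERE oid = &apos;accounts&apos;::regclass;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;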

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;MultiXact ID wraparound is the kind of problem that bites you precisely because you didn’t know to look for it.  The mechanism exists for a good reason – efficiently tracking shared row locks is fundamental to Postgres’s concurrency model.  But the maintenance burden it creates is real, and it demands the same vigilance as XID wraparound.&lt;/p&gt;

&lt;p&gt;If you take one thing away from this post: go check &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mxid_age(datminmxid)&lt;/code&gt; on your databases today.  If you’ve never looked at it before, now’s a good time to start.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/18/multixact_wraparound.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/18/multixact_wraparound.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>vacuum</category>
        
        <category>multixact</category>
        
        <category>wraparound</category>
        
        <category>maintenance</category>
        
        <category>autovacuum</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Making JSONB More Queryable with Generated Columns</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Over the past year, I’ve worked in a handful of contexts managing large volumes of data stored as JSONB in PostgreSQL. The scenario is common: users appreciate the flexibility of a document-oriented storage model, avoiding the need to predefine schemas or constantly migrate table structures as their data requirements evolve. JSONB documents can be deeply nested with numerous optional fields, and they scale to hundreds of kilobytes per record without issue. However, when the time comes to query these documents – filtering by user ID, event type, timestamps, or nested action properties – the queries can become slow and/or cumbersome to work with.&lt;/p&gt;

&lt;p&gt;The problem I want to address is: “How do we make searching JSONB data more efficient without breaking apart our documents or forcing them into relational columns?” There are several approaches available in Postgres, each with different tradeoffs. I hope to shed some light on those approaches in this article.&lt;/p&gt;

&lt;h2 id=&quot;the-setup&quot;&gt;The Setup&lt;/h2&gt;

&lt;p&gt;I created a basic, no-frills table for the sake of this test:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the document shape I used for testing and writing this post – it’s representative of the event logs and audit trails I’ve encountered: a mix of primitive fields, nested objects, and metadata that accumulates over time.&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;user_id&quot;: 5234,
  &quot;event_type&quot;: &quot;event_42&quot;,
  &quot;timestamp&quot;: 1712341200,
  &quot;session_id&quot;: &quot;sess_abc123...&quot;,
  &quot;ip_address&quot;: &quot;192.168.1.42&quot;,
  &quot;action&quot;: {
    &quot;type&quot;: &quot;click&quot;,
    &quot;target_id&quot;: 87654,
    &quot;coordinates&quot;: {&quot;x&quot;: 512, &quot;y&quot;: 768},
    &quot;duration_ms&quot;: 1234
  },
  &quot;device&quot;: {
    &quot;type&quot;: &quot;mobile&quot;,
    &quot;os&quot;: &quot;iOS&quot;,
    &quot;screen_width&quot;: 1920,
    &quot;screen_height&quot;: 1080
  },
  &quot;performance&quot;: {
    &quot;page_load_time&quot;: 1234,
    &quot;dns_lookup&quot;: 123,
    &quot;tcp_connection&quot;: 234,
    &quot;server_response&quot;: 876
  },
  &quot;custom_fields&quot;: { ... }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The queries that matter are straightforward equality and range filters on known fields: find all events for a given user, filter by event type, narrow to a time window. With this setup, we’ll try to discern which kind of index actually serves the specific access pattern, and what the real cost of each option is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All tests run on PostgreSQL 18.2 in Docker on an Apple M-series host. Tables contain 50,000 rows with realistic JSONB event documents. Query benchmarks run 20 times on a warm cache and report avg/min/max. Insert benchmarks run 5 trials of 5,000 rows each. Schema and scripts are included throughout so you can reproduce these results.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;three-approaches-to-indexing-jsonb&quot;&gt;Three Approaches to Indexing JSONB&lt;/h2&gt;

&lt;p&gt;There are three realistic options for this access pattern. Let’s look at each in turn – what it costs to build/maintain, what queries it actually helps, and where it falls down.&lt;/p&gt;

&lt;h3 id=&quot;option-1-gin-indexes&quot;&gt;Option 1: GIN Indexes&lt;/h3&gt;

&lt;p&gt;The natural candidate for indexing a JSONB column is a GIN (Generalized Inverted Index).  GIN is designed for composite values such as JSONB documents, arrays, and full-text search vectors.  With the default operator class, it indexes every key and value in every document, making the entire structure searchable:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- or the path-only variant:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_gin_path&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;USING&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GIN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jsonb_path_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a refresher, I’ll mention that GIN is designed for containment and key existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;), not for equality on extracted fields:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- This query uses a GIN index correctly:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;{&quot;user_id&quot;: 5234}&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- This query does NOT use a GIN index, even if one exists:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the containment form, the GIN index is used and the query is fast – but still slower than a B-tree on the same field, because GIN lookups involve more bookkeeping:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- GIN jsonb_ops + containment operator
Bitmap Index Scan on idx_gin
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)

Planning Time: 1.173 ms  |  Execution Time: 1.295 ms

-- GIN jsonb_path_ops + containment operator
Bitmap Index Scan on idx_gin_path
  Index Cond: (data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;)
Planning Time: 3.342 ms  |  Execution Time: 0.450 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller and faster for containment queries, but it trades away support for key-existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?|&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;?&amp;amp;&lt;/code&gt;). Neither GIN variant can help with range predicates like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ts &amp;gt; 1700000000&lt;/code&gt; – those always fall through to a filter step.&lt;/p&gt;
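
&lt;p&gt;To make the operator support concrete, here is a short sketch against the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt; table and the two GIN indexes defined above.  The comments describe which index the planner &lt;em&gt;can&lt;/em&gt; consider; whether it actually picks one depends on statistics and table size, so it’s worth confirming with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Key existence: supported by idx_gin (default jsonb_ops), not by idx_gin_path
SELECT id FROM events WHERE data ? &apos;custom_fields&apos;;

-- Containment: supported by either GIN variant
SELECT id FROM events WHERE data @&amp;gt; &apos;{&quot;event_type&quot;: &quot;event_42&quot;}&apos;;

-- Range on an extracted value: neither GIN variant applies
SELECT id FROM events WHERE (data-&amp;gt;&amp;gt;&apos;timestamp&apos;)::BIGINT &amp;gt; 1700000000;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;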

&lt;h3 id=&quot;option-2-expression-indexes&quot;&gt;Option 2: Expression Indexes&lt;/h3&gt;

&lt;p&gt;Postgres lets you create an index on an expression, including JSONB extraction:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a B-tree index on the &lt;em&gt;result&lt;/em&gt; of evaluating the expression. When the query predicate matches the indexed expression exactly, and after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANALYZE&lt;/code&gt; has gathered statistics on it, the planner will use it:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on t_expr
  Recheck Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: ((data -&amp;gt;&amp;gt; &apos;user_id&apos;)::integer = 5234)
Planning Time: 1.168 ms  |  Execution Time: 0.341 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

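&lt;p&gt;At 0.341 ms, the execution time for this equality lookup is in the same range as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; GIN index (0.450 ms) on this 50,000-row table.&lt;/p&gt;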
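
&lt;p&gt;The “matches the indexed expression exactly” requirement is easy to trip over.  As a sketch against the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx_user_id&lt;/code&gt; expression index defined above, the first two predicates line up with the index definition, while the last two are different expressions as far as the planner is concerned:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Same expression as the index definition: idx_user_id is usable
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT) = 5234;

-- The :: shorthand is the same cast, so it matches as well
SELECT id FROM events WHERE (data-&amp;gt;&amp;gt;&apos;user_id&apos;)::INT = 5234;

-- Text comparison without the cast: different expression, no match
SELECT id FROM events WHERE data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;5234&apos;;

-- Cast to a different type: also a different expression, no match
SELECT id FROM events WHERE (data-&amp;gt;&amp;gt;&apos;user_id&apos;)::BIGINT = 5234;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;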

&lt;h3 id=&quot;option-3-generated-columns&quot;&gt;Option 3: Generated Columns&lt;/h3&gt;

&lt;p&gt;Generated columns (available since PostgreSQL 12) let you extract JSONB values into regular typed columns at write time. The values are stored physically alongside the row and kept in sync automatically:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;BIGSERIAL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;       &lt;span class=&quot;n&quot;&gt;JSONB&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;    &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;event_type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;         &lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;timestamp&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)::&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;BIGINT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;TEXT&lt;/span&gt;   &lt;span class=&quot;k&quot;&gt;GENERATED&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ALWAYS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;action&apos;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;type&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;STORED&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_user_id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_event_type&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_ts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_action&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Queries against generated columns are plain typed-column lookups. The planner treats them as ordinary columns with B-tree indexes and per-column statistics, so its row estimates are tight:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on t_gen
  Recheck Cond: (user_id = 5234)
  Heap Blocks: exact=3
  -&amp;gt;  Bitmap Index Scan on idx_user_id
        Index Cond: (user_id = 5234)
Planning Time: 1.159 ms  |  Execution Time: 0.407 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You also get native support for range queries and composite indexes at no extra complexity – just combine columns as you normally would:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Indexed range query on generated timestamp column&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;event_42&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1700000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 0.698 ms (vs 6.6 ms with GIN + post-filter)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;side-by-side-query-performance&quot;&gt;Side-by-Side: Query Performance&lt;/h2&gt;

&lt;p&gt;With all three approaches set up, here are the warm-cache query results averaged over 20 runs for an equality filter on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.198&lt;/td&gt;
      &lt;td&gt;0.101&lt;/td&gt;
      &lt;td&gt;1.769&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.197&lt;/td&gt;
      &lt;td&gt;0.032&lt;/td&gt;
      &lt;td&gt;3.115&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
      &lt;td&gt;0.106&lt;/td&gt;
      &lt;td&gt;0.018&lt;/td&gt;
      &lt;td&gt;1.705&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated column B-tree&lt;/td&gt;
      &lt;td&gt;0.112&lt;/td&gt;
      &lt;td&gt;0.016&lt;/td&gt;
      &lt;td&gt;1.839&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated columns perform very similarly for equality queries – both around 0.1 ms on warm cache. The real work is done in the B-tree lookup, and both approaches produce the same index structure. GIN with the correct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt; operator is close behind in PG 18.2 at about 0.2 ms on average – still slower than B-tree for this access pattern, but the gap has narrowed. GIN lookups still require a recheck step that B-tree lookups avoid, and the variance remains notable: GIN max of 3.1 ms vs B-tree max of 1.8 ms on warm cache.&lt;/p&gt;

&lt;p&gt;The more surprising result is what happens if the GIN index is present but the query is written with extraction-based equality:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- GIN index exists, but this query gets a seq scan:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;user_id&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5234&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Execution Time: 47.935 ms (same as no index at all)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

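&lt;p&gt;The GIN operator classes for JSONB don’t support that operator: they index containment and existence tests such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; on an extracted value. This is by far the most common confusion teams run into with JSONB indexing.&lt;/p&gt;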
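
&lt;p&gt;For comparison, here is a sketch of the same lookup written as a containment test, which is the form a GIN index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; can serve. It assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_id&lt;/code&gt; is stored as a JSON number, and the index name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx_events_data_gin&lt;/code&gt; is hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical index name; jsonb_path_ops is one of the two GIN operator classes benchmarked above
CREATE INDEX idx_events_data_gin ON events USING GIN (data jsonb_path_ops);

-- Containment form: eligible for the GIN index
SELECT id FROM events WHERE data @&amp;gt; &apos;{&quot;user_id&quot;: 5234}&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If your documents store the ID as a string, the right-hand side would need to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{&quot;user_id&quot;: &quot;5234&quot;}&lt;/code&gt; instead, since containment compares JSON values by type as well as content.&lt;/p&gt;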

&lt;h2 id=&quot;the-full-cost-picture-storage-and-writes&quot;&gt;The Full Cost Picture: Storage and Writes&lt;/h2&gt;

&lt;h3 id=&quot;storage&quot;&gt;Storage&lt;/h3&gt;

&lt;p&gt;Here’s what the same 50,000 rows cost on disk under each approach:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Table size&lt;/th&gt;
      &lt;th&gt;Index size&lt;/th&gt;
      &lt;th&gt;Total&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;21 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;20 MB&lt;/td&gt;
      &lt;td&gt;3.5 MB&lt;/td&gt;
      &lt;td&gt;23 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;13 MB&lt;/td&gt;
      &lt;td&gt;31 MB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;18 MB&lt;/td&gt;
      &lt;td&gt;36 MB&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Expression indexes and generated column B-tree indexes produce &lt;em&gt;identical&lt;/em&gt; index sizes for the same fields, which makes sense since the index structures are the same. The only extra cost of generated columns is the 2 MB of additional stored column data in the table (~40 bytes per row for four typed columns). GIN indexes are substantially larger: 13–18 MB for a single index vs 3.5 MB for four targeted B-tree indexes. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jsonb_path_ops&lt;/code&gt; variant is smaller because it stores only one hashed entry per value (combining the value with the keys leading to it) rather than separate entries for every key and value, but it still dwarfs the targeted approach.&lt;/p&gt;

&lt;p&gt;One caveat: these numbers reflect documents with short keys and compact values. Documents with verbose key names, deeply nested structures, or large string values will inflate GIN indexes proportionally more – because GIN indexes every key path. B-tree and expression indexes are unaffected by document verbosity, since they only store the extracted value.&lt;/p&gt;
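
&lt;p&gt;To see how your own documents compare, the built-in size functions give the same three-way breakdown. This is a sketch of one way to measure it, not necessarily how the numbers above were collected; it assumes the table is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;events&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Table data (including TOAST), all indexes on the table, and the combined total
SELECT pg_size_pretty(pg_table_size(&apos;events&apos;))          AS table_size,
       pg_size_pretty(pg_indexes_size(&apos;events&apos;))        AS index_size,
       pg_size_pretty(pg_total_relation_size(&apos;events&apos;)) AS total_size;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;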

&lt;h3 id=&quot;write-throughput&quot;&gt;Write Throughput&lt;/h3&gt;

&lt;p&gt;Here’s what 5,000 INSERTs per trial, 5 trials each, on a table already containing 50,000 rows looked like:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;Avg (ms)&lt;/th&gt;
      &lt;th&gt;Min (ms)&lt;/th&gt;
      &lt;th&gt;Max (ms)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Generated columns + B-tree (4)&lt;/td&gt;
      &lt;td&gt;157&lt;/td&gt;
      &lt;td&gt;91&lt;/td&gt;
      &lt;td&gt;317&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Expression indexes (4)&lt;/td&gt;
      &lt;td&gt;163&lt;/td&gt;
      &lt;td&gt;93&lt;/td&gt;
      &lt;td&gt;366&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_path_ops&lt;/td&gt;
      &lt;td&gt;171&lt;/td&gt;
      &lt;td&gt;73&lt;/td&gt;
      &lt;td&gt;408&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GIN jsonb_ops&lt;/td&gt;
      &lt;td&gt;334&lt;/td&gt;
      &lt;td&gt;225&lt;/td&gt;
      &lt;td&gt;525&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Generated columns and expression indexes are now very close in write cost, with generated columns slightly edging out on average. GIN jsonb_path_ops has become more competitive with both. However, the default GIN jsonb_ops variant is dramatically more expensive: 2× slower than expression indexes and generated columns. It must decompose the entire document into key-value pairs and insert entries for each one. The high variance is also worth noting: GIN jsonb_ops max of 525ms vs 366ms for expression indexes.&lt;/p&gt;

&lt;h2 id=&quot;choosing-the-right-approach&quot;&gt;Choosing the Right Approach&lt;/h2&gt;

&lt;p&gt;The benchmarks above tell a consistent story for workloads dominated by equality and range filters on a known set of fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Expression indexes&lt;/strong&gt; are the lowest-cost migration path. They add no schema structure, require no application changes to insert logic, and impose minimal write overhead. If your team already has a table in production and just needs to speed up a handful of known slow queries, a well-placed expression index is your first move. The catch: every query must exactly match the expression as written in the index definition, which can be fragile to maintain as codebases evolve.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Generated columns&lt;/strong&gt; take slightly more storage than expression indexes (about 2 MB more in the test above) at essentially the same write cost, and they offer something the others can’t: the extracted values become first-class columns. You can build composite indexes across them, reference them in views, expose them via ORMs, and sort or aggregate on them without embedding extraction logic everywhere. For new tables or for tables you’re willing to migrate, they’re the most maintainable long-term solution.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;GIN indexes&lt;/strong&gt; serve a different purpose. They’re the right tool when your query patterns are flexible or unknown – searching for the existence of a key, filtering on any field in an ad-hoc fashion, or supporting containment queries on arbitrarily-shaped documents. For those access patterns, they’re genuinely powerful and there’s no clean B-tree equivalent. But for consistent equality and range filters on known fields, they cost more in storage, impose higher write latency, and only serve containment and existence operators (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt; on an extracted value).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Situation&lt;/th&gt;
      &lt;th&gt;Recommended approach&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Unknown or ad-hoc field queries&lt;/td&gt;
      &lt;td&gt;GIN (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&amp;gt;&lt;/code&gt;, key existence)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, few queries, no schema change&lt;/td&gt;
      &lt;td&gt;Expression index&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields, high query volume, evolving codebase&lt;/td&gt;
      &lt;td&gt;Generated columns&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Known fields + range queries (e.g., timestamps)&lt;/td&gt;
      &lt;td&gt;Generated columns + composite B-tree&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mixed: some known fields + some ad-hoc&lt;/td&gt;
      &lt;td&gt;Generated columns + GIN (both)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;caveats-and-considerations&quot;&gt;Caveats and Considerations&lt;/h2&gt;

&lt;p&gt;Regardless of which approach you choose, a few things apply broadly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real win is making data typed and relational again.&lt;/strong&gt; Generated columns aren’t magic. The reason they (and expression indexes) outperform GIN for equality filters is that they produce typed scalar values with precise statistics, letting the planner make accurate row-count estimates and choose cheap comparison operations. JSONB is flexible but opaque; once you extract a field into a typed column or expression, Postgres can reason about it properly.&lt;/p&gt;

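&lt;p&gt;&lt;strong&gt;Expression indexes require exact predicate matching.&lt;/strong&gt; An index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT)&lt;/code&gt; will not be used by a query that compares the raw text, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;5234&apos;&lt;/code&gt;, or by one that casts to a different type, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(data-&amp;gt;&amp;gt;&apos;user_id&apos;)::bigint&lt;/code&gt;. The extraction operator and the target type must be the same as in the index definition; purely syntactic variants (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CAST(... AS INT)&lt;/code&gt; versus &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::int&lt;/code&gt;) parse to the same expression and do match. Generated columns avoid this fragility – any query that references the column name will benefit.&lt;/p&gt;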
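
&lt;p&gt;A minimal sketch of what matching looks like in practice. The index name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx_events_user_id_expr&lt;/code&gt; is hypothetical, and running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN&lt;/code&gt; on each query is the reliable way to confirm which ones the planner can serve from the index on your own data:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical index name, for illustration only
CREATE INDEX idx_events_user_id_expr ON events ((cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT)));

-- Same expression as the index definition: eligible for the index
SELECT id FROM events WHERE cast(data-&amp;gt;&amp;gt;&apos;user_id&apos; AS INT) = 5234;

-- Compares the raw text with no cast: a different expression, so the index is not considered
SELECT id FROM events WHERE data-&amp;gt;&amp;gt;&apos;user_id&apos; = &apos;5234&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;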

&lt;p&gt;&lt;strong&gt;Generated column expressions must be immutable.&lt;/strong&gt; The expression cannot reference functions that depend on time, session state, or anything external. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NOW()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CURRENT_USER&lt;/code&gt;, and similar functions are off-limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated columns cannot be directly updated.&lt;/strong&gt; Their value is always derived from the source column. If you UPDATE the JSONB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt;, the generated columns recompute automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GIN maintenance overhead compounds on write-heavy tables.&lt;/strong&gt; GIN indexes build an internal pending list and flush it periodically (controlled by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gin_pending_list_limit&lt;/code&gt;). Under sustained write load, this flushing can cause the latency spikes visible in the benchmark max values above. B-tree indexes don’t have this mechanism.&lt;/p&gt;
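
&lt;p&gt;If those spikes matter for your workload, the pending list can be tuned per index. The following is a sketch with illustrative values rather than recommendations, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idx_events_data_gin&lt;/code&gt; is a hypothetical index name. The two options are alternatives, so pick one:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- idx_events_data_gin is a hypothetical GIN index on events(data)

-- Option 1: raise the per-index pending-list limit (value in kB) so flushes happen less often
ALTER INDEX idx_events_data_gin SET (gin_pending_list_limit = 16384);

-- Option 2: disable the pending list so each INSERT updates the main index directly
ALTER INDEX idx_events_data_gin SET (fastupdate = off);

-- Flush whatever is already pending, e.g. from a scheduled maintenance job
SELECT gin_clean_pending_list(&apos;idx_events_data_gin&apos;::regclass);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Neither option is free: a larger pending list means fewer but bigger flushes and slower searches while entries sit in the list, and turning &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fastupdate&lt;/code&gt; off moves the cost into every individual write.&lt;/p&gt;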

&lt;p&gt;&lt;strong&gt;These benchmarks cover one dataset shape and one machine.&lt;/strong&gt; At much larger row counts (hundreds of millions), cache-miss behavior and index bloat will dominate—relative rankings should hold, but absolute numbers will differ. When in doubt, benchmark on your own data before committing to a migration.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For workloads dominated by equality and range filters on a predictable set of JSONB fields, the data is clear: B-tree indexes on typed values – whether via expression indexes or generated columns – outperform GIN on read latency and match or beat it on write throughput. GIN’s strength is flexibility, not speed for known-field access patterns; when you know exactly which fields you’ll filter on, a targeted B-tree was the better choice in every test here.&lt;/p&gt;

&lt;p&gt;If you’re starting from scratch or are willing to migrate a table, generated columns are the most maintainable path. They make your frequently-queried fields easily accessible, eliminate JSONB extraction logic from your application’s query layer, and support composite indexes and range queries naturally. If you need to add indexing to an existing table without a schema change, expression indexes get you most of the way there at a comparable write cost.&lt;/p&gt;

&lt;p&gt;GIN still belongs in your toolkit – but for the right job: ad-hoc containment searches, key-existence checks, and cases where the query patterns genuinely vary by document. For everything else, make your JSONB fields relational.&lt;/p&gt;
</description>
        <pubDate>Mon, 11 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/11/generated_columns_jsonb.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>jsonb</category>
        
        <category>generated</category>
        
        <category>columns</category>
        
        <category>indexing</category>
        
        <category>performance</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Potential Consequences of Using Postgres as a Job Queue</title>
        <description>&lt;p&gt;&lt;em&gt;This post was originally published on the &lt;a href=&quot;https://techcommunity.microsoft.com/blog/adforpostgresql/potential-consequences-of-using-postgres-as-a-job-queue/4514332&quot;&gt;Microsoft Tech Community Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call.  Fewer moving parts, one less system to manage, ACID guarantees on your jobs.  What’s not to love?&lt;/p&gt;

&lt;p&gt;The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect.  When you’ve got thousands of concurrent workers hammering a jobs table with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, things start to behave in ways that aren’t obvious from the application layer.  CPU usage creeps up.  Vacuum sometimes can’t keep up.  And in the wait event stats, you start seeing ominous entries like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:MultiXactMemberSLRU&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:MultiXactOffsetSLRU&lt;/code&gt; stacking up across many backends.&lt;/p&gt;

&lt;p&gt;This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real.  So let’s dig into why this happens, and what the alternatives look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-typical-pattern&quot;&gt;The Typical Pattern&lt;/h2&gt;

&lt;p&gt;When using Postgres as a job queue, the standard approach looks something like this:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;         &lt;span class=&quot;n&quot;&gt;bigserial&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;     &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;payload&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;jsonb&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt;  &lt;span class=&quot;nb&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;timestamptz&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;INDEX&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx_job_queue_status&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Workers grab jobs with:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;processing&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;worker-42&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;locked_at&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;FOR&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SKIP&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LOCKED&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;RETURNING&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then mark them done:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;UPDATE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;completed&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some users may &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; the row entirely.  Either way, the lifecycle is: insert, lock-and-update, update-or-delete.  Repeated thousands of times per second.&lt;/p&gt;

&lt;p&gt;At low concurrency, this works very smoothly.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; means workers don’t block each other waiting for the same row.  Postgres handles the locking, visibility, and ordering.  It’s elegant.&lt;/p&gt;

&lt;p&gt;So where does it break?&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-multixact-slru-problem&quot;&gt;The MultiXact SLRU Problem&lt;/h2&gt;

&lt;p&gt;When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_multixact/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, users might think MultiXacts aren’t involved – after all, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is supposed to avoid contention.  But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip.  If you combine this with any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR SHARE&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR KEY SHARE&lt;/code&gt; locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly.&lt;/p&gt;

&lt;p&gt;The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache.  When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers.  Under high concurrency, this becomes a bottleneck:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type | wait_event
-----------------+-------------------
LWLock          | MultiXactMemberSLRU
LWLock          | MultiXactOffsetSLRU
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You’ll see dozens or hundreds of backends piled up on these waits.  The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk.  Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks.&lt;/p&gt;
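
&lt;p&gt;One way to check for this pile-up on a live system is to group active backends by wait event. This is a generic sketch against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt;; the exact wait event names can vary between Postgres versions:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Count backends per wait event; MultiXact SLRU waits near the top are the warning sign
SELECT wait_event_type, wait_event, count(*) AS backends
  FROM pg_stat_activity
 WHERE wait_event IS NOT NULL
 GROUP BY wait_event_type, wait_event
 ORDER BY backends DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;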

&lt;p&gt;The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed.&lt;/p&gt;
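
&lt;p&gt;To check whether a system is in this state, one rough approach is to count backends by wait event in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt;.  This is only a point-in-time sample, so it is worth running a few times.  The wait event names are the ones shown above and may differ on other Postgres versions:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Count backends per LWLock wait event (point-in-time sample)
SELECT wait_event_type, wait_event, count(*) AS backends
  FROM pg_stat_activity
 WHERE wait_event_type = &apos;LWLock&apos;
 GROUP BY wait_event_type, wait_event
 ORDER BY backends DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;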

&lt;hr /&gt;

&lt;h2 id=&quot;bloat-the-silent-killer&quot;&gt;Bloat: The Silent Killer&lt;/h2&gt;

&lt;p&gt;The other side of this coin is table and index bloat.  Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap.  The old versions stick around until &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VACUUM&lt;/code&gt; cleans them up.&lt;/p&gt;

&lt;p&gt;On a busy job queue table:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dead tuples accumulate faster than autovacuum can clean them.&lt;/strong&gt;  By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared.  The table grows and grows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Index bloat compounds the problem.&lt;/strong&gt;  Every index on the table also accumulates dead entries.  The partial index on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;status = &apos;pending&apos;&lt;/code&gt; gets thrashed especially hard, since rows constantly enter and leave that condition.&lt;/li&gt;
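  &lt;li&gt;&lt;strong&gt;Scans get slower.&lt;/strong&gt;  Sequential scans have more pages to read, and even index scans start doing more I/O because the heap pages are sparsely populated.  Plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VACUUM&lt;/code&gt; marks dead space as reusable for new rows, but it only returns space to the operating system by truncating empty pages at the end of the table; free space in the middle of the table stays allocated.&lt;/li&gt;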
&lt;/ul&gt;

&lt;p&gt;Job queue tables can grow to tens of gigabytes when the actual “live” data is only a few megabytes.  It makes everything slower: scans, vacuum, even &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_dump&lt;/code&gt;.&lt;/p&gt;
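
&lt;p&gt;A quick way to see how far behind vacuum is falling is to compare live and dead tuple counts in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_user_tables&lt;/code&gt;.  This sketch assumes the queue table is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_queue&lt;/code&gt;, as in the examples in this post; the tuple counts are estimates maintained by the statistics system, not exact figures:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Estimated live vs. dead tuples, last autovacuum run, and on-disk size
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_stat_user_tables
 WHERE relname = &apos;job_queue&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;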

&lt;p&gt;You can mitigate this by running vacuum more aggressively (lower &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;, higher &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autovacuum_vacuum_cost_limit&lt;/code&gt;), or by partitioning the table and dropping old partitions.  But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue.&lt;/p&gt;
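
&lt;p&gt;As a sketch of the first mitigation, both settings can be applied to the queue table alone rather than the whole cluster.  The table name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_queue&lt;/code&gt; and the values below are illustrative assumptions, not recommendations; the right numbers depend on your write rate and I/O headroom:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Per-table autovacuum overrides (illustrative values)
ALTER TABLE job_queue SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger at roughly 1% dead rows, plus autovacuum_vacuum_threshold
    autovacuum_vacuum_cost_limit   = 2000   -- allow more vacuum work between cost-based sleeps
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;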

&lt;hr /&gt;

&lt;h2 id=&quot;cpu-and-lock-overhead&quot;&gt;CPU and Lock Overhead&lt;/h2&gt;

&lt;p&gt;Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Every lock/unlock is a full WAL-logged transaction.&lt;/strong&gt;  Grabbing a job writes WAL.  Marking it complete writes WAL.  Deleting it writes WAL.  On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_writer&lt;/code&gt; and checkpoint processes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; still touches rows.&lt;/strong&gt;  The name suggests rows are skipped, but Postgres still has to &lt;em&gt;find&lt;/em&gt; them, check their lock status, and move on.  With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim.  This is wasted CPU.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Snapshot management overhead also becomes an issue.&lt;/strong&gt;  Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself.  You might see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock:ProcArray&lt;/code&gt; waits (reported as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ProcArrayLock&lt;/code&gt; before PostgreSQL 13) alongside the MultiXact ones.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vacuum contention.&lt;/strong&gt;  While vacuum is cleaning up dead tuples, it needs locks too.  On a table under constant write pressure, vacuum can interfere with the workers and vice versa.  I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;better-alternatives&quot;&gt;Better Alternatives&lt;/h2&gt;

&lt;p&gt;So what should you use instead?  It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table.&lt;/p&gt;

&lt;h3 id=&quot;advisory-locks-staying-in-postgres&quot;&gt;Advisory Locks (Staying in Postgres)&lt;/h3&gt;

&lt;p&gt;If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns.  Instead of locking rows, you lock on an abstract numeric key:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;-- Worker tries to acquire a lock on the job ID&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pg_try_advisory_lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;job_queue&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;pending&apos;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;created_at&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples.  They live entirely in shared memory.  The trade-off is that you lose the atomicity of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt;: you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup).&lt;/p&gt;
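
&lt;p&gt;One way to handle the “lock is already held, try the next job” case is to test the lock against a small batch of candidates rather than a single row.  The following is a sketch, not a drop-in implementation: it assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;job_queue.id&lt;/code&gt; is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bigint&lt;/code&gt;, that PostgreSQL 12 or later is available (for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MATERIALIZED&lt;/code&gt;), and the batch size of 10 and the job ID 42 are arbitrary placeholders:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Claim: take the first candidate whose advisory lock is free
WITH candidates AS MATERIALIZED (
    SELECT id
      FROM job_queue
     WHERE status = &apos;pending&apos;
     ORDER BY created_at
     LIMIT 10
)
SELECT id
  FROM candidates
 WHERE pg_try_advisory_lock(id)
 LIMIT 1;

-- After processing the job and updating its status,
-- release the lock from the same session that took it
SELECT pg_advisory_unlock(42);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because the lock is taken after the candidate list is read, a worker should re-check that the job is still pending before processing it.  Oldest-first ordering out of the CTE is also not strictly guaranteed, so treat this as approximate FIFO.&lt;/p&gt;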

&lt;p&gt;This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead.  But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts.&lt;/p&gt;

&lt;h3 id=&quot;pgq-skytools&quot;&gt;pgq (Skytools)&lt;/h3&gt;

&lt;p&gt;pgq is purpose-built for exactly this problem.  It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls.  Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation.&lt;/p&gt;

&lt;p&gt;The key advantages:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No row-level contention.  Consumers don’t lock individual rows.&lt;/li&gt;
  &lt;li&gt;Built-in batch processing.  Events are consumed in chunks, reducing transaction overhead.&lt;/li&gt;
  &lt;li&gt;Efficient cleanup.  Old events are rotated out rather than vacuumed row-by-row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.).  But for teams already deep in the Postgres ecosystem, it’s a battle-tested option.&lt;/p&gt;

&lt;h3 id=&quot;pgque&quot;&gt;PgQue&lt;/h3&gt;

&lt;p&gt;Coincidentally, during the writing of this post, &lt;a href=&quot;https://github.com/NikolayS/pgque&quot;&gt;Nikolay Samokhvalov has built PgQue&lt;/a&gt;, which is a derivative of pgq.  Like pgq, it sits inside Postgres, but ships as a single SQL file – no C extension and no external daemon – making it deployable on managed services like RDS, Aurora, Cloud SQL, AlloyDB, Supabase, and Neon.  Producers &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; events into rotating event tables (recycled via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TRUNCATE&lt;/code&gt; instead of row-by-row deletion), and consumers read batches by diffing two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_snapshot&lt;/code&gt; values captured by a periodic ticker – so the hot path contains zero &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt;s, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT ... FOR UPDATE SKIP LOCKED&lt;/code&gt;, and therefore produces no dead tuples on the event tables.  For a deeper dive into the algorithm, see &lt;a href=&quot;https://thebuild.com/blog/2026/05/03/pgque-two-snapshots-and-a-diff/&quot;&gt;Christophe Pettus’s writeup&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;redis&quot;&gt;Redis&lt;/h3&gt;

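&lt;p&gt;For many teams, Redis is the natural choice for job queues.  Using Redis lists (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BRPOPLPUSH&lt;/code&gt;, or its replacement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BLMOVE&lt;/code&gt; on Redis 6.2 and later) or the Streams API, you get:&lt;/p&gt;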

&lt;ul&gt;
  &lt;li&gt;Sub-millisecond dispatch latency.  No disk I/O, no MVCC, no vacuum.&lt;/li&gt;
  &lt;li&gt;Atomic pop operations.  Workers grab jobs without any locking protocol.&lt;/li&gt;
  &lt;li&gt;Simple scaling.  Redis handles thousands of concurrent consumers trivially.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is durability.  Redis can persist to disk, but it’s not ACID.  If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly).  For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well.&lt;/p&gt;

&lt;h3 id=&quot;kafka&quot;&gt;Kafka&lt;/h3&gt;

&lt;p&gt;For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option.  Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability.  It’s the right tool when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You need to process thousands of events per second&lt;/li&gt;
  &lt;li&gt;Multiple consumers need to read the same events&lt;/li&gt;
  &lt;li&gt;You want event replay or audit trails&lt;/li&gt;
  &lt;li&gt;Your architecture is already event-driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination.  But for teams already running Kafka for other reasons, adding a job queue topic is practically free.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;choosing-the-right-tool&quot;&gt;Choosing the Right Tool&lt;/h2&gt;

&lt;p&gt;Here’s a rough decision guide:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Scenario&lt;/th&gt;
      &lt;th&gt;Recommendation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Under 100 concurrent workers, simple jobs&lt;/td&gt;
      &lt;td&gt;Postgres with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SKIP LOCKED&lt;/code&gt; is fine&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Moderate concurrency, want to stay in Postgres&lt;/td&gt;
      &lt;td&gt;Advisory locks or pgq&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;High throughput, low-latency dispatch&lt;/td&gt;
      &lt;td&gt;Redis (Lists or Streams)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Massive scale, distributed, event replay&lt;/td&gt;
      &lt;td&gt;Kafka&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool.  They throw more autovacuum workers at it, increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;, add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Postgres is great, but it can’t be the best tool for every job.  Using it as a job queue is a perfectly valid choice when your scale is modest.  But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution.&lt;/p&gt;

&lt;p&gt;The good news is that you don’t have to rip out everything.  Advisory locks can buy you headroom without adding infrastructure.  Redis can handle dispatch while Postgres keeps owning the data.  And if you’re already using Kafka, a job topic is a natural fit.  Take your pick – there are many queueing options out there!&lt;/p&gt;
</description>
        <pubDate>Mon, 04 May 2026 06:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/05/04/postgres_job_queue.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>scaling</category>
        
        <category>job-queue</category>
        
        <category>multixact</category>
        
        <category>lwlock</category>
        
        <category>advisory-locks</category>
        
        <category>redis</category>
        
        <category>kafka</category>
        
        <category>pgq</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Understanding Bitmap Heap Scans in PostgreSQL</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;When people first start reading PostgreSQL execution plans, they quickly learn a few common scan types: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Seq Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Scan&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Index Only Scan&lt;/code&gt;.  But eventually another one appears that is less obvious: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Heap Scan&lt;/code&gt;, which is almost always accompanied by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bitmap Index Scan&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, it sounds like two scans on the same table – a very inefficient choice?! But bitmap scans are actually one of the planner’s most practical tools for balancing random I/O vs sequential access.  Understanding how they work can make execution plans much easier to interpret, so we’ll dive into that a little bit today.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;the-basic-idea&quot;&gt;The Basic Idea&lt;/h1&gt;

&lt;p&gt;A bitmap scan is a two-step process:&lt;/p&gt;

&lt;p&gt;Step 1: Build a bitmap of matching rows using one or more indexes.&lt;/p&gt;

&lt;p&gt;Step 2: Visit the heap pages containing those rows referenced in the bitmap.&lt;/p&gt;

&lt;p&gt;In an execution plan this usually appears as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan on orders
  -&amp;gt;  Bitmap Index Scan on orders_customer_id_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

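&lt;p&gt;The important part is that the index lookup and heap access are separated.  This separation lets Postgres collect every matching row location first and then read the heap in page order, and it is also why the plan shows two nodes, each with its own cost estimates and actuals.&lt;/p&gt;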

&lt;hr /&gt;

&lt;h1 id=&quot;why-not-just-use-an-index-scan&quot;&gt;Why Not Just Use an Index Scan?&lt;/h1&gt;

&lt;p&gt;With a normal index scan, the query executor does something like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Find a matching entry in the index&lt;/li&gt;
  &lt;li&gt;Jump to the heap page&lt;/li&gt;
  &lt;li&gt;Fetch the row&lt;/li&gt;
  &lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the query returns only a few rows, this works well.  But if the query returns thousands of rows scattered across the table, the database ends up doing many random heap fetches.  That random I/O can become expensive, and reducing it is exactly what a bitmap scan is for.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;how-the-bitmap-is-built&quot;&gt;How the Bitmap Is Built&lt;/h1&gt;

&lt;p&gt;During the Bitmap Index Scan phase, the executor does not immediately fetch rows.  Instead it records which heap pages contain matching rows.  Conceptually, the structure looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 101 -&amp;gt; rows 2, 7
Page 205 -&amp;gt; rows 1, 3, 8
Page 410 -&amp;gt; row 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These page references are stored as a bitmap structure in memory.  Once the bitmap is complete, the executor can visit heap pages in physical order rather than jumping around randomly.  Visiting heap pages in physical order means less random I/O and therefore less latency.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;multiple-indexes-can-be-combined&quot;&gt;Multiple Indexes Can Be Combined&lt;/h1&gt;

&lt;p&gt;One particularly powerful feature is that bitmap scans allow the query planner to combine multiple indexes.  For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WHERE status = &apos;active&apos;
AND created_at &amp;gt;= &apos;2025-01-01&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The plan might look like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bitmap Heap Scan
  -&amp;gt;  BitmapAnd
        -&amp;gt;  Bitmap Index Scan on status_idx
        -&amp;gt;  Bitmap Index Scan on created_at_idx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each index scan produces a bitmap, and the executor combines them using logical operations, which appear in the plan as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapAnd&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapOr&lt;/code&gt; nodes.  This allows Postgres to efficiently use multiple indexes even when a single composite index does not exist.&lt;/p&gt;
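
&lt;p&gt;To try this on a test system, something like the following should work.  The table name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orders&lt;/code&gt; is an assumption carried over from the earlier example; the index names match the plan above.  Whether the planner actually picks a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitmapAnd&lt;/code&gt; depends on the table’s statistics and data distribution, so the plan you get may differ:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Two independent single-column indexes (no composite index)
CREATE INDEX status_idx     ON orders (status);
CREATE INDEX created_at_idx ON orders (created_at);

-- Ask the planner how it would combine them
EXPLAIN
SELECT *
  FROM orders
 WHERE status = &apos;active&apos;
   AND created_at &amp;gt;= &apos;2025-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;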

&lt;hr /&gt;

&lt;h1 id=&quot;when-does-the-planner-chooses-bitmap-scans&quot;&gt;When Does the Planner Choose Bitmap Scans?&lt;/h1&gt;

&lt;p&gt;The planner usually prefers bitmap scans in situations where the query returns more rows than a typical index scan, but not enough rows to justify a full sequential scan.  In other words, bitmap scans often appear in the middle selectivity range.&lt;/p&gt;

&lt;p&gt;Very roughly:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Selectivity&lt;/th&gt;
      &lt;th&gt;Likely Plan&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Very small&lt;/td&gt;
      &lt;td&gt;Index Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Bitmap Heap Scan&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Very large&lt;/td&gt;
      &lt;td&gt;Seq Scan&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is not a strict rule, but it helps explain the planner’s reasoning.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;pros-and-cons&quot;&gt;Pros and Cons&lt;/h1&gt;

&lt;p&gt;As with everything in databases, there’s no free lunch.  Here are some advantages and disadvantages of bitmap scans:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Advantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Reduced Random I/O: By grouping heap page accesses, bitmap scans avoid excessive random disk reads.&lt;/li&gt;
      &lt;li&gt;Ability to Combine Indexes: Bitmap operations allow the query planner to use multiple independent indexes efficiently.&lt;/li&gt;
      &lt;li&gt;Better Performance for Medium Selectivity: Queries returning thousands of rows often benefit from bitmap access patterns.&lt;/li&gt;
      &lt;li&gt;Predictable Heap Access: Because heap pages are visited in order, caching behavior tends to improve.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Disadvantages of Bitmap Heap Scans
    &lt;ul&gt;
      &lt;li&gt;Memory Usage: The bitmap structure is stored in memory, bounded by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;.  If the bitmap would exceed that limit, the query executor switches to a lossy bitmap, where only page-level information is stored for some pages.  This can cause additional filtering work later.&lt;/li&gt;
      &lt;li&gt;Two-Phase Execution: Because the bitmap must be built before heap access begins, the query cannot stream rows immediately.  This can increase latency for queries expecting early rows.&lt;/li&gt;
      &lt;li&gt;Extra CPU Work: Maintaining and combining bitmap structures adds overhead compared to simple index scans.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;lossy-bitmaps&quot;&gt;Lossy Bitmaps&lt;/h1&gt;

&lt;p&gt;When memory limits are reached, the query executor may degrade the bitmap representation.  Instead of tracking individual tuple offsets, it only records:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Page 205 -&amp;gt; possible matches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

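&lt;p&gt;During the heap scan, the executor must then recheck all rows on that page against the original condition.  Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Recheck Cond&lt;/code&gt; appears on every Bitmap Heap Scan node, whether or not the bitmap is lossy, so it isn’t a signal by itself.  To tell whether the bitmap actually became lossy, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN (ANALYZE)&lt;/code&gt; and look for a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lossy=&lt;/code&gt; count on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Heap Blocks&lt;/code&gt; line.  While still correct, a lossy bitmap can reduce efficiency.&lt;/p&gt;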
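
&lt;p&gt;One way to check for lossy pages is to run the query under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXPLAIN (ANALYZE)&lt;/code&gt; and read the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Heap Blocks&lt;/code&gt; line of the Bitmap Heap Scan node.  If lossy pages show up, raising &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; for the session is a reasonable experiment.  The table, the column, and the 64MB value below are placeholders, and the planner may not choose a bitmap scan for this query on your data:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- A &quot;lossy=&quot; count on the &quot;Heap Blocks&quot; line means some pages were tracked page-level only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- Give the bitmap more room for this session only, then re-run the EXPLAIN
SET work_mem = &apos;64MB&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;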

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;Bitmap heap scans are one of the planner’s most practical optimization tools, as they allow the database to reduce random I/O, combine multiple indexes, and handle medium-sized result sets efficiently.&lt;/p&gt;

&lt;p&gt;While they may look complicated at first, the core idea is simple: Find matching rows first, then fetch heap pages efficiently.  What a great concept!&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/27/bitmap_heap_scan.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>query-planner</category>
        
        <category>indexing</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>The Postgres Performance Triangle</title>
        <description>&lt;p&gt;Everyone who’s gone at least knee-deep in  photography knows there’s this idea of the &lt;em&gt;exposure triangle&lt;/em&gt;: aperture, shutter speed, and ISO. Depending on what you’re going for artistically, you adjust the three parameters, knowing that there are trade-offs in doing so.  After working on a few cases, and presenting solutions to customers, I’ve started to think about Postgres performance tuning in a similar way – there are basic parameters that can be tuned, and there are trade-offs for the choices DBAs make:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Memory Allocation&lt;/li&gt;
  &lt;li&gt;Disk I/O&lt;/li&gt;
  &lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these (in broad strokes) affects throughput – how much work your system gets done.&lt;/p&gt;

&lt;p&gt;Caveat: I know that in the academic sense, “throughput” doesn’t quite capture the balance of these concepts, but please bear with me!&lt;/p&gt;

&lt;p&gt;Let’s talk about how each of these three work together with the whole system, and what the trade-offs look like.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;memory-allocation&quot;&gt;Memory Allocation&lt;/h2&gt;

&lt;p&gt;When you increase memory allocation in Postgres, whether it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shared_buffers&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, things tend to feel smoother.  Most notably, queries spill to disk less often, sorts and joins stay in memory, cache hit rates improve.  But there’s a trade-off that’s easy to miss at first, especially with these two parameters.  A single complex query can consume multiple chunks of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; (see &lt;a href=&quot;https://mydbanotebook.org/posts/work_mem-its-a-trap/&quot;&gt;Laetitia’s excellent post about it&lt;/a&gt;). Multiply that across concurrent queries, and you begin to see the OS consuming swap space, churning at checkpoints, and even OOM Killer getting invoked.  So while more memory &lt;em&gt;can&lt;/em&gt; make things faster, it also quietly reduces how much concurrency your system can safely handle.&lt;/p&gt;
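
&lt;p&gt;As a back-of-the-envelope illustration of that multiplication, the query below multiplies &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt; and by an assumed number of memory-hungry plan nodes per query.  The factor of 4 is made up for illustration, and the result is a rough worst case rather than a prediction (on newer versions, hash operations can also exceed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hash_mem_multiplier&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Rough ceiling on work_mem usage if every connection ran a multi-node query at once
SELECT pg_size_pretty(
         current_setting(&apos;max_connections&apos;)::bigint
         * pg_size_bytes(current_setting(&apos;work_mem&apos;))
         * 4  -- assumed sort/hash nodes per query; adjust for your workload
       ) AS rough_work_mem_ceiling;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;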

&lt;p&gt;I’d relate this to aperture – you can throw money at some fast glass, but you also get shallower depth of field (in an annoying way).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;disk-io&quot;&gt;Disk I/O&lt;/h2&gt;

&lt;p&gt;Disk is where things go when memory isn’t enough, or when an access pattern requires it.  We see examples of this in sequential scans, random index lookups, and temporary files from sorts or hashes.  Lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt; might increase disk I/O due to sorts spilling to temp files, for example.  We can try to minimize disk I/O by adding indexes, increasing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;work_mem&lt;/code&gt;, or simply rewriting queries.&lt;/p&gt;
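
&lt;p&gt;One way to see how much spilling is happening is to look at the cumulative temp file counters in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_database&lt;/code&gt;.  These counters accumulate since the last statistics reset, so comparing two samples taken some time apart is more telling than a single reading:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Temp files created and bytes written by queries in the current database
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
  FROM pg_stat_database
 WHERE datname = current_database();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;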

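&lt;p&gt;Another way we can try to affect disk I/O is to tinker with the planner cost settings (such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;random_page_cost&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq_page_cost&lt;/code&gt;), to encourage the query planner to choose one scan method over another.  In any case, our attempts to balance disk I/O and memory usage can be pretty straightforward at first, but could become complicated at scale.  That’s where partitioning and read-only replicas come in, but I’m beginning to digress…&lt;/p&gt;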

&lt;p&gt;Indexes, in particular, are where things start to get interesting.  Adding an index can feel like an easy win, as it leads to fewer rows scanned and less CPU work per query, along with less disk activity, but there are trade-offs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; will update every relevant index&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; can potentially rewrite index entries&lt;/li&gt;
  &lt;li&gt;Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DELETE&lt;/code&gt; leaves behind cleanup work (vacuum)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, we also see other effects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Indexes get large&lt;/li&gt;
  &lt;li&gt;Cache hit rates drop (because there’s more to cache)&lt;/li&gt;
  &lt;li&gt;Random I/O increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So an index that helps one query might quietly make others worse, or make writes more expensive.&lt;/p&gt;
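
&lt;p&gt;A common starting point for finding indexes that cost more than they return is to list the ones that have never been scanned.  Treat this as a sketch: the counters only cover the period since the last statistics reset and only this server (replicas keep their own), and an index that backs a primary key or unique constraint is still doing its job even with zero scans:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Indexes with no recorded scans, largest first
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
 WHERE idx_scan = 0
 ORDER BY pg_relation_size(indexrelid) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;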

&lt;p&gt;It’s like raising ISO to compensate for low light. You get the shot, but the noise shows up somewhere else.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;concurrency&quot;&gt;Concurrency&lt;/h2&gt;

&lt;p&gt;So far, this has all been somewhat per-query. But things change when you introduce concurrency.  In a high-demand service, the instinct is to increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt; to allow the service to scale up, but in my experience there’s a price to pay for this kind of concurrency.  Some people fail to notice that each connection brings its own memory usage, takes up a spot in Postgres’ internal data structures, and puts the system at risk for increased CPU demand and resource contention.&lt;/p&gt;

&lt;p&gt;In the photography analogy, you can turn down the ISO very low on a bright and sunny day, but that won’t be enough.  Soon, you’ll be closing the aperture and increasing the shutter speed, and then you lose your ability to create the artistic feel that you’re actually trying to go for.  So what do photographers do?  They use an ND filter to limit how much light hits the sensor.&lt;/p&gt;

&lt;p&gt;In Postgres, that “ND filter” is a connection pooler such as &lt;a href=&quot;https://www.pgbouncer.org/&quot;&gt;PgBouncer&lt;/a&gt;.  Instead of letting thousands of connections compete for CPU, you cap active queries, allocate more resources to each actual DB session, and trade a bit of latency for stability.  Sometimes, to keep your throughput, you need some additional accessories.&lt;/p&gt;
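
&lt;p&gt;A quick way to judge whether a pooler would help is to look at how many client connections are doing work at any given moment.  If most of them are idle, a much smaller pool of real sessions is probably enough.  This sketch assumes PostgreSQL 10 or later (for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backend_type&lt;/code&gt; column) and is a point-in-time sample:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Client connections grouped by state (active, idle, idle in transaction, ...)
SELECT state, count(*) AS connections
  FROM pg_stat_activity
 WHERE backend_type = &apos;client backend&apos;
 GROUP BY state
 ORDER BY connections DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;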

&lt;hr /&gt;

&lt;h2 id=&quot;the-art-of-postgres&quot;&gt;The Art of Postgres&lt;/h2&gt;

&lt;p&gt;As a DBA, you can calculate optimal index usage, memory sizing, and expected I/O patterns, but those calculations tend to assume a steady state.  Every DBA knows that real production systems are always changing, due to traffic patterns, scaling, and new features getting rolled out on the application side.  As the organization changes, the work of keeping the database performant depends on the DBA being both a Database Administrator and a Database Artist, working with internal teams to know which indexes to add/drop, how much concurrency to allow, and how to allocate memory without running out of it.&lt;/p&gt;

&lt;p&gt;Instead of asking, “What’s the optimal configuration?” it might be more useful to ask these questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Where is my system currently paying the cost—memory, disk, or CPU?&lt;/li&gt;
  &lt;li&gt;If I relieve pressure here, where does it move?&lt;/li&gt;
  &lt;li&gt;How much can we tolerate that new pressure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Costs don’t disappear – they just shift – and it’s the DBA’s job to help decision-makers decide where to shift them.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;There’s more to photography than exposure – there’s composition, color-correction, external lighting, and so much more.  In the same way, this discussion has just been one part of database administration.  There’s so much more to go over, in terms of creating a robust and scalable database.  I wanted to highlight this topic because I do find that some users tend to approach database architecture without considering all the trade-offs.  We can definitely get the database to perform well, but there’s no one-size-fits-all solution for every situation.  It takes thought, planning, testing, and discussion with stakeholders to come up with a good solution.&lt;/p&gt;
</description>
        <pubDate>Mon, 20 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/20/throughput_triangle.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/20/throughput_triangle.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>Understanding PostgreSQL Wait Events</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;One of the most useful debugging tools in modern PostgreSQL is the wait event system.  When a query slows down or a database becomes CPU bound, a natural question is: “What are sessions actually waiting on?” Postgres exposes this information through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; view via two columns:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type
wait_event
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These fields reveal what the backend process is blocked on at a given moment.  Among the different wait types, one category tends to cause confusion:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;LWLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you’ve ever seen dashboards full of LWLock waits, you’re not alone in wondering what they mean and whether they’re a problem.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;where-wait-events-appear&quot;&gt;Where Wait Events Appear&lt;/h1&gt;

&lt;p&gt;The easiest way to see wait events is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT pid,
wait_event_type,
wait_event,
state,
query
FROM pg_stat_activity
WHERE state != &apos;idle&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Example output might look like:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;pid&lt;/th&gt;
      &lt;th&gt;wait_event_type&lt;/th&gt;
      &lt;th&gt;wait_event&lt;/th&gt;
      &lt;th&gt;state&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1234&lt;/td&gt;
      &lt;td&gt;Lock&lt;/td&gt;
      &lt;td&gt;transactionid&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5678&lt;/td&gt;
      &lt;td&gt;LWLock&lt;/td&gt;
      &lt;td&gt;buffer_content&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9012&lt;/td&gt;
      &lt;td&gt;IO&lt;/td&gt;
      &lt;td&gt;DataFileRead&lt;/td&gt;
      &lt;td&gt;active&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Each category represents a different kind of wait.  Common types include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LWLock&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IO&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Client&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IPC&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Activity&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among these, LWLock waits often appear during performance incidents.&lt;/p&gt;
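
&lt;p&gt;A quick way to see which categories dominate at a given moment is to group the non-idle sessions by type.  This is only a point-in-time sketch; sessions that aren’t waiting on anything show a NULL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait_event_type&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Sessions per wait category; a NULL category means &quot;not waiting&quot;
SELECT wait_event_type, count(*) AS sessions
FROM pg_stat_activity
WHERE state != &apos;idle&apos;
GROUP BY wait_event_type
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;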

&lt;hr /&gt;

&lt;h1 id=&quot;what-is-an-lwlock&quot;&gt;What Is an LWLock?&lt;/h1&gt;

&lt;p&gt;LWLock stands for &lt;strong&gt;Lightweight Lock&lt;/strong&gt;.  These are &lt;strong&gt;internal&lt;/strong&gt; Postgres synchronization primitives used to coordinate access to shared memory structures.  Note that they are &lt;strong&gt;NOT&lt;/strong&gt; related to lock contention on tables, or deadlocking when performing DML.  LWLocks protect important internal structures such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;shared buffers&lt;/li&gt;
  &lt;li&gt;WAL buffers&lt;/li&gt;
  &lt;li&gt;lock tables&lt;/li&gt;
  &lt;li&gt;SLRU caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because these structures are accessed by many processes simultaneously, Postgres must coordinate access carefully.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;why-lwlock-waits-appear&quot;&gt;Why LWLock Waits Appear&lt;/h1&gt;

&lt;p&gt;In healthy systems, LWLocks are acquired and released very quickly.  However, they can become visible when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;contention increases&lt;/li&gt;
  &lt;li&gt;many sessions access the same internal structure&lt;/li&gt;
  &lt;li&gt;CPU saturation occurs&lt;/li&gt;
  &lt;li&gt;shared memory structures become hot spots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing LWLock waits in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_activity&lt;/code&gt; doesn’t automatically mean something is wrong.  But persistent LWLock contention usually indicates a scaling issue somewhere in the workload.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;common-lwlock-wait-events&quot;&gt;Common LWLock Wait Events&lt;/h1&gt;

&lt;p&gt;A few LWLock events appear frequently during real-world incidents.&lt;/p&gt;

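&lt;p&gt;Understanding them can help narrow down the root cause.  Note that the names below are the older spellings; PostgreSQL 13 renamed many wait events (for example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buffer_content&lt;/code&gt; became &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BufferContent&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WALWriteLock&lt;/code&gt; became &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WALWrite&lt;/code&gt;), so match against whatever your version reports.&lt;/p&gt;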

&lt;h3 id=&quot;buffer_content&quot;&gt;buffer_content&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event_type = LWLock
wait_event = buffer_content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when Postgres processes compete to access a shared buffer page.&lt;/p&gt;

&lt;p&gt;Typical causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent updates to the same rows&lt;/li&gt;
  &lt;li&gt;heavy index modifications&lt;/li&gt;
  &lt;li&gt;hot tables receiving high write volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these locks, try these troubleshooting steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;check for write-heavy workloads&lt;/li&gt;
  &lt;li&gt;inspect tables experiencing frequent updates&lt;/li&gt;
  &lt;li&gt;look for missing indexes causing excessive page access&lt;/li&gt;
&lt;/ul&gt;
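
&lt;p&gt;For the “tables experiencing frequent updates” step, the cumulative counters in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_user_tables&lt;/code&gt; are a reasonable starting point.  A sketch (the counters accumulate since the last statistics reset, so compare two readings if you want a rate):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Ten most-updated tables in the current database
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       n_tup_ins,
       n_tup_del
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A low ratio of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n_tup_hot_upd&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n_tup_upd&lt;/code&gt; suggests that most updates are also writing new index entries.&lt;/p&gt;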

&lt;h3 id=&quot;walwritelock&quot;&gt;WALWriteLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALWriteLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This indicates contention while writing to the Write-Ahead Log (WAL).&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;high write throughput&lt;/li&gt;
  &lt;li&gt;large batch inserts or updates&lt;/li&gt;
  &lt;li&gt;slow storage affecting WAL flushes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible diagnostic steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;examine WAL generation rate&lt;/li&gt;
  &lt;li&gt;check disk latency&lt;/li&gt;
  &lt;li&gt;review bulk write workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some systems this appears as commit latency spikes.&lt;/p&gt;
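
&lt;p&gt;One rough way to examine the WAL generation rate is to compare two WAL positions a few seconds apart.  This sketch assumes it is run in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql&lt;/code&gt; (for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\gset&lt;/code&gt;) on the primary, on Postgres 10 or later; the 10-second window is an arbitrary choice:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Remember the current WAL position in a psql variable
SELECT pg_current_wal_lsn() AS start_lsn \gset

-- Wait for the sampling window
SELECT pg_sleep(10);

-- Bytes of WAL written during the window, divided by its length in seconds
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :&apos;start_lsn&apos;) / 10
       ) AS wal_per_second;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;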

&lt;h3 id=&quot;walinsertlock&quot;&gt;WALInsertLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = WALInsertLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This occurs when multiple sessions attempt to insert WAL records simultaneously.  It usually appears when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;many concurrent transactions are committing&lt;/li&gt;
  &lt;li&gt;high insert/update workloads exist&lt;/li&gt;
  &lt;li&gt;transaction throughput is extremely high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Postgres versions over time have reduced contention here by increasing WAL insertion slots.  Still, very high write concurrency can trigger it.&lt;/p&gt;

&lt;h3 id=&quot;procarraylock&quot;&gt;ProcArrayLock&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = ProcArrayLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This lock protects Postgres’ internal structure tracking active transactions.  It is often associated with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;snapshot creation&lt;/li&gt;
  &lt;li&gt;visibility checks&lt;/li&gt;
  &lt;li&gt;large numbers of active connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible causes include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;very high connection counts&lt;/li&gt;
  &lt;li&gt;long-running transactions&lt;/li&gt;
  &lt;li&gt;frequent snapshot creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection pooling (and lowering &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_connections&lt;/code&gt;) often helps reduce this type of contention.&lt;/p&gt;
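
&lt;p&gt;To check for the long-running transactions mentioned above, a sketch like this lists the oldest open transactions (the 60-character truncation is just for readability):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Oldest open transactions first
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 60) AS query_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;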

&lt;h3 id=&quot;clogcontrollock--slru-locks&quot;&gt;CLogControlLock / SLRU Locks&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wait_event = CLogControlLock
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These involve the SLRU (Simple Least Recently Used) subsystem, which tracks transaction commit status.  Heavy contention here can appear when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;extremely high transaction rates exist&lt;/li&gt;
  &lt;li&gt;frequent visibility checks occur&lt;/li&gt;
  &lt;li&gt;many short transactions are executed&lt;/li&gt;
&lt;/ul&gt;
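
&lt;p&gt;On PostgreSQL 13 and later, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_slru&lt;/code&gt; view exposes per-cache counters that can hint at which SLRU is under pressure.  A minimal sketch (the cache names in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; column vary between versions):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- SLRU caches with the most disk reads first
SELECT name, blks_hit, blks_read, blks_written
FROM pg_stat_slru
ORDER BY blks_read DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A cache whose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blks_read&lt;/code&gt; is large relative to its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blks_hit&lt;/code&gt; is being read from disk often, which is worth a closer look.&lt;/p&gt;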

&lt;hr /&gt;

&lt;h1 id=&quot;diagnosing-lwlock-problems&quot;&gt;Diagnosing LWLock Problems&lt;/h1&gt;

&lt;p&gt;When investigating LWLock waits, a few steps usually help.&lt;/p&gt;

&lt;h3 id=&quot;1-look-for-dominant-wait-events&quot;&gt;1. Look for dominant wait events&lt;/h3&gt;

&lt;p&gt;Start by identifying which LWLock appears most frequently:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = &apos;LWLock&apos;
GROUP BY wait_event
ORDER BY count(*) DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
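
&lt;p&gt;A single snapshot can mislead, so it may help to sample the same counts over time.  The sketch below uses a hypothetical table, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lwlock_samples&lt;/code&gt;, which is not part of Postgres; run the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INSERT&lt;/code&gt; on whatever schedule suits you (for example with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\watch&lt;/code&gt; in psql):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical sampling table (the name and columns are illustrative)
CREATE TABLE IF NOT EXISTS lwlock_samples (
    sampled_at timestamptz NOT NULL DEFAULT now(),
    wait_event text        NOT NULL,
    sessions   bigint      NOT NULL
);

-- Take one sample; repeat this statement on an interval
INSERT INTO lwlock_samples (wait_event, sessions)
SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = &apos;LWLock&apos;
GROUP BY wait_event;

-- Which LWLocks show up most often across all samples?
SELECT wait_event,
       count(*)      AS samples_seen,
       sum(sessions) AS total_waiters
FROM lwlock_samples
GROUP BY wait_event
ORDER BY total_waiters DESC;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;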

&lt;h3 id=&quot;2-examine-workload-characteristics&quot;&gt;2. Examine workload characteristics&lt;/h3&gt;

&lt;p&gt;Questions to ask:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Are there many concurrent writers?&lt;/li&gt;
  &lt;li&gt;Is a single table receiving heavy updates?&lt;/li&gt;
  &lt;li&gt;Are there extremely high transaction rates?&lt;/li&gt;
&lt;/ul&gt;
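
&lt;p&gt;For the transaction-rate question, the cumulative counters in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_database&lt;/code&gt; give a rough answer.  A sketch: take two readings a known number of seconds apart and subtract them to get a rate.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Cumulative transaction and row-write counters for the current database
SELECT datname,
       xact_commit + xact_rollback               AS total_xacts,
       tup_inserted + tup_updated + tup_deleted  AS rows_written
FROM pg_stat_database
WHERE datname = current_database();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;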

&lt;h3 id=&quot;3-check-connection-counts&quot;&gt;3. Check connection counts&lt;/h3&gt;

&lt;p&gt;Large numbers of connections can amplify contention.  Connection pooling often reduces LWLock pressure significantly.&lt;/p&gt;
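
&lt;p&gt;A simple check is to compare the current number of client connections against the configured limit.  A minimal sketch, assuming Postgres 10 or later for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backend_type&lt;/code&gt; column:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Current client connections versus the configured ceiling
SELECT count(*) AS client_connections,
       current_setting(&apos;max_connections&apos;)::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = &apos;client backend&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;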

&lt;h3 id=&quot;4-look-at-query-patterns&quot;&gt;4. Look at query patterns&lt;/h3&gt;

&lt;p&gt;High-frequency queries touching the same rows or pages can create hotspots.&lt;/p&gt;
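
&lt;p&gt;If the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_statements&lt;/code&gt; extension is installed, it can show which statements run most often.  This sketch assumes PostgreSQL 13 or later, where the timing column is named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mean_exec_time&lt;/code&gt; (older versions call it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mean_time&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Most frequently executed statements
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 60)                   AS query_start
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;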

&lt;hr /&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h1&gt;

&lt;p&gt;PostgreSQL’s wait event system provides valuable insight into what the database is doing internally.  LWLocks, in particular, reveal contention inside shared memory structures that are otherwise invisible.  When investigating performance issues, a good rule of thumb is: &lt;em&gt;If many sessions are waiting on the same LWLock, there is usually a workload hotspot somewhere.&lt;/em&gt; Once you know where the contention lives, the path toward fixing it becomes much clearer.&lt;/p&gt;
</description>
        <pubDate>Mon, 13 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/13/wait_events.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/13/wait_events.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>performance</category>
        
        <category>troubleshooting</category>
        
        <category>wait-events</category>
        
        
        <category>postgres</category>
        
      </item>
    
      <item>
        <title>WAL as a Data Distribution Layer</title>
        <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Every so often, I talk to someone working in data analytics who wants access to production data, or at least a snapshot of it.  Sometimes, they tell me about their ETL setup, which takes hours to refresh and can be brittle, with a lot of monitoring around it.  For them, it works, but it sometimes gets me wondering if they need all that plumbing to get a snapshot of their live dataset.  Back at Turnitin, I set up a way to get people access to production data without having to snapshot nightly, and I thought maybe I should share it with people here.&lt;/p&gt;

&lt;h1 id=&quot;common-implementations-and-their-risks&quot;&gt;Common Implementations and Their Risks&lt;/h1&gt;

&lt;p&gt;Here are the typical solutions we might encounter when giving people a little bit of access to production data:&lt;/p&gt;

&lt;h3 id=&quot;1-query-the-primary&quot;&gt;1. Query the primary&lt;/h3&gt;

&lt;p&gt;This is generally a bad idea, since you don’t want users getting access to the production primary, lest they make a mistake or lock up tables and prevent customers from using your apps.  Even with a read-only user, large data analytics queries could cause unwanted interference that negatively affects your uptime.  This is almost certainly not the way to go.&lt;/p&gt;

&lt;h3 id=&quot;2-query-a-streaming-replica&quot;&gt;2. Query a streaming replica&lt;/h3&gt;

&lt;p&gt;This is better, but doing this is not free.  Long-running queries can create replay lag, vacuum conflicts can cancel queries, and I/O contention can affect the primary upstream.  It’s safer since users are forced to be read-only, but that still carries risk.&lt;/p&gt;

&lt;h3 id=&quot;3-nightly-snapshots--rebuilds&quot;&gt;3. Nightly snapshots / rebuilds&lt;/h3&gt;

&lt;p&gt;Time-based snapshots and rebuilds are the most common way of getting data out to analysts.  ETL queries run at night (or at some other regular interval) and provide the information needed to do the necessary work.  This works, but it is another piece of software to run, and it produces somewhat stale data, so it depends on how much staleness can be tolerated.&lt;/p&gt;

&lt;h1 id=&quot;once-upon-a-time-before-streaming-replication&quot;&gt;Once Upon a Time, Before Streaming Replication&lt;/h1&gt;

&lt;p&gt;If you’ve spent any time in Postgres, you already understand streaming replication.  Primary sends WAL to standby, and standby replays the WAL stream.  All the tutorials talk about using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_basebackup&lt;/code&gt;, setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; and configuring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, many people don’t know that before streaming replication, there was log shipping.  Introduced in v. 8.2, it was the predecessor to what eventually became hot standby/streaming replication in v. 9.0.  Instead of maintaining a live connection between primary and standby, the two clusters are decoupled.  WAL files are shipped (via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scp&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsync&lt;/code&gt; or some other mechanism – maybe even NFS) to the replica, and then replayed there.&lt;/p&gt;

&lt;h1 id=&quot;log-shipping-hits-a-different-point-on-the-tradeoff-curve&quot;&gt;Log Shipping Hits a Different Point on the Tradeoff Curve&lt;/h1&gt;

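&lt;p&gt;With WAL log shipping, the standby never connects to the primary and the primary never tracks the standby, so the standby cannot exert backpressure on the primary (i.e. no replication slots retaining WAL, no need for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby_feedback&lt;/code&gt;).  Queries on the standby can still conflict with WAL replay; how long replay waits for them is governed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_standby_archive_delay&lt;/code&gt;.&lt;/p&gt;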

&lt;p&gt;While you may not get up-to-the-millisecond replication lag, you get pretty close to real-time data.  In some cases, this lag may even be desirable – you could throttle the playback so you are an hour behind, even giving yourself some time to look at a table’s state before someone fat-fingers an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE&lt;/code&gt; without a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;
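
&lt;p&gt;One way to throttle playback is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;recovery_min_apply_delay&lt;/code&gt; setting.  A minimal sketch, assuming Postgres 12 or later (where recovery settings live in the regular configuration) and run on the standby; the one-hour value is just the example from above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- On the standby: hold replay roughly one hour behind the primary
ALTER SYSTEM SET recovery_min_apply_delay = &apos;1h&apos;;
SELECT pg_reload_conf();

-- Rough measure of how far behind the standby currently is
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Keep in mind that the second query overstates the delay when the primary is idle, since it measures time since the last replayed transaction.&lt;/p&gt;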

&lt;h1 id=&quot;a-subtle-but-important-detail&quot;&gt;A Subtle but Important Detail&lt;/h1&gt;

&lt;p&gt;Postgres doesn’t force you to choose one mechanism over the other.  A standby can use both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; AND &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt;.  The way it works is that it will toggle between the two, depending on availability.  If the primary is disconnected for some reason, it will switch over to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; until it cannot find the WAL file it wants, and then it flips back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt; again.&lt;/p&gt;
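
&lt;p&gt;A standby configured with both sources might look like the sketch below.  The host name, user, and archive path are hypothetical, and this assumes Postgres 14 or later, where both settings can be changed with a reload (older versions need a restart for one or both):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- On the standby; host, user, and archive path are placeholders
ALTER SYSTEM SET primary_conninfo = &apos;host=primary.example.com port=5432 user=replicator&apos;;
ALTER SYSTEM SET restore_command = &apos;cp /mnt/wal_archive/%f %p&apos;;
SELECT pg_reload_conf();

-- Returns a row while the standby is streaming, and no rows
-- while it is restoring from the archive instead
SELECT status, sender_host FROM pg_stat_wal_receiver;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;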

&lt;p&gt;Log shipping isn’t just a legacy mode; it’s part of the replication continuum.  It’s like incremental backup, except that your backup is always fully loaded and can be queried.  For these reasons, keeping your WAL files around is a very good practice.&lt;/p&gt;

&lt;h1 id=&quot;architecture-pattern-introduce-a-wal-hub&quot;&gt;Architecture Pattern: Introduce a WAL Hub&lt;/h1&gt;

&lt;p&gt;Instead of thinking in terms of replication happening between a primary and a number of standbys, it may be useful to think about a central WAL archive host, even if it’s an S3 bucket, so that many consumers can access data at any point in time.&lt;/p&gt;

&lt;p&gt;These consumers can be analytics standbys, QA environments, or ad-hoc data sandboxes – or whatever else you want to give a copy of near-realtime production data to, without risking replication backpressure or compromising network security.&lt;/p&gt;

&lt;h1 id=&quot;a-hands-on-approach&quot;&gt;A Hands-On Approach&lt;/h1&gt;

&lt;p&gt;I created a &lt;a href=&quot;https://github.com/richyen/toolbox/tree/master/demos/wal_shipping&quot;&gt;simple demo&lt;/a&gt; that sets this up end-to-end.  It sets up 3 containers in Docker – a primary, standby, and a mock WAL archive location.  &lt;em&gt;Disclaimer:&lt;/em&gt; yes, I used AI to help me generate the scripts, but it’s exactly how I had it set up at Turnitin (yes, we used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsyncd&lt;/code&gt; back in 2009 – there might be better stuff out there these days).&lt;/p&gt;

&lt;p&gt;Some key configuration params for clarity:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; pushes WAL files to a directory&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restore_command&lt;/code&gt; pulls WAL files on the standby&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;standby.signal&lt;/code&gt; enables continuous recovery&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hot_standby=on&lt;/code&gt; allows read-only queries&lt;/li&gt;
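  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_mode=on&lt;/code&gt; enables archiving on the primary (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_command&lt;/code&gt; is ignored without it); it is not strictly necessary on the standby&lt;/li&gt;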
&lt;/ul&gt;
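
&lt;p&gt;On the primary, the archiving side of this could be sketched as follows.  The rsync destination is a hypothetical archive host and is not taken from the demo; a production command should also refuse to overwrite a file that already exists in the archive:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- On the primary; the rsync URL is a placeholder
ALTER SYSTEM SET archive_mode = &apos;on&apos;;   -- takes effect only after a restart
ALTER SYSTEM SET archive_command = &apos;rsync -a %p rsync://wal-hub.example.com/wal_archive/%f&apos;;

-- After the restart, confirm that segments are being archived
SELECT archived_count, last_archived_wal, last_archived_time, failed_count
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;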

&lt;p&gt;Note some characteristics of the standby in this example:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_conninfo&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;No replication slots used&lt;/li&gt;
  &lt;li&gt;No entries in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pg_stat_replication&lt;/code&gt; show up on the primary.&lt;/li&gt;
&lt;/ul&gt;
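
&lt;p&gt;These characteristics can be checked from SQL.  A minimal sketch, assuming Postgres 10 or later for the function names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- On the standby: in recovery and replaying, with no WAL receiver process
SELECT pg_is_in_recovery()      AS in_recovery,
       pg_last_wal_replay_lsn() AS replayed_lsn,
       (SELECT count(*) FROM pg_stat_wal_receiver) AS wal_receivers;

-- On the primary: no walsender entries for the log-shipping standby
SELECT count(*) AS streaming_standbys
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;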

&lt;p&gt;If you want, you can set up traditional streaming replication in parallel to this log shipping standby – it doesn’t interfere with the log shipping so long as WAL files get to the archive location.&lt;/p&gt;

&lt;h1 id=&quot;why-this-pattern-deserves-more-attention&quot;&gt;Why This Pattern Deserves More Attention&lt;/h1&gt;

&lt;p&gt;Most teams default to streaming replication because it’s the most visible feature.&lt;/p&gt;

&lt;p&gt;But Postgres replication isn’t one thing; it’s a set of primitives:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;WAL generation&lt;/li&gt;
  &lt;li&gt;WAL transport&lt;/li&gt;
  &lt;li&gt;WAL replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming replication couples all three, and log shipping lets you separate them.  And once you do that, new architectures open up!&lt;/p&gt;
</description>
        <pubDate>Mon, 06 Apr 2026 08:00:00 +0000</pubDate>
        <link>http://richyen.com/postgres/2026/04/06/wal_archiving.html</link>
        <guid isPermaLink="true">http://richyen.com/postgres/2026/04/06/wal_archiving.html</guid>
        
        <category>PostgreSQL</category>
        
        <category>postgres</category>
        
        <category>replication</category>
        
        <category>archiving</category>
        
        <category>log_shipping</category>
        
        
        <category>postgres</category>
        
      </item>
    
  </channel>
</rss>
