The Curious Case of Split WAL Files
Introduction
I’ve come across this scenario maybe twice in the past eight years of using streaming replication in Postgres: replication on a standby refuses to proceed because a WAL file has already been removed from the primary server. The standby falls back to Plan B and tries to fetch the relevant WAL file using restore_command (assuming that archiving is set up and working properly), but upon replay of that fetched file, we hear another croak: “No such file or directory.” Huh? The file is there! ls shows it sitting in the archive. On and on it goes, never actually making any progress with replication. What’s up with that? Here’s a snippet from the log:
What’s going on?
I haven’t personally been able to reproduce this exact scenario, but we’ve discovered that it happens when a WAL record is split across two WAL files. Because some WAL records span two files, Postgres’s archive replay doesn’t know up front that it needs both files to successfully replay the record and continue with streaming replication. In the example above, an additional clue comes from the pg_controldata output, which in this case contains something like this: Minimum recovery ending location: 51/4DFFED10. So when the standby server starts up (after a maintenance shutdown or other similar scenario), it clearly needs file 0000000A000000510000004D to proceed, so it attempts to fetch its contents from the archive and replay it. It happily restores all the relevant WAL files until it reaches the end of the 51/4D file, at which point it can no longer find more WAL files to replay. The toggling between archive recovery and streaming replication then kicks in: the standby starts up a walreceiver process, which connects to the primary and has it spawn a walsender.
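For concreteness, here is roughly how one could confirm those two details by hand. The LSN and file name are the ones from the example above, the data directory is whatever $PGDATA points at, and on PostgreSQL 10 or later the function is pg_walfile_name() rather than pg_xlogfile_name():

    # On the standby: how far must replay get before the cluster is consistent?
    pg_controldata "$PGDATA" | grep 'Minimum recovery ending location'
    #   Minimum recovery ending location: 51/4DFFED10

    # On the primary (these functions can't be called during recovery):
    # which WAL segment does that LSN fall into?
    psql -Atc "SELECT pg_xlogfile_name('51/4DFFED10');"
    #   0000000A000000510000004D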
When the walreceiver starts up, it inspects the landscape and sees that it needs to start streaming at 51/4DFFED10, so it asks the walsender to fetch the contents of 0000000A000000510000004D and send them over. However, it’s been a long time (maybe lots of traffic, or maybe a wal_keep_segments misconfiguration), and that 0000000A000000510000004D file is gone from the primary. Neither the walsender nor the walreceiver knows that the WAL record at LSN 51/4DFFED10 isn’t contained entirely within 0000000A000000510000004D but spills over into 0000000A000000510000004E; and since 0000000A000000510000004D is already gone, there is nothing left to scan to discover that 0000000A000000510000004E is needed.
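One way to see the split for yourself is to point pg_xlogdump (pg_waldump on PostgreSQL 10 and later) at the archive and ask it to decode the record that begins at that LSN; the archive path below is hypothetical:

    cd /path/to/wal_archive   # hypothetical archive directory
    # Decode the single record starting at the minimum recovery ending location.
    # If that record spills over into the next 16MB segment, the tool also needs
    # 0000000A000000510000004E and will error out when that file is absent,
    # which is essentially the wall the standby keeps hitting.
    pg_xlogdump -s 51/4DFFED10 -n 1 0000000A000000510000004D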
A possible solution
In one of the cases I worked on, the newer file (0000000A000000510000004E in the above example) had not been filled up yet. It turned out to be a low-traffic development environment, and the customer simply needed to issue a pg_switch_xlog() against the primary server.
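For the record, that manual fix is a one-liner against the primary (pg_switch_xlog() because this was a pre-10 cluster; on PostgreSQL 10 and later it’s pg_switch_wal()), with the connection details below being placeholders:

    # Force the primary to close out the current WAL segment so it gets archived
    # and the standby can finally replay past the split record.
    psql -h primary.example.com -U postgres -c "SELECT pg_switch_xlog();"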
Of course, this isn’t a very reliable solution, since it requires human intervention. In the end, the more reliable solution was to use a replication slot, so that Postgres always holds on to the WAL files its standby still needs and doesn’t recycle or remove them prematurely. While replication slots have their pitfalls, when used properly they ensure reliable replication with minimal configuration tweaks.
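A minimal sketch of the slot-based setup, assuming a slot named standby1 and a pre-10 standby configured via recovery.conf (on PostgreSQL 12 and later, primary_slot_name lives in postgresql.conf instead):

    # On the primary: create a physical replication slot for the standby.
    psql -c "SELECT pg_create_physical_replication_slot('standby1');"

    # On the standby: tell the walreceiver to stream using that slot.
    echo "primary_slot_name = 'standby1'" >> "$PGDATA/recovery.conf"

    # Worth monitoring: an abandoned slot retains WAL forever and can fill the disk.
    psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"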
Special thanks to Andres Freund and Kuntal Ghosh for helping with the analysis and determining the solution.