When Your Backup Server Shares a Disk With the Thing It Backs Up

Backups feel done once the jobs are green. But “the jobs succeed” and “I can actually recover from a failure” are different claims, and the gap between them is usually a shared dependency you stopped seeing. Here’s one I found in my own lab, and how I split it apart.

The setup that looked fine

Proxmox Backup Server runs as a VM. It has two datastores, both NFS mounts:

  • pbs01-backups — the daily chain, NFS from a NAS at 192.0.2.1
  • pbs01-archives — long-term monthly retention, NFS from a different NAS at 198.51.100.1

Two NAS boxes, two datastores. Looks nicely separated. Daily job [REDACTED] ran green every night. Done, right?

The dependency I’d stopped seeing

The PBS VM’s own boot disk also lives on 192.0.2.1 — the same NAS that backs pbs01-backups.

So 192.0.2.1 is a single point of failure for two things at once: it stores the daily backups, and it’s where the backup server itself boots from. Lose that one box and you don’t just lose the daily chain — you lose the server you’d use to restore from any chain. The monthly archives on 198.51.100.1 survive, but the machine that knows how to read them is down until you rebuild it.

That’s the trap with shared storage: the failure domain isn’t “a datastore,” it’s “everything that touches this disk.” A green backup job tells you writes are succeeding. It tells you nothing about whether one hardware failure takes out both your data and your recovery path.
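One way to make that visible is to write the inventory down and group it by storage host. Here is a minimal sketch in POSIX shell and awk. The `inventory` and `shared_hosts` function names and the tab-separated format are made up for illustration, and the rows are just the lab described above.

```shell
#!/bin/sh
# Hypothetical inventory: component<TAB>storage host, one row per thing
# that lives on a NAS. The rows mirror the lab in this post.
inventory() {
  printf '%s\t%s\n' \
    'pbs01 boot disk' 192.0.2.1 \
    pbs01-backups     192.0.2.1 \
    pbs01-archives    198.51.100.1
}

# Print each storage host that carries more than one component,
# followed by the components that would fail together with it.
shared_hosts() {
  awk -F '\t' '
    {
      sep = ($2 in count) ? ", " : ""
      count[$2]++
      names[$2] = names[$2] sep $1
    }
    END {
      for (h in count)
        if (count[h] > 1) print h ": " names[h]
    }
  ' | LC_ALL=C sort
}

inventory | shared_hosts
```

For this lab it prints a single line, `192.0.2.1: pbs01 boot disk, pbs01-backups`, which is exactly the pairing the green job never showed.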

Failure isolation: spread the blast radius

The fix isn’t more backups — it’s making sure no single storage failure can take out both a backup copy and the means to restore it. I relocated the daily job to a third, independent NAS appliance, separate from both the PBS boot disk and the archive store.

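# Register the relocated daily datastore with Proxmox VE (the pvesm storage type is "pbs").
# Note: --server takes the address of the PBS host; the NFS export itself is mounted inside PBS.
pvesm add pbs pbs01-daily \
  --server 203.0.113.1 \
  --datastore pbs01-daily \
  --fingerprint <pbs01-fingerprint>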

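Now the three things that must not die together (the backup server’s boot disk, the daily copy, and the archive copy) live on three different boxes. A single appliance failure takes out at most one of them: lose the boot-disk NAS and both copies are still there for a rebuilt PBS to read; lose either copy’s NAS and the server and the other copy are untouched.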

The general lesson

When you audit a backup setup, don’t ask “are the jobs succeeding?” Ask:

  • What’s the failure domain? Draw every backup copy and every component needed to restore (the backup server, its boot disk, its catalog). If two of them share a disk, an array, or a host, that’s your real single point of failure.
  • Does losing one box take out both a copy and the recovery path? If yes, the second copy isn’t really a second copy.
  • Have you tested a restore with the primary store offline? That’s the only test that exercises the dependency you can’t see.
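The first two questions can be sketched as a small table-driven check. This is a rough illustration in POSIX shell and awk, not a Proxmox tool: the `before`, `after`, and `audit` names and the `copy`/`restore` role labels are invented here, and the rows assume the layout described in this post, with the PBS boot disk still on 192.0.2.1 after the daily datastore moved.

```shell
#!/bin/sh
# Hypothetical inventory: role<TAB>component<TAB>storage host.
# role is "copy" for a backup copy, "restore" for anything a restore needs.

before() {   # original layout, with the shared NAS
  printf '%s\t%s\t%s\n' \
    restore 'pbs01 boot disk' 192.0.2.1 \
    copy    pbs01-backups     192.0.2.1 \
    copy    pbs01-archives    198.51.100.1
}

after() {    # daily datastore moved; boot disk assumed unchanged
  printf '%s\t%s\t%s\n' \
    restore 'pbs01 boot disk' 192.0.2.1 \
    copy    pbs01-daily       203.0.113.1 \
    copy    pbs01-archives    198.51.100.1
}

# For each storage host, report what losing that one box would take out.
audit() {
  awk -F '\t' '
    { hosts[$3] = 1 }
    $1 == "copy"    { copy[$3] = 1 }
    $1 == "restore" { restore[$3] = 1 }
    END {
      for (h in hosts) {
        if ((h in copy) && (h in restore)) what = "a copy AND the recovery path"
        else if (h in copy)                what = "one copy"
        else                               what = "the recovery path"
        print "lose " h ": " what
      }
    }
  ' | LC_ALL=C sort
}

echo "before:"; before | audit
echo "after:";  after  | audit
```

In the "before" layout, 192.0.2.1 is the only host that reports both a copy and the recovery path; in the "after" layout no host does. The third question still needs a real restore with the primary store offline, which a table can’t stand in for.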

3-2-1 is the slogan. The substance is failure isolation — and the dependency that bites you is almost always the one you’d stopped noticing.