Why Your Cronjob Backups Are Silently Failing
The False Sense of Security
You set up a cronjob. It runs pg_dump every night. The file appears in /backups. Everything looks fine until the day you actually need to restore.
This is the most common disaster recovery failure pattern: teams assume their backups work because the backup process ran without errors.
Five Ways Cronjob Backups Fail Silently
1. Disk Full and Dump Truncated
Your backup drive fills up. pg_dump writes a partial file and exits with an error, but the cron script never checks the exit status. The file exists, has a recent timestamp, and looks normal. But it might contain only 40% of your data.
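One cheap guard against a truncated dump is to check for the trailer comment that pg_dump writes at the end of a plain-format SQL dump. The sketch below assumes an uncompressed plain-format file; the function name and the 4096-byte window are illustrative choices, not part of any tool.

```python
from pathlib import Path

# Trailer comment that plain-format pg_dump writes once the dump has finished.
DUMP_TRAILER = b"PostgreSQL database dump complete"


def dump_looks_complete(path: Path, tail_bytes: int = 4096) -> bool:
    """Return True if the last bytes of the file contain the pg_dump trailer."""
    size = path.stat().st_size
    if size == 0:
        return False
    with path.open("rb") as fh:
        # Only the end of the file matters, so skip everything before it.
        fh.seek(max(0, size - tail_bytes))
        return DUMP_TRAILER in fh.read()
```

A missing trailer proves the dump is incomplete, but a present trailer does not prove the dump restores cleanly. For custom-format archives this check does not apply; listing the archive with pg_restore --list is a rough equivalent.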
2. Permission Changes
Someone rotates the database password or changes pg_hba.conf. The cronjob connects with the old credentials and fails. Cron sends an email to root, which nobody reads because the mailbox is full.
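Rather than relying on cron's mail to root, the job itself can check the exit status and push the failure somewhere people look. This is a minimal sketch: run_and_report is a hypothetical helper, and the print call stands in for whatever alerting channel you actually use.

```python
import subprocess
import sys
from typing import Sequence


def run_and_report(cmd: Sequence[str]) -> bool:
    """Run cmd; return True on exit status 0, otherwise report and return False."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The binary is missing or not executable.
        print(f"BACKUP FAILED (could not start): {exc}", file=sys.stderr)
        return False
    if result.returncode == 0:
        return True
    # Placeholder alert: replace with a pager, chat webhook, or ticket.
    print(
        f"BACKUP FAILED (exit {result.returncode}): {result.stderr.strip()}",
        file=sys.stderr,
    )
    return False
```

A call might look like run_and_report(["pg_dump", "--file", "/backups/db.sql", "mydb"]), where the file name and database name are placeholders. If the backup is a shell pipeline such as pg_dump piped into gzip, remember that the pipeline reports gzip's exit status unless pipefail is set.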
3. Schema Changes Break Restore
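Your app team ships a schema change, for example a CHECK constraint that calls a function reading another table, or a new dependency on an extension. The backup succeeds. But when you try to restore on a fresh instance, the restore fails because an object is loaded before something it depends on exists, or because the extension is not installed there.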
4. Corruption Goes Undetected
A disk controller silently corrupts data blocks. Your database keeps running because, unless data checksums are enabled, PostgreSQL does not verify the pages it reads. Your backup captures the corrupted data faithfully. By the time someone notices, you may have 30 days of corrupted backups.
5. The Backup Never Actually Ran
The server was rebooted and cron was not re-enabled. Or someone edited the crontab and accidentally deleted the backup line. Nobody checks because the monitoring only alerts on backup failure, and "never ran" is not the same as "failed."
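Catching "never ran" means alerting on the absence of a fresh backup instead of waiting for a failure signal. The sketch below checks the age of the newest file in the backup directory; the "*.sql" pattern and the 26-hour threshold (a nightly job plus some slack) are assumptions to adjust.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: Path, pattern: str = "*.sql") -> Optional[float]:
    """Return the age in hours of the newest matching file, or None if there is none."""
    mtimes = [p.stat().st_mtime for p in backup_dir.glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return (time.time() - max(mtimes)) / 3600


def backup_is_fresh(
    backup_dir: Path, max_age_hours: float = 26.0, pattern: str = "*.sql"
) -> bool:
    """True only if a backup exists and is younger than max_age_hours."""
    age = newest_backup_age_hours(backup_dir, pattern)
    return age is not None and age <= max_age_hours
```

Run a check like this from a different scheduler or host than the backup job. If it lives in the same crontab, it disappears along with the backup line.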
The Solution: Verified Backups
The only way to know a backup works is to restore it. Every single time.
BackupAgent does this automatically:
- Every backup is restored in an isolated Docker container
- Row counts are compared against the source database
- Schema integrity is verified
- Custom queries can validate business-critical data
- If anything fails, you get an immediate alert
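To make the row count comparison concrete, here is a generic sketch of the idea. It is not BackupAgent's implementation. It assumes you have already collected per-table counts from the source and from the restored copy, for instance with SELECT count(*) per table; row_count_mismatches is a hypothetical helper.

```python
from typing import Dict, List


def row_count_mismatches(source: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Describe every table whose restored row count differs from the source."""
    problems = []
    for table in sorted(set(source) | set(restored)):
        src, dst = source.get(table), restored.get(table)
        if src is None:
            problems.append(f"{table}: unexpected table in restore ({dst} rows)")
        elif dst is None:
            problems.append(f"{table}: missing from restore ({src} rows in source)")
        elif src != dst:
            problems.append(f"{table}: source={src} restored={dst}")
    return problems
```

An empty list means the counts agree. On a live database the source keeps changing after the dump's snapshot, so exact equality only makes sense if the source counts are captured at dump time; otherwise allow some tolerance.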
How to Audit Your Current Backups
Run this checklist today:
- When was the last backup? Check the actual file, not the cron schedule
- Can you restore it? Actually try it on a test instance
- How long does restore take? Measure it, because your RTO cannot be shorter than this
- How much data would you lose? The gap between backups is your worst-case data loss, and it has to fit inside your RPO
- Who gets alerted if a backup fails? Is anyone actually watching?
If you cannot confidently answer all five, your backups are at risk.
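For the restore questions in the checklist, a measured number beats a guess. The sketch below times an arbitrary restore command on a test instance; time_restore is a hypothetical helper, and the command you pass in depends on your dump format.

```python
import subprocess
import time
from typing import Sequence, Tuple


def time_restore(cmd: Sequence[str]) -> Tuple[bool, float]:
    """Run a restore command; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd)
    return result.returncode == 0, time.monotonic() - start
```

For a custom-format archive the call might be time_restore(["pg_restore", "--dbname", "restore_test", "/backups/db.dump"]), with the database and file names as placeholders. Record the elapsed time on every drill so you notice when restores start outgrowing your recovery window.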