I use restic for backups on Linux. After a cappuccino scare, I turned my attention back to automating a local backup of my everyday laptop. After a few days, I noticed that each quasi-daily backup was adding 200-500 MB of data, which seemed excessive. I presumed that I should exclude more files from this automated backup, but I didn’t know which ones. I guessed that I should look for big files or big directories. But how do I get that information?

The answer turned out to be a combination of restic, stat, and awk.

A Solution

Please note that this solution works in zsh and apparently not in bash. I don’t know the shells’ differences in enough detail to promise a bash rewrite, though I’ve included an untested guess after the notes below.

Also please note that I use rga, but regular grep would suffice with a small tweak to the pattern.
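Specifically, grep’s -E mode doesn’t support the (?: … ) non-capturing group in my pattern, but since grep is only filtering lines here, a plain group matches the same lines:

grep -E '^(M|\+)'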

restic $BACKUP_CONNECTION_DETAILS \
    diff $BEFORE_SNAPSHOT_ID $AFTER_SNAPSHOT_ID \
    | rga "^(?:M|\+)" | awk '{print $2}' \
    | while read f; do; (stat -t $f | awk '{print $2 , $1}'); done \
    2> /dev/null \
    | sort -n
  • Your $BACKUP_CONNECTION_DETAILS includes your repository address and password file (typically restic’s -r and --password-file options).
  • The snapshot IDs are the two snapshots to compare. I compared the most recent pair, but only as a start.
  • stat -t reports files it can’t find to stderr, so the 2> /dev/null discards those complaints. It shouldn’t be needed, but it’s there just in case.
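As for bash: my guess, and it is only a guess, is that the zsh-specific part is the stray semicolon right after do. Here is an untested bash-flavored sketch, with read -r and quotes around $f added because bash word-splits where zsh doesn’t:

restic $BACKUP_CONNECTION_DETAILS \
    diff $BEFORE_SNAPSHOT_ID $AFTER_SNAPSHOT_ID \
    | grep -E '^(M|\+)' | awk '{print $2}' \
    | while read -r f; do stat -t "$f" | awk '{print $2, $1}'; done \
    2> /dev/null \
    | sort -n

Like the original, this assumes paths without embedded whitespace, since awk splits fields on it.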

Either version lists the files modified and added between the two snapshots, along with their sizes in bytes, sorted with the largest files at the bottom. It gave me some clear ideas about which directories to exclude from this backup.
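The output ends up looking something like this (the paths here are made up for illustration):

4096 /home/me/.config/someapp/settings.json
52428800 /home/me/.config/Slack/Cache/Cache_Data/data_3
201326592 /home/me/.local/share/flatpak/repo/objects/4f/0a.commit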

The most recent backup added only 8 MB of data to the repository. That’s more like it.

What did I exclude? $HOME/.config/Slack and $HOME/.local/share/flatpak. I’m sure I should exclude more, but I don’t need to figure that out right now. If this becomes a bottleneck again, I’ll look into it.
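For reference, the exclusions look roughly like this on the backup side; adjust to however you invoke restic backup, or collect the patterns in a file and pass --exclude-file instead:

restic $BACKUP_CONNECTION_DETAILS backup $HOME \
    --exclude "$HOME/.config/Slack" \
    --exclude "$HOME/.local/share/flatpak"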

I will also eventually forget old snapshots and prune the repository, which will remove the large files that I should never have bothered backing up in the first place. There is no rush.
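When I do, it will be something along these lines, where the retention numbers are placeholders rather than a recommendation:

restic $BACKUP_CONNECTION_DETAILS forget \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
    --prune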