
The Lost continent of


Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.

Albert Einstein

Downloads, Backups, and RSync.

When you absolutely, positively have to have backups...

Tape drives are, like totally, so passé...

In this ever increasingly connected age, why would you risk your data to the vagaries of magnetic tape? Thanks to modern tools, it is now both easy and efficient to make off-site backups (even with large amounts of data) over a network, or even the Internet.

You could easily have a few gigabytes of important data to back up. Obviously you wouldn't want to have to copy all that over any sort of network connection on a regular basis. It is easy to imagine an improved scheme, where we only copy files that have changed since the last backup.

But we can do even better than that...

Introducing RSync

Andrew Tridgell, who now works on the Samba project, wrote a utility, RSync, that only copies the parts of a file that have changed - as a result, it's FAST.

RSync's advantages illustrated...

The savings in network traffic can be quite incredible. I regularly back up my home folder to our webserver over a 128 kbps connection: I have 40,000 files, some 2 GB worth. A synchronization with no files needing to be transferred finishes in under a minute!

The diagram to the left is an (only partially contrived) illustration of the amount of data that has to be moved to copy three big files I might have to back up to a remote machine. If we simply copy the lot, we have to copy a couple of hundred MB. That's no problem on a Local Area Network (LAN), but not really a good idea to do over the Internet on a regular basis...

If we already have an older, out-of-date copy of these files on my remote machine, we could just copy the one file that has actually changed (pink), by comparing the modification dates on the local and remote copies. This slashes our data bill by a third, but unless we have a fast connection, it's still going to take a while.
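As a rough illustration of the modification-date scheme (not of RSync itself), here is a minimal sketch. The function name copy_if_newer and the example paths are made up for this example, and both arguments are plain local paths standing in for the local and remote machines. The -nt test is not strictly POSIX, but common shells such as bash and dash support it.

```shell
#!/bin/sh
# Copy a file only when the source is newer than the existing copy.
# Illustrative only: both arguments are ordinary local paths here,
# standing in for the local and remote machines.
copy_if_newer() {
    src=$1
    dest=$2
    # Copy when there is no earlier copy, or when the source has a later
    # modification time than the copy we already hold.
    if [ ! -e "$dest" ] || [ "$src" -nt "$dest" ]; then
        cp -p "$src" "$dest"    # -p keeps the modification time
        echo "copied $src"
    else
        echo "skipped $src"
    fi
}

# Example (hypothetical paths):
#   copy_if_newer ./bigfile.dat /mnt/backup/bigfile.dat
```

Note that a changed file is still copied in full, however small the change was.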

Enter RSync. Now we only have to transfer the parts of our file that differ between the local and remote copy. We could now do this 200 MB backup over a slow old dial-up connection in a few minutes!
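To get a feel for why this saves so much, here is a deliberately simplified sketch that checksums two files in fixed-size blocks and reports which blocks differ. It is not RSync's actual algorithm: rsync uses a rolling checksum so it can match blocks at any offset, and it runs the comparison across the network. The function names block_sums and changed_blocks and the 1024-byte default block size are assumptions made for this example.

```shell
#!/bin/sh
# Print "index checksum length" for every fixed-size block of a file.
block_sums() {
    f=$1
    n=$2
    total=$(wc -c < "$f")
    blocks=$(( (total + n - 1) / n ))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # Read block number $i only, and checksum it.
        sum=$(dd if="$f" bs="$n" skip="$i" count=1 2>/dev/null | cksum)
        echo "$i $sum"
        i=$((i + 1))
    done
}

# Print the index of each block of the new file that has no identical
# block at the same position in the old file.
changed_blocks() {
    old=$1
    new=$2
    size=${3:-1024}
    old_sums=$(mktemp)
    block_sums "$old" "$size" > "$old_sums"
    # Keep only the lines that do not appear verbatim in the old list.
    block_sums "$new" "$size" | grep -v -x -F -f "$old_sums" | cut -d' ' -f1
    rm -f "$old_sums"
}

# Example (hypothetical file names):
#   changed_blocks yesterday.tar today.tar 1024
```

Only the blocks listed would need to cross the wire; everything else is already on the other side.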

Backups using RSync

First, some background. I run my company's webserver off-site, physically within the server-room of my local ISP. I need to make backups of the websites (which my clients update themselves), and the contents of the SQL databases on the server. Obviously, I need to do this remotely, and securely.

It's possible to run RSync as a standalone server. This is a perfect way to serve large collections of public documents, but not for our present needs. I'd rather not have to open another port on the webserver just to get backups.

What we will do instead is to 'tunnel' everything through SSH (which I'm using to administer the server remotely anyway), using the following command:

rsync -avzP -e ssh root@www.myserver.com:/tmp/backup .

This command will copy everything under /tmp/backup on the remote system (www.myserver.com) to the current directory on your local machine. The '-e ssh' option tells RSync to use ssh as its remote shell (not needed in recent rsync versions), the '-z' option compresses the data stream, the '-v' option (verbose) shows us what is being transferred, the '-a' option (archive) tells rsync that we want recursion and want to preserve file metadata such as permissions, ownership, timestamps and symlinks, and finally, the '-P' option shows the progress of large files (and holds onto incomplete transfers).
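The same options can be spelled out with rsync's long option names, which are easier to read back in a script. The sketch below also adds --dry-run, which is not part of the command above: it makes rsync list what it would transfer without copying anything, which is handy for a first try. The function name preview_backup is made up for this example.

```shell
#!/bin/sh
# Show what a backup run would transfer, without copying anything.
#   $1  source, e.g. root@www.myserver.com:/tmp/backup
#   $2  destination directory, e.g. .
preview_backup() {
    # --archive  = -a, --verbose = -v, --compress = -z,
    # --partial --progress together are what -P stands for.
    # --dry-run is an addition for this sketch: nothing is written.
    rsync --archive --verbose --compress --partial --progress --dry-run \
        -e ssh "$1" "$2"
}

# Example:
#   preview_backup root@www.myserver.com:/tmp/backup .
```

Drop --dry-run once the file list looks right.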

If you want to run this command as part of a script, for inclusion in a cron job for example, you should read my article on using SSH with public key authentication so that you don't need to type in a password every time.
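A wrapper script for cron might look something like the sketch below. It assumes public key authentication is already working. The function name fetch_backup, the script path, the log file and the default destination $HOME/backups are all hypothetical. The ssh option BatchMode=yes makes ssh fail straight away instead of waiting for a password nobody is there to type.

```shell
#!/bin/sh
# Hypothetical wrapper, saved for example as /usr/local/bin/fetch-backup.sh
# A matching crontab entry (every night at 02:30) could be:
#   30 2 * * * /usr/local/bin/fetch-backup.sh >> "$HOME/fetch-backup.log" 2>&1

fetch_backup() {
    src=${1:-root@www.myserver.com:/tmp/backup}
    dest=${2:-$HOME/backups}
    mkdir -p "$dest" || return 1
    # No -v or -P here: nobody is watching a cron job.
    if rsync -az -e "ssh -o BatchMode=yes" "$src" "$dest"; then
        echo "$(date) backup ok"
    else
        status=$?
        echo "$(date) backup FAILED (rsync exit $status)" >&2
        return "$status"
    fi
}

# In the real script, the last line would simply call:
#   fetch_backup
```

Logging a line on every run, not just on failure, makes it obvious when the job has silently stopped running.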

Tips

  • Keep it simple
    A good backup strategy is easy to follow. Instead of rsync'ing an entire directory tree, I generate .tar archives of the files and folders I want backed up into a temporary folder (/tmp/backup), and rsync them from there to ease management of the backups.
  • Beware compressed files
    In the tip above, I say to generate .tar files and not .tar.gz files because, counter-intuitively, the gzipped archive would take a lot longer to update. This is because of the nature of data compression: a single bit changed near the start of a file can result in a completely different compressed file, which rsync will have to copy in full.
  • Automate it
    After a couple of months even the easiest backup system becomes a strain to use. Add an entry or two to your crontabs to back up automatically, but don't forget to check up on things regularly.
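Putting the tips together, the server side could be sketched roughly as follows. This is a guess at one workable layout, not the exact setup described above: the function name stage_archive, the script path, the crontab timing and the example directory are all hypothetical. The archive is a plain .tar with a fixed name, so each run overwrites the previous one and rsync has an older copy of the same file to compare against.

```shell
#!/bin/sh
# Hypothetical server-side script, e.g. /usr/local/bin/stage-backups.sh
# A matching crontab entry (every night at 01:00) could be:
#   0 1 * * * /usr/local/bin/stage-backups.sh

# Build an uncompressed archive of one directory in the staging folder.
#   $1  directory to archive
#   $2  staging folder (defaults to /tmp/backup)
stage_archive() {
    src=$1
    staging=${2:-/tmp/backup}
    name=$(basename "$src")
    mkdir -p "$staging" || return 1
    # Plain .tar rather than .tar.gz, and the same file name every run.
    tar -cf "$staging/$name.tar" -C "$(dirname "$src")" "$name"
}

# Example (hypothetical directory):
#   stage_archive /var/www/clientsite
```

Scheduling this a little before the rsync job on the backup machine gives the archives time to finish.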