Darwinweb

Rsync for Backups and Synchronization - Part 2: Synchronization Planning

July 5, 2006     

A good synchronization solution is very personal. The corporate (office) solution is usually simple network storage, with users left responsible for downloading and uploading from their various computers. It’s not that this is a good solution, it’s just that anything more would be too heavy-handed. Successful file management in this situation revolves around designating a canonical version (usually the network drive), and then manually keeping track of local files that you’re editing. This is actually not too difficult when your work is mostly Word and Excel documents. If you’re a programmer or web designer working with a ton of files, it’s basically impossible to do without some level of automation. Rsync gives you the ability to copy changes over large data sets efficiently, particularly when changes are sparse, as is almost always the case with markup and code. Unfortunately, typing rsync commands ad hoc is a recipe for disaster. A little extra time setting up formal scripts will pay huge dividends over the long term.

Warning — Make sure you have a recent backup of everything before you start messing around with rsync. Even if you’re careful, a backup is extra insurance that will make you more confident about dialing in your rsync parameters.

Assumptions

The first thing you need to make my approach work is a drive to keep the canonical copy of your work. In my case this is an external drive, but it could just as well be a partition, or even just a folder on one of your machines. I have a general rule of not touching the canonical copy. While not strictly necessary, this is valuable from a troubleshooting perspective because it means any changes to that directory are a result of your scripts.

I also don’t recommend using an existing live-data store as the canonical copy. One reason is that conflict resolution becomes more difficult. With a separate canonical copy you can at least always see the data from the last sync. Then if you change the data on more than one machine you have a common history to refer back to. The other big reason is that it’s good to see exactly what you’re syncing. Live copies tend to have machine-specific files that you don’t want to sync. The canonical copy lets you see explicitly what was and wasn’t synced. With a live copy it’s not always immediately apparent whether a given file was touched by your script or not.

I do all my work on Mac OS X boxes, and I keep them all up-to-date. While I’m a firm believer that a software monoculture is bad, in this case it makes life a hell of a lot simpler. For instance, I can assume all filesystems are structured the same. The existence of standard unix tools also works to my benefit when writing synchronization scripts. On the downside, Mac OS X has meta-data that should be preserved for synchronization purposes, and the existing OS X rsync (as of 10.4.7) has bugs that need to be manually patched before a truly robust synchronization system can be built.

Types of Files

For backups, full-drive mirrors are the norm. For synchronization, I’m more interested in the files I’m working on. Trying to make system files synchronize is madness anyway. Each machine needs to have its own copies of applications, system settings and logs. While it does take significant time to install all my standard software (especially unix software from source), rsync cannot solve this problem. So I start with my home directory.

Excludes

The Library is the first thing that needs to be excluded. The problem here is that things in the Library tend to be overwritten automatically, and our synchronization scheme depends heavily on file modification times. For instance, some programs save the preference file every time the program quits, even if no preferences were changed. Even if you wanted to synchronize preferences, you would have to be very careful that you quit all programs and up-synced before moving to another computer. This places a high burden on the user for very little reward (how much time does it take to customize your preferences?). The last thing you want to do is make synchronization complicated. The other problem is that the Library is a bit of a black box. There could be machine-specific stuff in there (like software registrations tied to machine serial numbers, for instance). New programs are constantly writing who-knows-what files in there.

You will likely want to include certain files from your library explicitly. Probably the most important for me is the TextMate bundles. I’ve spent countless hours writing custom macros and tweaking the TextMate bundles to optimize my workflow. All these changes are stored in ~/Library/Application Support/TextMate, and I definitely need them to be available on all my machines. Other examples I sync include NewsFire subscriptions and Wallet entries.
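One way to express "skip the Library except for a few chosen folders" is with ordered rsync include and exclude rules, since rsync applies the first rule that matches a path. The Ruby sketch below is only an illustration under that assumption, not the script used here. The helper name `library_rules` is hypothetical, and the only path taken from the text is the TextMate one.

```ruby
# Build ordered rsync filter arguments that keep selected folders and
# exclude their siblings. Paths are relative to the transfer root (the
# home directory), so each pattern is anchored with a leading slash.
def library_rules(keep)
  includes = []
  excludes = []
  keep.each do |path|
    parts = path.split("/")
    # Every directory on the way down must be included explicitly,
    # otherwise rsync never descends far enough to see the kept folder.
    (1..parts.length).each do |n|
      includes << "/" + parts[0, n].join("/") + "/"
    end
    # Each parent directory gets a catch-all exclude for its other children.
    (1...parts.length).each do |n|
      excludes << "/" + parts[0, n].join("/") + "/*"
    end
  end
  # Includes come first so they win over the catch-all excludes.
  includes.uniq.map { |p| "--include=#{p}" } +
    excludes.uniq.sort_by { |p| -p.count("/") }.map { |p| "--exclude=#{p}" }
end

puts library_rules(["Library/Application Support/TextMate"])
```

Passing these to rsync as separate array elements (for example with `system("rsync", *args)`) avoids shell quoting problems with the space in "Application Support".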

I should also mention that a number of these things are syncable through .Mac and, by extension, MySync, which is a more reasonably priced way to get the great syncing features of .Mac without the expensive subscription. As long as you have an always-on Mac that you can use as a server, MySync offers all the functionality of .Mac sync for the one-time cost of a license (currently free during the public beta period). This is a true synchronization solution that can handle specific data formats much more robustly than a generic solution ever could. It’ll save you a ton of trouble for the data types it supports. Go check it out.

After the Library is handled, the rest of the excludes are more straightforward. Personally I exclude all the default directories except Sites. The reasons vary, so here’s why for each:

Desktop This is where I download files. They tend to be large and temporary. I want the syncing to be fast, and the discipline of moving permanent files to a more appropriate location keeps my desktop clean.

Documents A lot of applications install stuff here automatically. I don’t want that stuff propagating around. Documents is too generic anyway.

Movies I put downloaded media here, but I don’t need that on every machine. It’s big and cumbersome so I just skip this directory.

Pictures My iPhoto library is in here. It’s big and I don’t need it on every machine. It definitely needs to be backed up, but not synced.

Public I rarely use this directory, but it’s kind of a server thing, and not all my machines are servers.

Music Much like Pictures I just can’t justify using 40GB of disk space for the same music on every single machine I own. I keep this at home, and if I want music on the go I bring an iPod.

You might think this eliminates most of my files, but mostly it just eliminates the large non-work-related files. Everything else is in Sites or stored in a series of custom directories in my home directory.

Finally there are a slew of common files and directories that just don’t need to be synced for varying reasons. I’ll leave the why as an exercise to my readers: .DS_Store, .Trash, .ssh, cache, Cache, templates_c.
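The excludes described in this section could be collected into rsync arguments along these lines. This is a rough sketch, not the author's script: the constant and method names are made up for illustration, and the Library is assumed to be handled by separate rules.

```ruby
# Default home-directory folders that are skipped (Sites is kept).
# The leading slash anchors each pattern to the top of the transfer,
# so a "Documents" folder deeper inside a project is still synced.
SKIPPED_DIRS = %w[Desktop Documents Movies Pictures Public Music]

# Names skipped wherever they appear in the tree. rsync patterns are
# case-sensitive, so .DS_Store uses the Finder's actual spelling.
SKIPPED_ANYWHERE = %w[.DS_Store .Trash .ssh cache Cache templates_c]

def exclude_args
  SKIPPED_DIRS.map { |dir| "--exclude=/#{dir}/" } +
    SKIPPED_ANYWHERE.map { |name| "--exclude=#{name}" }
end

puts exclude_args
```

Keeping the list in one place means there is a single spot to edit when a new junk directory turns up.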

Outside My Home Directory

As a web developer there are a handful of extra files sprinkled throughout my system that represent a lot of work and should be synchronized. Because these files are in potentially sensitive locations and because they are few in number, I don’t automatically synchronize them. Instead I synchronize symlinks to them. It would be possible for a secondary rsync command to synchronize the actual files after the symlinks themselves are synchronized, but I avoid this because some of these files contain potentially machine-specific configuration. Since I have the same username on all my machines, I could get away with it, but it’s just a little too brittle for my taste. The files I have symlinked are httpd.conf, /etc/hosts, and php.ini. Even though I’m stuck manually updating them, having them symlinked is a handy way to access them for quick edits. Combined with a one-way update to my sync repository, I can easily do the manual update of these files when the need arises.
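Setting up those symlinks by hand is easy enough, but a small helper keeps them consistent across machines. The sketch below is a guess at how that might look. The method name `link_configs` and the idea of a dedicated link folder are assumptions, and the caller supplies the real paths, since the locations of httpd.conf and php.ini vary by install.

```ruby
require "fileutils"

# Create (or refresh) symlinks inside link_dir that point at config files
# elsewhere on the system. Only the links live in the synced tree; the
# targets themselves are never copied when rsync preserves symlinks.
def link_configs(targets, link_dir)
  FileUtils.mkdir_p(link_dir)
  targets.map do |target|
    link = File.join(link_dir, File.basename(target))
    # Replace a stale link, but never delete a real file by accident.
    File.delete(link) if File.symlink?(link)
    File.symlink(target, link)
    link
  end
end
```

A call such as `link_configs(["/etc/hosts"], File.join(Dir.home, "config-links"))` would leave a `hosts` link in a (hypothetical) `config-links` folder under the home directory.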

The Syncing Semantics

As I alluded to earlier, the syncing I do is a user-administered process. Basically when I move from one machine to another I need to remember to upsync and downsync, at least if I’m going to be modifying any of the same files. Due to the structure of the files included and the use of rsync, this process should run quickly and not become a hindrance. I just have to remember to do it.

The criteria for my rsync command are pretty straightforward:

  • It should preserve file modification times, users, groups, symlinks, and Mac resource forks.
  • It should only overwrite a file when the incoming copy is newer, never replacing a newer file on the receiving side.
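As a rough mapping, each of these criteria corresponds to a standard rsync option. The Ruby sketch below spells them out in long form. The method name is hypothetical, and the resource-fork flag is deliberately left as a parameter because it depends on the rsync build: as far as I know Apple's bundled rsync of this era used `-E` for extended attributes and resource forks, while rsync 3.x gives `-E` a different meaning, so check the man page on your machine.

```ruby
# Translate the sync criteria into rsync options.
# resource_fork_flag is an assumption supplied by the caller, since the
# right flag differs between rsync builds.
def criteria_flags(resource_fork_flag = nil)
  flags = [
    "--recursive", # walk the whole directory tree
    "--times",     # preserve file modification times
    "--owner",     # preserve users (needs sufficient privileges)
    "--group",     # preserve groups
    "--links",     # copy symlinks as symlinks
    "--update"     # skip files that are newer on the receiving side
  ]
  flags << resource_fork_flag if resource_fork_flag
  flags
end

puts criteria_flags.join(" ")
```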

These criteria are easily met with some simple rsync flags. However, there is one snag. We need to delete files in order to keep the workspace tidy. The problem is that deleting files is also dangerous. Sometimes you may not be completely sure of the status of the canonical copy or your local copy, so you’ll want to synchronize without deleting first. Then perform your deletions and run a deleting sync right away, so you are aware of exactly what is being removed.

In order to set this up I wrote a ruby script that allows you to perform both up and down syncs, with or without deleting files that no longer exist on the sending side, and also with the option to do a dry run to see which files would be deleted before going through with it. Although this is a touchy area and my solution still requires some discipline, this script has proven good enough for me in everyday use. It’s reduced the mental overhead enough to make synchronizing more practical than sticking to a single machine.
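The actual script is not shown here, so the following is only a minimal sketch of the idea: one function that builds the rsync command for an up or down sync, with deletion and dry run as explicit opt-ins. The function name, the keyword arguments, and the `/Volumes/Canonical/home` path are all hypothetical. It leans on the standard `--archive`, `--update`, `--delete` and `--dry-run` options.

```ruby
# Build the rsync command for one sync run.
# direction: :up   copies local -> canonical
#            :down copies canonical -> local
def sync_command(direction, local, canonical, delete: false, dry_run: false)
  src, dest = case direction
              when :up   then [local, canonical]
              when :down then [canonical, local]
              else raise ArgumentError, "direction must be :up or :down"
              end
  cmd = ["rsync", "--archive", "--update", "--verbose"]
  cmd << "--delete" if delete    # remove files missing from the source
  cmd << "--dry-run" if dry_run  # report what would happen, change nothing
  # Trailing slash on the source: copy its contents, not the folder itself.
  cmd + [File.join(src, ""), dest]
end

# Preview a deleting up-sync before committing to it.
preview = sync_command(:up, Dir.home, "/Volumes/Canonical/home",
                       delete: true, dry_run: true)
puts preview.join(" ")
# system(*preview) would run it; drop dry_run: true to do it for real.
```

Making deletion off by default means the safe behaviour is what you get when you forget a flag, which matches the habit of syncing without deletes first.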

Next Week

Although I was going to present my actual scripts this week, I’ve decided to hold back. I want to make some modifications and do some further testing before releasing them into the wild. Also, this article is plenty long enough at the abstract level. Next week I’ll present the scripts and get into the nitty gritty of rsync excludes, which deserve a discussion all their own.

<< Part 1