Darwinweb

Rsync for Backups and Synchronization - Part 1: Overview

June 26, 2006     

Backing up your data is probably the single most overlooked necessity in computing today. Over the years I have periodically lost a lot of my work, mostly due to laziness. Since 2000 I have been lucky and haven’t lost much of anything. When I quit my job and started freelancing full-time last September, it became clear that I needed to get serious about backups. A few months ago I finally implemented a simple backup solution for my PowerBook. Since then I’ve also bought a second computer and struggled with the issue of synchronization as well as backups. Today I have finally set up network-based backups to Strongspace in addition to my local HD backup. Peace of mind at last.

Why did it take so long for me to get my shit together? Well honestly, backing up is not a simple problem, especially when doing network-based backups with limited space. Simply mirroring my HD or even my home directory is too time-consuming to be practical. I wanted something automated, robust, inexpensive and efficient, which meant really getting my hands dirty. The good news is that I was able to accomplish everything I need using free software, which is a big relief, because commercial solutions tend to be opaque, less scriptable and less interoperable than good old-fashioned Unix tools.

Over the next few weeks I’ll be sharing some of the information and scripts I’ve developed for backup and syncing purposes, which will hopefully save you some time if you decide to set up a similar solution.

File Copying Methodology

The idea of backing up and syncing essentially comes down to copying files around. If we always use a single computer, and we have sufficient bandwidth, then backing up is a trivial problem. However, in the real world we need smarter tools than a simple copy utility. Conceptually I break it down into three levels:

cp / scp: Basic copying utilities are a good place to start, but they fall short for incremental changes in low-bandwidth situations because they copy every file in full each time. In other words, they are no good for network backups of large bodies of data.

rsync: The rsync algorithm solves the problem of efficiently transferring changes between files. This makes it great for incremental backups. The standard utility also has functionality that can help with synchronization, such as comparing file modification times and deleting files from the destination that no longer exist at the source. Even so, rsync should be considered a hack for true synchronization, and as a user you should be very, very careful.

version control systems: For software developers the problem of synchronization is serious and requires the ability to roll back changes as well as reconcile conflicts between multiple developers. Systems such as CVS or Subversion offer the minimal functionality needed in this realm, whereas certain types of projects require more robust abilities such as merge-tracking. Theoretically any such system gives you all the information you need to manage synchronization effectively.
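
To make the difference between the first two levels concrete, here is a rough Ruby sketch of the idea behind block-level transfer. It is a deliberate simplification and not the real rsync algorithm: it only matches blocks at fixed boundaries, whereas rsync uses a rolling checksum to find matches at any offset. The block size, the choice of SHA-256, and all function names are my own assumptions for illustration.

```ruby
require "digest"

BLOCK_SIZE = 4 # unrealistically small, so the example is easy to follow

# Split a string into fixed-size blocks (the last one may be shorter).
def blocks(data, size = BLOCK_SIZE)
  (0...data.bytesize).step(size).map { |offset| data.byteslice(offset, size) }
end

# Receiver side: map the digest of each block in the old copy to its index.
def signature(old_data)
  blocks(old_data).each_with_index.map do |block, index|
    [Digest::SHA256.hexdigest(block), index]
  end.to_h
end

# Sender side: describe the new data as references to blocks the receiver
# already has, plus literal data for blocks it does not have.
def delta(new_data, sig)
  blocks(new_data).map do |block|
    index = sig[Digest::SHA256.hexdigest(block)]
    index ? [:copy, index] : [:literal, block]
  end
end

# Receiver side: rebuild the new data from the old copy and the delta.
def apply_delta(old_data, instructions)
  old_blocks = blocks(old_data)
  instructions.map { |op, arg| op == :copy ? old_blocks[arg] : arg }.join
end

old_copy = "abcdefghijkl"
new_copy = "abcdXXXXijkl"
changes  = delta(new_copy, signature(old_copy))
puts apply_delta(old_copy, changes) == new_copy
```

In this example only the "XXXX" block has to travel as literal data; the other two blocks are sent as short references to data the receiver already holds. A plain cp or scp would send all twelve bytes again.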

Is Version Control Necessary?

In short, version control makes synchronization easy. Because old versions are saved and local changes are tracked, you are pretty safe from overwriting data. On the other hand, version control can eat up a lot of disk space over time, and can be too slow for large data sets. In general, version control systems just aren’t optimized for backing up and syncing large amounts of data. Ideally there would be some solution that combines rsync’s efficiency with basic history-less versioning: something that provides the merging capability without the large overhead of a version control system. Perhaps such a solution exists, but I’m not aware of it at this time.

Without the notion of versions we are required at all times to maintain a canonical version of every file. For a single user this is usually doable and can be managed automatically through rsync with the use of file modification times. There are some pitfalls with this approach, such as preference files that are ‘modified’ every time an application quits, or symbolic links, which rsync doesn’t handle perfectly. You also have to be very careful about file deletions and how to propagate them safely to the repository. Depending on your workflow, you may or may not be able to work around these problems. In my case I’ve resorted to some additional help such as .Mac syncing for certain preferences, but have largely avoided general version control.
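
As a minimal sketch of what "canonical by modification time" means in practice, the following Ruby function compares two copies of a file and reports which one would win. The function name and the returned symbols are hypothetical, and this is not how rsync is implemented internally; it only mirrors the kind of timestamp comparison that rsync's --update option relies on.

```ruby
# Decide which copy of a file looks canonical, using nothing but existence
# and modification time. Names and return values are illustrative only.
def canonical_side(local_path, repo_path)
  local_exists = File.exist?(local_path)
  repo_exists  = File.exist?(repo_path)

  return :neither    unless local_exists || repo_exists
  return :only_local unless repo_exists
  return :only_repo  unless local_exists

  # Time#<=> returns -1, 0 or 1.
  case File.mtime(local_path) <=> File.mtime(repo_path)
  when 1  then :local
  when -1 then :repo
  else         :same
  end
end
```

The pitfalls show up directly in the return values. A preference file that is rewritten on quit comes back as :local even if its contents never changed, and :only_repo cannot tell you whether the file was deleted locally or simply never synced down. That ambiguity is why deletions need special care.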

The Setup

I’ve got a Mac Mini running as a server with an external hard drive attached which serves as the primary backup and sync repository. Keeping the canonical copy of my data on a separate hard drive simplifies synchronization because I don’t have to worry about local changes. The repository only changes when I explicitly decide to up-sync from one of my computers. My primary computer is my PowerBook G4, which I periodically up-sync to the external hard drive. Finally, my Strongspace account provides the necessary off-site backup of my PowerBook.

Currently I’m using a set of Ruby scripts to manage the functionality of rsync safely. Backing up and synchronization use significantly different rsync commands, but by using Ruby scripts to wrap the functionality, it’s very easy to tweak them according to specific needs. Next week I’ll present my current synchronization script and delve into its design process.
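
To give a flavor of what such a wrapper can look like, here is a small sketch. It is not one of my actual scripts: the mode names, the flag sets and the paths are assumptions chosen for illustration. The rsync options themselves (--archive, --delete, --update, --dry-run, --itemize-changes) are standard ones.

```ruby
# Hypothetical presets; the flag sets are illustrative assumptions.
RSYNC_MODES = {
  # Backup: make the destination mirror the source, including deletions.
  backup: ["--archive", "--delete"],
  # Sync: skip files that are newer on the receiving side; never delete.
  sync:   ["--archive", "--update"]
}.freeze

# Build the argument list without running anything. Dry run is the default,
# so a destructive transfer has to be requested explicitly.
def rsync_command(mode, source, destination, dry_run: true)
  flags = RSYNC_MODES.fetch(mode) { raise ArgumentError, "unknown mode: #{mode}" }
  command = ["rsync", *flags]
  command << "--dry-run" << "--itemize-changes" if dry_run
  # A trailing slash on the source tells rsync to copy the directory's
  # contents rather than the directory itself.
  command << File.join(source, "") << destination
end

# Run the command; passing an array to system avoids shell quoting problems.
def run_rsync(mode, source, destination, dry_run: true)
  system(*rsync_command(mode, source, destination, dry_run: dry_run))
end

# Example paths are made up.
puts rsync_command(:backup, "/Users/me/Documents", "/Volumes/Backup/Documents").join(" ")
```

Keeping the command construction separate from the call to system makes it easy to print and inspect exactly what would run before letting --delete loose on real data.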
