
Refactoring My Backup Process

A couple of weeks ago, I decided to spend a few hours on a Friday afternoon improving the backup process for my Blackfly managed hosting service. Two weeks later, I've published my ongoing work as an update to my backup-rsync project and have decided to share it with you.

You might think I'm trying to compete for "least click-bait-like title ever", but I'm going to claim this topic and project might be of interest to anyone who likes to think about refactoring, or who is implementing backups for container-based hosting (like mine).

Definition

"Backup" is one of those overloaded words in both vernacular and computer-specific use, so I want to start with definitions. Since "a backup" is amongst the least interesting objects (unless it contains what you absolutely need in that moment), I think it's more interesting and useful to define backups functionally, i.e.

A "backup process" is a process that

1. provides a degree of insurance against machine and catastrophic failures
2. provides a tool to revert and recover from human error
3. enables the generation of site copies for the purpose of staging and development

In other words, it's a process defined by its purpose.

A "backup" is an artefact of a "backup process", and i'd say a good backup process is likely to generate more than one "backup".

This definition is maybe more encompassing than a traditional or generic backup service, which will focus on 1., and provide some degree of 2. and 3. if you're lucky.

Background

I've been running my Drupal and CiviCRM managed hosting service using containers for about 5 years now. One of the delights of container-based hosting has been the built-in distinction between "infrastructure" and "code and files".

For example, if you're using a generic virtual machine to host your website, then there's no inherent distinction, at the file system level, between the infrastructure code (e.g. apache, its logs, and configuration files) and the site-specific code and files. Which means, if you really want to ensure that you can recover reliably from a machine failure (or generate a copy of the site), you'd need to backup "everything", or else keep track manually of how you built it (your "infrastructure configuration"), and know where your "code and files" live (usually, that's in your document root, but nowadays there's often important stuff outside of that).

With container-based hosting, your "code and files" are the "persistent" part of your application that gets stored in a "volume", and the rest of it is defined by the image (which is generated by a Dockerfile). So that distinction has already been made and baked in as part of the application design.

Which means that there's a canonical way to backup a docker application, i.e.

1. Make sure you've got your images maintained, built and "composed/orchestrated" in a reproducible way.
2. Backup your volumes.

In theory, if you can do that, you can reliably reproduce an application instance ("a site") by generating the application from the orchestration file and associated images, and then restore the volumes from your backups.
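As a rough sketch of the restore half (reusing the rsync-equipped backup container that appears in the script further down, and simply reversing the direction of the mounts - the paths and variables here are illustrative):

docker run --rm \
  --mount source=${VOLUME},target=/restore-target \
  --mount type=bind,source=$BACKUP_DIR,target=/restore-source,readonly \
  blackflysolutions/backup-rsync \
  bash -c "rsync -a /restore-source/ /restore-target"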

If you're confused by the composed/orchestrated piece, it might be helpful to understand that a Drupal site built in the Docker Way has its "orchestration" configuration in a "docker-compose" formatted file - that file then refers to the different images that it needs. But for the purpose of this article, the key is that the application's persistent data lives in volumes.
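For example, a stripped-down (and entirely hypothetical) docker-compose file for a Drupal service might declare its persistent data like this - the named volume at the bottom is the only part the backup process needs to care about:

services:
  drupal:
    image: drupal:10
    volumes:
      - drupal-files:/var/www/html/sites/default
volumes:
  drupal-files: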

Rsync

Rsync is an older cross-platform utility program for efficiently backing up large file systems that don't change much.

It solves the problem of wasted bandwidth and storage that a more naive backup system would suffer from. By comparing the backup source with an existing, previous backup, it performs an incremental update that only transfers what it has to, so that you can keep a single, updated copy of your source on a remote backup system.
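A minimal sketch of that kind of invocation (the paths are illustrative): -a preserves ownership, permissions and timestamps, -z compresses data in transit, and --delete removes files from the destination that no longer exist at the source.

rsync -az --delete /backup-source/ /backup-dest/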

My Original Backup Process

My original backup process followed the generic advice for backing up volumes via rsync. It looks more or less like this:

For each volume that you want to back up, mount it into a dedicated "backup processor" container that includes rsync, mount the backup destination into the same container, and then run rsync. Here's the line from my original backup script:

docker run --rm \
  --mount source=${VOLUME},target=/backup-source,readonly \
  --mount type=bind,source=$BACKUP_DIR,target=/backup-dest \
  blackflysolutions/backup-rsync \
  bash -c "rsync --delete --exclude \"files/css/\" --exclude \"files/js/\" --exclude \"files/civicrm/templates_c/\" $BACKUP_EXTRA -azd /backup-source/ /backup-dest"

But that's inadequate for a number of reasons. The main one is that keeping a copy of the volume on the same machine is going to help with purposes 2.+3., but won't handle a machine failure or catastrophic failure. So I also wanted to push copies of that volume to other machines, and to external cloud backups that were not tied to my primary machine provider.
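As a sketch of what those additional copies can look like (the host, paths and bucket name are made up), an rsync over ssh covers the "other machine" case, and a tool like gsutil's rsync mode covers the external cloud case:

rsync -az --delete -e ssh /backup/${VOLUME}/ backup@other-machine:/backup/${VOLUME}/
gsutil -m rsync -r -d /backup/${VOLUME} gs://bucketname/${VOLUME}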

So, that was working well-ish, but my script gradually got more and more gnarly as I dealt with slightly different types of sites, using different volume strategies. 

Specifically, you can build your Drupal and/or CiviCRM code into the image and then just mount a volume under sites/default (that's the canonical way to use the official Docker Drupal image), or you can include the Drupal and CiviCRM code in the volume and put only the required Drupal/CiviCRM "infrastructure" (php with the appropriate modules, etc.) in the image (that's the "Pantheon" approach).

After a number of years of going back and forth about which is "better", I've concluded they're both worthwhile approaches and now maintain and support them both.

As an extra wrinkle, a couple of years ago I started including WordPress in my hosting offerings.

For each of those different strategies, there are different places to look for the cache files you want to exclude, and, more significantly, the line between "files" and "code" falls in different places.

And finally, what pushed this project to the top of my list was that I wanted to be able to configure different hosts differently, i.e. to choose which other machines and external backup services each host's backups get sent to.

Goal

For my new backup process, I wanted to take the core collection of business logic that had been embedded in an organically growing bash script and replace it with code in the "backup processor" container, driven by a configuration file in a standard format.

That would mean that the backup script on the host only had to decide which volumes to back up, and not worry about the rest of the nitty gritty.

For my configuration format, I picked "json", not because it's better than the "yaml" alternative (it's not), but because I was familiar with the official docker hub bash scripts that use json configuration files, and I didn't see any equivalently well-supported way of reading yaml files from a bash script.
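For what it's worth, a tool like jq makes those json lookups straightforward from bash - this is just a sketch of the idea (the file name is made up), not necessarily how the project itself reads its configuration:

# read the rsync destination for the "local" backup process from a config file
destination=$(jq -r '.local.destination' config.json)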

Configuration

A good configuration file format has enough flexibility to deal with variability of both the present and the future, and at the same time, stay legible and editable so that it can evolve with a minimum of human error and distress.

In other words, the hard part is getting the abstractions right.

What I've ended up with is the following set of concepts, as can be seen in the example configuration provided with the project.

Recall that a configuration is per (backup source) volume.

Top level = "named backup process".

In the example, you've got "local", "remote" and "gcloud".

In my more elaborate real-world configurations, I've been breaking these up into "local-code", "local-files", "local-civicrm" (and equivalent for remote and gcloud), so that I can better support their use for staging/development copies.

The reason for this more elaborate configuration is that with really big sites, I don't want to copy all the files to generate my devel copy; I'd rather use the file staging module. For a site with 91G of files, that's not an insignificant issue!

The other reason for the elaboration is because although it's great to have a local copy of the most recent code (especially if it gets edited by mistake and you just need to grab yesterday's copy), keeping a full local copy of 90Gb of files is not very useful, and it's pretty wasteful!

Second level

For each of those named backup processes, we'll want to run some version of

rsync options source destination

The destination will depend on the backup process, but the source is mostly the same in each case (i.e. the volume that gets mounted into /backup-source), though it might depend on the "volume type".

I haven't defined "volume type", and it can be anything you like as far as this configuration file is concerned, but I use it to handle the different types of volumes: those used by sites that mount the volume at /var/www/drupal and include the Drupal code in the volume, vs. those that mount the volume at /var/www/drupal/web/sites/default and include the code in the image. And now also to handle WordPress sites.

I could just push that logic to the host backup script, with a different configuration for each of the different types, but it's nicer to manage a few configuration files than a sprawl of different ones.

So the keys of the second level are (a full example sketch follows the list):

i. process: i.e. "rsync -adz"

This one has to be the same for all "types".

ii. destination: a directory or process-specific "target" for the backup.

For the google cloud backup process, this would be a "bucket" target something like gs://bucketname/$VOLUME

Notice that the values in the configuration can contain expandable variable names.

iii. subdirectories: these are appended to both the /backup-source/ and the destination to get the actual source and destination arguments passed to the process.

This one is 'type specific' with whatever key collection you like - the key used is whatever gets passed into the container's entry point as the second argument.

iv. options: these are also type specific. There's a slightly hackish feature: you can set an option to "ignore" if you want to exclude that backup process/type pair. For example, since I don't include CiviCRM in my WordPress sites, I ignore the CiviCRM-specific backup processes for type "vwp" (the type of volume I use for WordPress sites).

v. report: by setting this, you can use this script to tell you how much space the target is using. Typical entry would be "du -sb $destination".

vi. initialize: in a case where you're using subdirectories with rsync, you need to make sure that the target directory exists before rsyncing to it, or you'll get an error. This initialize function allows you to define that, e.g. "mkdir -p $destination".

vii. comment: this isn't a real entry, it's ignored by the script, but it's probably the most useful and important one anyway.
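Putting all of that together, a configuration for a single volume might look roughly like the following. This is a simplified sketch rather than the project's shipped example: the type key "vwp" comes from the discussion above, while "v1", the paths and the gcloud "process" command are made up for illustration.

{
  "local": {
    "comment": "keep an up-to-date copy of the volume on the same host",
    "process": "rsync -adz",
    "destination": "/backup/$VOLUME",
    "initialize": "mkdir -p $destination",
    "report": "du -sb $destination",
    "subdirectories": {
      "v1": "web/sites/default/files",
      "vwp": "wp-content/uploads"
    },
    "options": {
      "v1": "--delete --exclude files/css/ --exclude files/js/",
      "vwp": "--delete"
    }
  },
  "gcloud": {
    "comment": "offsite copy; the process command here is a placeholder",
    "process": "gsutil -m rsync -r",
    "destination": "gs://bucketname/$VOLUME",
    "options": {
      "v1": "",
      "vwp": "ignore"
    }
  }
}

The "ignore" under gcloud for type "vwp" is only there to show the mechanism described in iv. above.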
