PRIMER: Redundant Data Backup in the Cloud for Linux

June 25, 2014 12 min read Howto marco

Here is my problem: when I travel, I want to have access to my most important documents (like my passport, or my driver’s license, or the airline tickets, or a bunch of other things). I want to have access to those documents particularly if I lose my computer and the originals. Imagine I am in Fiji on a surf trip (I wish!) and everything I have gets stolen while I conquer Cloudbreak. I come out, have nothing, and have nothing to prove that I once had something.

I could store documents online, of course. But then I have to deal with security issues. What if someone gains access while I am not watching? Can I trust the company that stores them to do so securely? What if the company goes out of business? What if a hacker locks me out of my account, and I have no way to get access back?

Also, once I come up with a system that stores my emergency documents safely and securely, what about non-emergency documents? Can I find a system that stores things securely and safely, but can be updated constantly? Is there a way to have files saved online that doesn’t jeopardize their integrity?

I figured out the way, and now I am replicating my sensitive documents online, trusting my experience and not any company’s promise. And this article is a howto on how you can do so, too.

On the most fundamental level, I started with file synchronization tools like Dropbox. In case you don’t know how that works, you essentially create an account with a company (say, Dropbox). Then you designate a folder to be synchronized. From then on, the Dropbox client software will copy all files in that folder to the Dropbox account, and vice versa.

Dropbox is very useful for many things: you can copy files from many different computers and keep them all in sync. That’s what I do with my configuration files: they are stored in a Dropbox folder and symlinked to the home directory. When I change settings on one computer, that replicates to all my computers. Really nifty. Also, you can keep your picture collection in sync – particularly useful now that Dropbox allows for instant upload of smartphone pictures.

Dropbox, like many other sync tools, makes promises about encryption. Some people have argued that Dropbox can’t be serious about those promises. In any case, you wouldn’t want to trust a third party with encryption of your data. Not because they don’t know how to handle it, but because you don’t know how they handle it. You are better off completely ignoring their encryption capability and only syncing important data that you encrypted yourself.

Each service does its syncing slightly differently. But they all start from the same premise: some location on your file system is marked to be synced, and while you use it at local speed, it is synchronized in the background. This is different from remote file systems like NFS or sshfs, where the files are not stored on your local system, ever. They are, instead, moved back and forth between your computer/laptop and a server.

This particular setup will not work with remote file systems. Well, that’s a bit overstated: it may work, or not, depending on circumstances. I certainly advocate against it.

Let us first concern ourselves with data safety. In this context, that means that if we have a failure of some sort, we can still get our data back. Usually, by failure we would mean a hard drive failure, or a stolen laptop. In this case, those are included, but we want even more data safety.

Notice how I am talking of safety and not security. Those are radically different: data safety concerns itself with making sure it doesn’t get lost, in part or totally; data security concerns itself with making sure unauthorized parties have no access to it. Think of it like this: data safety is about ensuring your access; data security is about ensuring no access to others.

Synchronization is a good first step in safeguarding your data. The principle here is that your data is stored in two different places. If you lose one copy – to hardware failure or theft – the other is still there. That’s much better than a backup on a local hard drive, because if something happens to your computer (for instance, a fire), it can easily happen to all other equipment nearby. It does you no good to have a daily backup of all your important documents if it burns down with the originals, right?

I should also mention that synchronization by itself has a huge flaw when it comes to data safety: it is not safe from you. If you delete a file or modify it, all copies will have the same change applied to them. Once a file is gone, you won’t be able to retrieve it. If that’s important to you, you need to find a file versioning system that keeps track of old versions. I wrote about one of those in a different article: it uses the file versioning system Subversion. You can easily use it on top of this setup.

You should think twice before using versioned file systems, though. Once you turn on versioning, all old versions of your files will be retained, no matter how much you’d like them to be gone for good. For some purposes, that’s what you want. For others, though, it might not be.

There was a time when hard drives dropped in price tremendously, while old backup solutions (typically on tape) remained relatively expensive. So, someone decided it would be a good idea to use bunches of inexpensive disks to build large storage arrays. The problem there was those inexpensive disks would fail after a fairly short time, so they had to design a way to be able to survive the death of a disk with the data (Wait For It!) safe.

What they came up with was a standard, which was named after the technology they were designing for: the Redundant Array of Inexpensive Disks. Short: RAID. In a RAID array (I know, it’s overkill, since the A in RAID already stands for Array) you can survive the sudden death of any single drive, replace it with a different drive, and go merrily on. RAID was not just a system to store data, you see: it was also a specification on how to rebuild the data if a part of the array had died.

The system is very complex and interesting, and it defines all sorts of ways to deal with drives. The most obvious one is called RAID 1, or mirroring: your data is written in full to each of two identical drives. It’s like having an instant backup. The one we are going to use, though, is RAID 5. In it, each drive contains a portion of the total data, plus parity information computed from the data on the other drives. No piece of data depends on only one drive. That way, if you lose one drive, you can reconstruct the data on that drive from the content of the others.
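To make the reconstruction idea concrete, here is a minimal sketch of the parity trick behind RAID 5, in Python. It assumes a single stripe with two data blocks and one parity block; the block contents and the helper name xor_blocks are made up for illustration, and a real array rotates the parity block from drive to drive, which this sketch ignores.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks of one stripe, as they would sit on two different drives.
drive_a = b"PASSPORT"
drive_b = b"TICKETS!"

# The third drive holds the parity of this stripe.
parity = xor_blocks(drive_a, drive_b)

# Drive B dies: rebuild its block from the two survivors.
rebuilt_b = xor_blocks(drive_a, parity)
print(rebuilt_b)  # b'TICKETS!'
```

The same call rebuilds drive A from drive B and the parity, which is why it does not matter which one of the three is lost.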

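We are picking RAID 5 over RAID 1 for two reasons. You can survive the loss of one copy in either of them, but with RAID 1 every provider holds a complete copy of the data, while with RAID 5 each provider holds only a part of it, and you need all but one of the parts to get a full data set. So, we are adding a tiny bit of security to our system by choosing RAID 5, without losing any safety. Also, with RAID 5 we can split our data among as many providers as we like (normally at least three), and each additional provider adds usable space, while with RAID 1 each additional provider just holds one more full copy.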
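The space trade-off can be put into numbers. The following Python sketch assumes equally sized members and at least three of them for RAID 5; the function names and the 1 GiB member size are illustrative assumptions, not part of the setup described here.

```python
GIB = 1024 ** 3


def raid5_usable_bytes(member_count: int, member_bytes: int) -> int:
    """Usable space of a RAID 5 array: one member's worth is spent on parity."""
    if member_count < 3:
        raise ValueError("this sketch assumes at least three RAID 5 members")
    return (member_count - 1) * member_bytes


def raid1_usable_bytes(member_count: int, member_bytes: int) -> int:
    """Usable space of a mirror: every member holds the same full copy."""
    if member_count < 2:
        raise ValueError("a mirror needs at least two members")
    return member_bytes


# Three providers with a 1 GiB file each.
print(raid5_usable_bytes(3, GIB) // GIB)  # 2
print(raid1_usable_bytes(3, GIB) // GIB)  # 1
# A fourth provider adds space to RAID 5, but only another copy to RAID 1.
print(raid5_usable_bytes(4, GIB) // GIB)  # 3
```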

So far, so good. But what does RAID (which talks about drives) have to do with Sync (which deals with files)? First, RAID arrays used to be typically controlled by special hardware. That was great, because the hardware not only took care of the data distribution, but also made it possible to “hot swap” a drive. If something went wrong and one of the drives died, the RAID controller would allow you to unplug the defective drive while the others (and the controller) were still running. Then you would be able to insert the new drive, and the array would continue responding to the host, while at the same time rebuilding the data on the new drive.

Some wonderful soul decided we don’t really need expensive hardware anymore, especially now that “inexpensive” has a completely new meaning when it comes to drives. The idea was now that you can just buy a bunch of USB (or FireWire) drives, plug them into your computer, and tell your operating system to treat them as a RAID array. All in software, since modern processors are plenty fast enough, anyway.

Another piece of the puzzle deals with the fact we don’t use actual hard drives, since we can’t synchronize them with our Dropbox. It turns out that Linux is incredibly powerful: it can make itself believe that a regular old file is actually a drive. You can do that with any old file: if you copy a CD verbatim to a Linux file, you can then tell the computer to think of it as an attached CD-ROM drive. That’s very useful, for instance, when you download a copy of a Linux CD and don’t want to burn it to a physical medium. Or when you manage to snag a single copy of a dying drive before it totally fritzes: you can mount that copy like it was the drive itself.

The technical term for using a file as if it were a drive (a block device, in Linux terms) is loopback, or loop for short. You can use any old file as a loopback device – it’s just that your average file will not look like much to Linux, and it will think you need to format it first.

In our case, that means we can create files that are synced and, at the same time, are part of a RAID 5 array. First, we create the files in the directories synced to the services we choose. Then we make those files into loop devices. And finally we combine the loop devices into a RAID 5 array. It sounds a little complicated, but it’s absolutely worth it. Especially because it’s one of those “Set It and Forget It” kind of deals.
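As a rough sketch of those three steps, the Python below creates one backing file per provider and prints the losetup and mdadm commands that would turn them into an array. Several things here are assumptions for illustration only: the provider folder names other than Dropbox, the file name raid-member.img, the 16 MiB size, the loop device numbers, and /dev/md0. The files are created in a temporary directory rather than in real synced folders, and the commands are only printed, because running them needs root and free loop devices.

```python
import shlex
import tempfile
from pathlib import Path

MIB = 1024 * 1024


def create_backing_file(path: Path, size_bytes: int) -> Path:
    """Create a file of the given size; it stays sparse where the file system allows."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        handle.truncate(size_bytes)
    return path


def plan_array(backing_files, md_device="/dev/md0"):
    """Return the commands (not run here) that would build a RAID 5 array from the files."""
    loop_devices = [f"/dev/loop{index}" for index in range(len(backing_files))]
    commands = [
        ["losetup", loop, str(backing)]
        for loop, backing in zip(loop_devices, backing_files)
    ]
    commands.append(
        ["mdadm", "--create", md_device, "--level=5",
         f"--raid-devices={len(loop_devices)}", *loop_devices]
    )
    return commands


# Stand-in for the home directory, so the sketch touches nothing real.
sync_root = Path(tempfile.mkdtemp())
# Hypothetical synced folders, one per provider.
providers = ["Dropbox", "ProviderB", "ProviderC"]
backing_files = [
    create_backing_file(sync_root / name / "raid-member.img", 16 * MIB)
    for name in providers
]

for command in plan_array(backing_files):
    print(shlex.join(command))
```

Keeping the plan as data rather than running it directly makes it easy to review every command before any of them touches a device.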

The next item on our list (well, stack) is optional. What happens if our beautiful RAID array runs out of space? In general, you’d have to create a new file system and copy all files from the old to the new. Since we want to keep our sync files small initially, though, we need to find a way to resize the whole thing without having to do the copy.

Linux has a powerful system, the Logical Volume Manager (LVM), that allows you to look at drives and partitions in a logical fashion. That is great, because that means that you can easily shrink and enlarge “partitions” (they are actually called volumes) after the fact. Resizing is still a little involved with our system, but adding logical volumes at this point costs us next to nothing in performance, security, or complexity.

So far, we have a bunch of files that we told our Linux system to treat as the “drives” of a RAID 5 array, on top of which we created a giant-ish logical volume (partition).
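For the logical volume layer, a sketch along the same lines: the Python below only prints the LVM commands that would put one volume on top of the array. The array device /dev/md0, the volume group name vault, and the volume name documents are placeholders chosen for this example, and the volume is assumed to take all the free space in the group.

```python
import shlex


def plan_logical_volume(md_device="/dev/md0", group="vault", volume="documents"):
    """Return the LVM commands (not run here) and the resulting volume path."""
    commands = [
        ["pvcreate", md_device],                               # mark the array as an LVM physical volume
        ["vgcreate", group, md_device],                        # create a volume group on it
        ["lvcreate", "-l", "100%FREE", "-n", volume, group],   # one volume using all free space
    ]
    return commands, f"/dev/{group}/{volume}"


commands, volume_path = plan_logical_volume()
for command in commands:
    print(shlex.join(command))
print(volume_path)  # /dev/vault/documents
```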

Nothing we have done so far deals with security. It’s all been about safety and continuity. Now we need to make sure prying eyes have no access.

Fortunately, the almost infinite set of tools Linux puts at your disposal includes a cornucopia of security solutions. For our particular case, we will look at cryptsetup. That’s a tool that allows us to encrypt a whole block device, and with it the file system we put on top: everything we write into it – everything! – is hidden by keys we choose ourselves.

cryptsetup, really, is a utility that hides the ugliness of direct setup. You can pick the backend you’d like to use – including TrueCrypt, the solution that recently received a ton of coverage because its developers abruptly declared it finished, leading to massive amounts of speculation in the security community.

In this particular solution, we will use LUKS – the Linux Unified Key Setup. It allows us to take the entire array and configure it as a single encrypted disk. The performance is phenomenal – you barely notice any difference – and the security is state of the art. Of course, it will all depend on the strength of your particular keys.

Once you create the encrypted disk, to Linux it’s still an empty container. Now you pick whatever file system you’d like and you create it on top of everything else.

You could go a compatible route and pick VFAT or NTFS. Then you could use your encrypted RAID array on Windows. Or you can go for a mainly Linux file system type, like ext4. You could also go for something exotic like an ISO file system (what they use on CD-ROMs). It’s really up to you. Linux won’t really mind much.

After you create the file system, you need to mount it. That’s the same as would happen with any other drive you attach, and you need to put the same kind of attention to detail into it as you would with other drives. (Sadly, on Linux, that’s still a bit of a complication.)
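The last three layers (encryption, file system, mount) can be sketched the same way. The Python below prints the commands rather than running them, since luksFormat destroys whatever is on the target device and asks for a passphrase interactively. The volume path /dev/vault/documents, the mapping name vault_plain, the mount point /mnt/vault, and the choice of ext4 are all assumptions made for this example.

```python
import shlex


def plan_encrypted_filesystem(volume="/dev/vault/documents", mapping="vault_plain",
                              mount_point="/mnt/vault", fs_type="ext4"):
    """Return the commands (not run here) for LUKS, file system creation, and mounting."""
    mapped = f"/dev/mapper/{mapping}"
    return [
        ["cryptsetup", "luksFormat", volume],        # write the LUKS header, set the passphrase
        ["cryptsetup", "luksOpen", volume, mapping], # unlock; the clear-text device appears as `mapped`
        [f"mkfs.{fs_type}", mapped],                 # create the file system inside the container
        ["mount", mapped, mount_point],              # attach it to the directory tree
    ]


for command in plan_encrypted_filesystem():
    print(shlex.join(command))
```

Note the order: the file system is created on the unlocked mapping, not on the volume itself, so everything written through the mount point passes through the encryption layer.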

Computer security people like to talk about the threat model. That’s the set of things you are worried about and the people you want to defend against.

Turns out it’s really important to realize what you are afraid of, because protecting against all possible things is nearly impossible. So you list everything you think might happen, and then you can focus your security and safety considerations on that list.

Our particular threat model includes:

Loss of hardware
Loss of a single sync provider at a time
Unauthorized access to your data on a single provider account
Unauthorized access to your data on your hardware while shut down
Unauthorized access to sync data “in flight” (while being transmitted)

It does not include:

Loss of multiple providers at a time
Unauthorized access to your data on your hardware while turned on or in suspended mode
Versioning of data for the option to retrieve old or deleted files

We would also like:

Fast, reliable access
Ability to grow the data pool