HOWTO: Redundant Data Backup in the Cloud for Linux

June 25, 2014 21 min read Howto marco

After the previous article explaining the principles behind this form of “safe” cloud backup, here is a step-by-step tutorial on how to make it work. The software used and the commands issued are all for Ubuntu, but you should be able to translate them into any modern Linux variant. On the other hand, much of the infrastructure required works only on Linux.

1. Prerequisites

Aside from the obvious (a modern version of Linux), you will need a series of tools that don’t come installed standard. First the actual commands, then an explanation:

sudo apt-get install mdadm lvm2 cryptsetup-bin

We are installing three packages:

mdadm: the package to control RAID arrays. From the description: “tool to administer Linux MD arrays (software RAID) The mdadm utility can be used to create, manage, and monitor MD (multi-disk) arrays for software RAID or multipath I/O.”

lvm2: the package required for logical volume management, allowing us to resize after creation. “This is LVM2, the rewrite of The Linux Logical Volume Manager. LVM supports enterprise level volume management of disk and disk subsystems by grouping arbitrary disks into volume groups. The total capacity of volume groups can be allocated to logical volumes, which are accessed as regular block devices.”

cryptsetup-bin: the package to encrypt the data. Technically, this is just a utility to manage the process, while the Linux kernel does the actual encryption, but to us it’s the same thing. “Cryptsetup provides an interface for configuring encryption on block devices (such as /home or swap partitions), using the Linux kernel device mapper target dm-crypt. It features integrated Linux Unified Key Setup (LUKS) support.”

As usual, you will be asked for your password and you should see a string of output lines. To test if it all worked, just type:

mdadm

You should see something like this:

Usage: mdadm --help
for help

If you see something like:

The program 'mdadm' is currently not installed. You can install it by typing:
sudo apt-get install mdadm

then something went wrong. If you’d like, you can create an account on this blog and post a comment with your error message.

2. Sync Accounts

For this to work, you will need three accounts with three different synchronization providers. Lucky you, the list on Wikipedia is huge, and you can pick and try all day long. In the end, you want to have three directories on your system configured such that they each synchronize to a different account at a different provider. On my first try, I used Dropbox, SpiderOak, and Wuala, but I’ll complement this HOWTO with a separate comparison of the different options from the list in a later article.

For now, we will assume that the three directories synced are: ~/dir0, ~/dir1, and ~/dir2. Replace these names with the actual directories on your system – and consider that the file sync tools usually have a pretty good idea of where they want their files to live, so that you may not even be able to easily change the location.

3. Generate the Data Files

As explained in the primer, the infrastructure we are building is housed in three files, one each in the sync directories. It doesn’t really matter what’s in those files, since they are going to be overwritten anyway. For a modicum of added security, though, we will fill each file with random bits, so that nobody can guess the content. This is a painful process, since generating lots of randomness is strangely complex. As a result, this particular method may take a really long time for larger sync files.

Assuming we stored the location ~/dir0 from above in the variable DIR0 and we want to name the file file0.dat and give it a size of 100MB, we issue the command:

dd if=/dev/urandom of=$DIR0/file0.dat bs=1M count=100

dd is the Linux/UNIX utility to copy (parts of) files verbatim. In this case, it “copies” the “file” /dev/urandom (which is a random number generator) to the named file. Linux has an interesting series of generator devices like /dev/urandom. One that is much faster is /dev/zero, which simply generates sequences of bytes containing only 0s. If you care a little less about security and more about speed, you could use /dev/zero instead of /dev/urandom. The downside is that anyone who looks at your files would be able to tell what they are, because only the structures that are superimposed are non-zero. It’s like having a computer with a glass case: you can see everything inside.

It’s not as bad as I make it sound: your encryption is still going to be strong. But while with a random file you’ll keep people guessing as to what’s inside, starting with zeros gives people a much better clue as to what you did, exactly.

Now, if you built your data file using urandom, you need to build the two other data files the same way:

dd if=/dev/urandom of=$DIR1/file1.dat bs=1M count=100
dd if=/dev/urandom of=$DIR2/file2.dat bs=1M count=100

Notice that I changed the names of the files, too. You don’t have to do that, since they live in different directories. To rebuild, though, you may want to put them into the same directory at some point, so it’s a good idea to keep the names in sync with the directories. Also, the size of the file (100MB) is defined here as count 100 times the block size (bs) of 1M. The default block size is something stupid, 512 bytes, so you typically want to specify that explicitly.

If you generated the first file with /dev/zero, you can simply copy it to the other two directories. That’s going to be faster.

cp $DIR0/file0.dat $DIR1/file1.dat
cp $DIR0/file0.dat $DIR2/file2.dat

Do not copy the urandom file! The additional data privacy you gained by using urandom is gone if you copy the file over, since anyone can simply compare the two files later and see what changed!
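As a sketch of the above, the three random files can be generated in a loop, so that each one gets its own independent random content. The function name make_random_files is made up for this example, and it assumes GNU dd (for bs=1M):

```shell
# make_random_files SIZE_MB FILE...
# Fill each FILE with its own SIZE_MB mebibytes of random data.
# Every file is read from /dev/urandom separately; nothing is copied.
make_random_files() {
    size_mb=$1
    shift
    for f in "$@"; do
        dd if=/dev/urandom of="$f" bs=1M count="$size_mb" 2>/dev/null || return 1
    done
}
```

For the 100MB files used in this article, you would call it as make_random_files 100 $DIR0/file0.dat $DIR1/file1.dat $DIR2/file2.dat.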

4. Tell Linux You Want Those Files to Be Devices

For this purpose, we will use the losetup utility. losetup uses numbered devices, typically /dev/loopXXX, where XXX is a number starting at 0. Typically, you won’t have any of the loop devices used, and I will assume so for simplicity’s sake. In general, though, you can call losetup with the -f option, in which case it will find an unused device and return it. If you use the -f option, you have to carry the name of the device to the next step, when we merge the loop devices into our RAID array.

We start setting up our first loop device:

sudo losetup /dev/loop0 $DIR0/file0.dat

And then do the same for the other two. (From now on, I will omit repeating for the other members.)

This is a really simple command: it tells Linux that from now on it should treat file0.dat as the source for the device /dev/loop0. We could format the file, we could treat it as a hard drive, we could do whatever we would do to a drive. The power of UNIX compels you!
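If some loop devices are already in use on your system, the -f option mentioned above saves you from guessing numbers. A minimal sketch, assuming a losetup that supports --show (the util-linux version does); attach_all is a name invented for this example:

```shell
# attach_all FILE...
# Attach each FILE to the first free loop device and print the
# device name that was chosen, one per line.
attach_all() {
    for f in "$@"; do
        sudo losetup -f --show "$f" || return 1
    done
}
```

You would capture the output, for instance with LOOPS=$(attach_all $DIR0/file0.dat $DIR1/file1.dat $DIR2/file2.dat), and hand those names to mdadm in the next step instead of /dev/loop0, /dev/loop1, and /dev/loop2.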

5. Now We Create an Array

Here you get to feel like Master of the Universe: you will create your very own RAID 5 array! The command is pretty straightforward:

sudo mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/loop0 /dev/loop1 /dev/loop2

mdadm stands for “Multiple Disc ADMinistrator.” Here we tell it to create (--create) the new device /dev/md0. We want verbose output, so we know what’s going on. The new device is going to be a RAID 5 array (--level=5) and is going to contain 3 devices (--raid-devices=3). The three devices are /dev/loop0, /dev/loop1, and /dev/loop2.

Since I am explaining things, I’d like to mention that:

--verbose is strictly optional.
--level=5 is our choice. If we had chosen, say, --level=1, we would have specified mirroring, which needs only two devices in the array. We could also specify --level=6 (if supported), and then we’d need at least four devices.
Also, at level 5 we could specify any number of devices greater than 2. I picked three because it’s the smallest number possible, but you could add a fourth provider and a fourth file, even at a later date!
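The flip side of adding devices is capacity: RAID 5 spends the equivalent of one device on parity, so the usable space is the size of one file times the number of files minus one. A quick back-of-the-envelope helper (the function name is made up, and the real figure will be slightly lower because md, LVM, and LUKS each reserve some space for their own headers):

```shell
# raid5_usable_mb DEVICES SIZE_MB
# Print the approximate usable capacity of a RAID 5 array built from
# DEVICES members of SIZE_MB each: one member's worth goes to parity.
raid5_usable_mb() {
    if [ "$1" -lt 3 ]; then
        echo "this setup needs at least 3 devices for RAID 5" >&2
        return 1
    fi
    echo $(( ($1 - 1) * $2 ))
}

raid5_usable_mb 3 100   # prints 200
```

So the three 100MB files from this article give you roughly 200MB of backup space, and a fourth file would bring that to roughly 300MB.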

Next, we check that our RAID array is doing well by issuing the command:

sudo mdadm --detail /dev/md0

6. Time For Logical Volumes

If you are not concerned about resizing your array later, you can safely skip this part. Since it adds very little overhead, though, you really should try to read and understand it, and set up your array this way. The benefits are going to be clear later, and right now there isn’t a whole lot of extra work.

If you find that these commands fail, though, you can safely skip over them. Just make sure you don’t use the name of the logical volume we created here, and instead use the name of the array device when it comes to creating the file system and mounting later on.

First, we generate a logical volume group on our new array:

sudo vgcreate encraid /dev/md0

Here “encraid” is a label we picked. Could be anything (within reason), but we will have to use the same label with the other commands in this article.

After the volume group, we create a single volume in that group. To do so, we need to know the size (extent) of the group:

sudo vgdisplay encraid

This prints the details of the volume group. Look for the line that starts with “Total PE” (PE = Physical Extent). We will need that number to tell LVM how much space we’d like to give our new volume. Say it’s 123; then we issue the command:

sudo lvcreate -l 123 encraid -n lvm0

Here, 123 is the number we got above; encraid is the name we picked when we created the volume group; and lvm0 is an arbitrary name we picked to refer to this volume.
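If you would rather not copy the number by hand, it can be pulled out of the vgdisplay output. A small sketch, assuming the “Total PE” line has the value as its third field (which is how vgdisplay prints it on the systems I know); total_pe is an invented name:

```shell
# total_pe
# Read vgdisplay output on standard input and print the number
# found on the "Total PE" line.
total_pe() {
    awk '/Total PE/ { print $3 }'
}
```

You would use it as PE=$(sudo vgdisplay encraid | total_pe) followed by sudo lvcreate -l "$PE" encraid -n lvm0. As a shortcut, lvcreate also accepts -l 100%FREE in place of the number, which hands all remaining extents in the group to the new volume.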

Let’s check that everything is OK. You should have a file named /dev/encraid/lvm0 on your file system. If you issue the command:

sudo lvdisplay /dev/encraid/lvm0

you should see something that looks a little like the output of vgdisplay above and tells us all the gory details about our new volume.

7. How About Some Serious Encryption

At this point, we have two options: we either create an encrypted drive, or we create an encrypted file system. The former has the advantage of encrypting everything (including directory structure and file names), while the latter has the advantage (if correctly set up) of allowing partial copies that are encrypted.

For various reasons, we’ll opt for the encrypted drive here. We will use the utility cryptsetup, which is a front-end for a series of encryption standards. The ones that matter to us are:

dm-crypt: the default block device encryption in Linux
dm-crypt/LUKS: same as above, enhanced by LUKS management
loopAES: a different Linux encryption standard
tcrypt: compatible with TrueCrypt/TCPlay drives

If you opt for loopAES or tcrypt, you have to generate the encrypted drive on your own. We’ll stick with the Linux default, LUKS.

To create the encrypted drive on our new volume, /dev/encraid/lvm0, we simply issue the commands:

sudo cryptsetup luksFormat /dev/encraid/lvm0
sudo cryptsetup open --type luks /dev/encraid/lvm0 enclvm0

The former formats the drive: it creates the structures (headers) necessary for LUKS to manage the encryption. You will be asked a bunch of questions, including a passphrase. You can pass a lot of options to the command, and you should read the manual.

In the default case, luksFormat will ask you for a passphrase. Read the notes below on passphrases!

The open command does what a bunch of other commands did in this article: it creates yet another named device. Again, you can name it pretty much what you’d like. The only catch is that it will live in /dev/mapper, which means the name you pick cannot conflict with another file in that directory.

Once the open command is issued, you will have the device /dev/mapper/enclvm0 available. That one you can treat as a regular “hard drive partition.”

8. Picking a Passphrase

The greatest grief in the world is caused by passphrases. That’s for two opposing reasons:

They can be easily guessed
They can be easily forgotten

The problem with passphrases that can be easily guessed is that they don’t really secure your data. The problem is huge, because once your data is in someone else’s hands, they can try passwords all day long, millions of them a second. It’s not like the password to your computer, which requires typing and hence only a small number of tries a minute. If your data are compromised, you have to face cracking at industrial scale. Anything that is reasonably short will be cracked in a short time.

On the other hand, if you pick something too strange and long, you won’t be able to remember. That’s a huge problem, again, because there is absolutely no way to recover your data if you lose your passphrase. That’s particularly bad since the basic idea is to use this in case of an emergency, as explained in the primer. What do you do if all your belongings are stolen in Fiji, and you can’t remember your passphrase?

Of course, while you are mulling over your misfortunes at Cloudbreak and need to figure out your passphrase, you will also have to know the names of at least two of the three sync providers you used. That’s good, because it automatically provides hints for passphrase recovery. For instance, you could use the email address you log on with at either provider to build a mnemonic. Say the address is email@mrgazz.com. You could take the email part and make it a sentence, like “every morning alex issues lashings.” That’s fairly long, and relatively memorable.

To make it better, you should add numbers, symbols, and uppercase characters as you see fit. For instance, you could take the number above the first letter of each word and put it at the end of the word itself: since above E on the keyboard there is the number 3, you would write every3. For symbols, you could do the same but add the symbol corresponding to the last letter in front. Instead of every3, you’d get ^every3, since Y is below ^. You could count the letters in a word and capitalize it if the number is even: every has 5 letters, so we don’t capitalize; alex has 4, so we write Alex.

Of course, this is not the scheme I use. Nor should it be the scheme you use. But you could use this scheme as a template to add seeming randomness to your passphrase. With the scheme above, email@mrgazz.com turns into ^every3%morning7@Alex1@Issues8@Lashings9. That looks totally random, but if you remember the scheme that got you there, you can remember the entire monstrosity fairly easily.
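Just to show that the scheme is mechanical, here it is sketched as a pair of shell functions. This assumes a US QWERTY layout and lowercase words, the function names are invented for this example, and it is only an illustration of the template: typing a real passphrase into an interactive shell tends to leave it in your history.

```shell
# Letters grouped by keyboard column, with the digit at the top of each
# column and the symbol printed on that digit key (US QWERTY assumed).
KEYS='qazwsxedcrfvtgbyhnujmikolp'
DIGITS='11122233344455566677788990'
SYMBOLS='!!!@@@###$$$%%%^^^&&&**(()'

# mangle_word WORD: symbol for the last letter, then the word
# (capitalized if its length is even), then digit for the first letter.
mangle_word() {
    w=$1
    first=${w%"${w#?}"}
    last=${w#"${w%?}"}
    digit=$(printf %s "$first" | tr "$KEYS" "$DIGITS")
    symbol=$(printf %s "$last" | tr "$KEYS" "$SYMBOLS")
    if [ $(( ${#w} % 2 )) -eq 0 ]; then
        w=$(printf %s "$first" | tr '[:lower:]' '[:upper:]')${w#?}
    fi
    printf '%s%s%s' "$symbol" "$w" "$digit"
}

# mangle_phrase "WORD WORD ...": apply mangle_word to every word.
mangle_phrase() {
    out=
    for word in $1; do   # unquoted on purpose: split on spaces
        out=$out$(mangle_word "$word")
    done
    printf '%s\n' "$out"
}

mangle_phrase "every morning alex issues lashings"
# prints ^every3%morning7@Alex1@Issues8@Lashings9
```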

The advantage of a long passphrase is that it makes it very hard to simply try out all possible passphrases, which is called a brute force attack. If you take all numbers, symbols, uppercase, and lowercase letters, you have enough characters available that even short-ish passphrases become really hard to crack.

Of course, our passphrase is really long, but not really random. Someone with a little bit of intelligence could figure out that it’s a series of English words with “random” characters and capitalization. But you would have to have someone with intelligence to figure that out. That means you’d need someone that really wants your data, not just the average hacker. If you are a suspected terrorist and the NSA captured your data, you would be at a loss, and I am absolutely happy about that. But if your data falls into the hands of the Russian Mafia (just an example!) and they don’t have any reason to assume you are of particular importance, then you are fine.

9. Creating the File System

At this point, we have the equivalent of a hard drive partition. It’s still empty to the operating system, like when you plug in an SD card and the computer tells you it needs to be formatted.

We do that really easily by issuing one of the mkfs family of commands. Which one is up to you, and depends on what’s installed on your system. Popular choices are NTFS (the Windows default), VFAT (old Windows default), EXT2 (old Linux default), and EXT4 (new Linux default). Steer clear of FAT and MINIX, which are a little long in the tooth.

I should mention that, since we used mdadm, formatting the drive as VFAT or NTFS will not do much good for interoperability, unless the same tools are available on Windows. (They currently aren’t.)

Assuming you want EXT2, the command to issue is:

sudo mkfs.ext2 /dev/mapper/enclvm0

For other file systems, you would change that accordingly.

10. Mounting

And now we can finally access our precious new file system! We simply issue the command:

sudo mount /dev/mapper/enclvm0 /mnt/

and the drive is finally going to be available on the directory /mnt/.

Of course, the mapper is the one we just formatted. And you can pick any old directory you’d like for the mount. You should probably pick an empty directory, because all sorts of strange things could happen if you mounted the root, or /usr/bin, or some such. /mnt/ is there for mounting random odds and ends, you can count on it to be there, and on it being generally empty.

Permissions are controlled by options passed to the mount command, by the permissions on the directory (before it was mounted), and by the file system. Sadly, explaining that is beyond the scope of this howto. But you should now try to issue the following commands:

sudo touch /mnt/test.txt
sudo ls /mnt/

You should see the file test.txt in your /mnt/ directory, potentially along with a directory called lost+found.

If you issue the command to list the three data files into which we stored your encrypted file system, you should see them updated:

ls -l $DIR0/file0.dat $DIR1/file1.dat $DIR2/file2.dat

should give you a list of three files. If the time stamps are not current, try issuing the command sync before the ls. That should fix it. What happened is that one of the zillion layers we created stored the changes without passing them on to the layers below. The sync command makes sure that all layers are updated to their current state, as if you were about to turn off your computer.

You should also see how the three sync utilities start updating the file on the server side. Depending on the utility, this may be a brief process or a long one. See the comparison/shootout article for more details.

11. Script!

Since this was a scary long explanation, I added a script that automates this whole process. You should run it with root privileges (it will fail if you don’t). It will ask you a bunch of questions, some of them with defaults. The only ones that really matter are the initial location of the files, and the final mount point. The script will detect conflicts and work around them as much as it can, and will present you with a summary of its actions at the end.

You can find the script here.

12. Shutting Down

The sync command we encountered before is issued as a mandatory part of shutdown and suspension on Linux. That ensures you don’t have to worry about your data, since it is going to be stored before anything (regular) happens.

If you want to manually shut down the array, though – to prevent access, for instance – you simply follow the steps above in reverse.

You unmount the file system: sudo umount /mnt
You delete the loop devices: sudo losetup -d /dev/loop0 /dev/loop1 /dev/loop2

You don’t have to worry about the mappers that appeared as a consequence of all the other work: they are either automatically deleted or become useless.
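Taken literally, “the steps above in reverse” also include closing the LUKS mapping, deactivating the volume group, and stopping the array. That is the more conservative route if the goal is to prevent access, since the encrypted device stays readable to anyone with root access for as long as the LUKS mapping is open. A sketch, under the assumption that you used exactly the names from this article (/mnt, enclvm0, encraid, /dev/md0, /dev/loop0 to /dev/loop2); teardown_stack is an invented name:

```shell
# teardown_stack
# Undo each layer in reverse order of creation, stopping at the
# first command that fails.
teardown_stack() {
    sudo umount /mnt &&
    sudo cryptsetup close enclvm0 &&   # forget the key, remove /dev/mapper/enclvm0
    sudo vgchange -an encraid &&       # deactivate the volume group
    sudo mdadm --stop /dev/md0 &&      # stop the RAID array
    sudo losetup -d /dev/loop0 /dev/loop1 /dev/loop2
}
```

If you skipped the logical volume step, the vgchange line would simply be dropped.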

13. Remounting

When restarting your computer, or after a manual shutdown of the array, you need to start it again. To do so:

Create the loop devices
Assemble them into the RAID array (usually that’s automatic on boot, but since the devices don’t exist, the assembly will fail)
Open the LUKS container
Mount the mapper

This is a simplified version of the steps above. The parts where structures are created (like the array, or the file system) are omitted. You can get the remap script here.
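Independently of the linked script, the sequence can be sketched as follows. It assumes the names used throughout this article and that DIR0, DIR1, and DIR2 are set; remount_stack is an invented name. The vgchange line is an assumption on my part: LVM often activates the volume group by itself once the array appears, in which case the command changes nothing.

```shell
# remount_stack
# Bring the existing stack back up; nothing is created or formatted.
remount_stack() {
    sudo losetup /dev/loop0 "$DIR0/file0.dat" &&
    sudo losetup /dev/loop1 "$DIR1/file1.dat" &&
    sudo losetup /dev/loop2 "$DIR2/file2.dat" &&
    sudo mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 /dev/loop2 &&
    sudo vgchange -ay encraid &&
    sudo cryptsetup open --type luks /dev/encraid/lvm0 enclvm0 &&
    sudo mount /dev/mapper/enclvm0 /mnt/
}
```

The cryptsetup step is the one that asks for your passphrase.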

14. Advanced Topics

Are there alternatives to this setup?

Yes, and lots of them. For instance, losetup can encrypt on its own, using the -e option. This requires the kernel to have built-in encryption functions, which you should figure out yourself if you are so inclined. If you choose losetup encryption, you won’t need to encrypt the RAID array (the steps with cryptsetup can be omitted).

Also, and as mentioned above, you can skip the generation of volume groups and volumes. The main thing you lose is the flexibility to resize the array after the fact.

You can also not encrypt the entire drive and instead encrypt the files. For this purpose, you could use eCryptFS or encFS. The latter is particularly nice, because it is easily configured and allows you to copy portions of the tree in encrypted form.

Another option is to choose a different RAID level. For instance, you could use RAID 1 and limit yourself to 2 providers. Or you could gain more redundancy with RAID 6. Or you could choose smaller sync files using RAID 10.

You can also decide that you don’t need sync utilities, because you are only updating from a single computer. In that case, you can use the rsync utility over SSH to sync to any server to which you have root access. Because of the security built into this setup, you can get one of the cheap providers ($10 a year or so) and distribute your data across them. This is particularly useful if you already have three server accounts, anyway, as I do.
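A sketch of what the rsync variant could look like, with one data file going to each server. The destinations are passed in as arguments and are entirely up to you (something like user@host0.example.com:backup/ is a made-up placeholder); push_files is an invented name. The --inplace option asks rsync to update the remote file in place rather than writing a complete temporary copy first.

```shell
# push_files DEST0 DEST1 DEST2
# Flush pending writes, then send each data file to its own server.
push_files() {
    sync &&
    rsync -a --inplace -e ssh "$DIR0/file0.dat" "$1" &&
    rsync -a --inplace -e ssh "$DIR1/file1.dat" "$2" &&
    rsync -a --inplace -e ssh "$DIR2/file2.dat" "$3"
}
```

In this variant, the three files no longer need to live in directories watched by a sync tool.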

What are the dangers?

[Warning: this list is not exhaustive!]

Aside from the security risks highlighted in the primer (under threat model), the main issue you might encounter is a version conflict. That happens when you have two arrays mounted on different computers and they update independently. While that’s not particularly dangerous as long as you are the one making updates, as soon as there are automated changes to the file system, you can get into the situation where the sync finds out the files have been independently modified and doesn’t know what to do. In the worst case, one of the sets of changes can be overwritten entirely.

Also, in case of an emergency, you need to be able to access a computer that can synchronize the data files and rebuild the array. That of course means you need access to a Linux computer. You essentially have two different ways of going about it:

You download a current copy of Linux and run it as a LiveCD. Then you install the additional packages and the sync software, and download two of the three sync files.
You get an SSH account on some remote server. There, you perform all the steps required.

I strongly suggest option 2., especially because you can prepare for it ahead of time. You can install the entire stack, up to the download of the sync files, and then complete the setup when you need it. The advantage is that you don’t risk leaving files behind on whatever temporary system you’d use. The disadvantage, of course, is that you’d need to know how SSH works, how to use the command line, etc.

There is always the inherent possibility that you’d forget the passphrase. There is neither a foolproof way to store one, nor a foolproof way to remember one, so you’ll have to figure out something you are comfortable with. If you use a password manager like LastPass, that might be a good place to back it up. If you have access to a system with biometric access, that’s another option. If you use your sensitive files only away from home or the office, you could store an emergency copy there. In any case, you should probably not store the verbatim passphrase, but something that is guaranteed to remind you of the actual passphrase.

As in all cloud computing applications, you are toast if you don’t have a (fast enough) Internet connection.

If you use your email address as login credential and your email account is compromised, you might lose access to several of your sync accounts at the same time. The way this works is simple: your email account is hacked; the hacker has passwords reset on all accounts; the password change information is sent to the email account; you end up locked out of all sync accounts at once. It may be smart to use email accounts with different providers, each with its own unique password.

There is a slight chance some or all the software used here may become obsolete at some point. In that case, you would have to replace that particular layer with a different one, which forces you to keep current with the status of encryption software on Linux.

While this setup shields you from failure on a single account, it won’t protect you if two different accounts are locked, deleted, or otherwise unavailable. That’s not a tragedy, because you should still have the master copy from which you do the sync, but you have to make sure the synchronization actually happens. The problem could be that you don’t notice you have been locked out of one account by the time the second account is locked. Also, using two different accounts with the same provider may protect you from hackers (who presumably would hack one account first, and the second one when they find the password), but it wouldn’t protect you from the company going belly up.

What sync provider should I use?

It’s really your pick. There is going to be a comparison article published on this site, tackling specifically the different offerings’ suitability for this kind of application.