BTRFS and RAID1

We wanted to use BTRFS with RAID1 on our new servers and ran into a few hurdles, which are described here. You might want to read "The perfect Btrfs setup for a server" first, and maybe also "Using RAID with btrfs and recovering from broken disks", "Using Btrfs with Multiple Devices", and "Restoring UEFI Boot Entry after Ubuntu Update".

What this page offers over the others is a little better coverage of EFI issues, and how to deal with a failed disk.

We use Debian 9 (stretch) with a 4.11 kernel (from Debian testing) in this work. I use our concrete drive names here, so make extra sure to substitute your own drive names when doing this on your machine, or you may destroy valuable data. All commands need root permissions.

Installing RAID1 for BTRFS

Debian does not support Btrfs RAID out of the box, so the way to go is to install on BTRFS without RAID on one of the disk drives, leave a partition of the same size free on the other drive(s), and then do

btrfs device add /dev/sdb3 /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /
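
Once the balance has finished, it is worth confirming that data, metadata, and system chunks really use the RAID1 profile. The following is a minimal sketch, assuming the usual output format of btrfs filesystem df (lines such as "Data, RAID1: total=..., used=..."); the function name all_raid1 is made up for this page and is not part of btrfs-progs:

```shell
# Reads `btrfs filesystem df` output on stdin and succeeds only if every
# Data, Metadata and System line reports the RAID1 profile.
# GlobalReserve is ignored because it is always reported as "single".
all_raid1() {
    awk -F'[,:] *' '
        $1 == "Data" || $1 == "Metadata" || $1 == "System" {
            seen++
            if ($2 != "RAID1") bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Usage (as root):
#   btrfs filesystem df / | all_raid1 && echo "all chunks are RAID1"
```

If the check fails while a balance is still running, that is expected: during the conversion both the old and the new profile are listed.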

We also add "degraded" as a file system option in fstab, so that the file system still mounts when a device is missing; e.g.:

UUID=3d0ce18b-dc2c-4943-b765-b8a79f842b88 /               btrfs   degraded,strictatime        0       0

The UUID (check with blkid) is the same for both partitions in our RAID, so no need to specify devices.
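
The blkid check can be scripted. This is only a sketch: same_fs_uuid is a hypothetical helper name, and /dev/sda3 and /dev/sdb3 are our partitions, not necessarily yours.

```shell
# Succeeds if both partitions report the same file system UUID,
# which is what we expect for two members of one btrfs file system.
same_fs_uuid() {
    a=$(blkid -s UUID -o value "$1") || return 1
    b=$(blkid -s UUID -o value "$2") || return 1
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (as root):
#   same_fs_uuid /dev/sda3 /dev/sdb3 && echo "same UUID, fstab entry covers both"
```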

EFI and RAID1

There is no support for RAID1 and EFI in Linux or Debian, so what we did was to have one EFI system partition (ESP) on each drive, let the Debian installer install the EFI stub of grub on one of them, and then use dd to copy the contents of that ESP to the other ESP(s):

dd if=/dev/sda1 of=/dev/sdb1

This has to be repeated every time the contents of the ESP change, but they normally do not seem to change, even when running update-grub. On the other hand, it does not hurt to do the dd more often than necessary.
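
To avoid rewriting the second ESP when nothing has changed, the copy can be guarded with cmp. A minimal sketch, assuming both ESPs have exactly the same size (otherwise cmp always reports a difference); sync_esp is a hypothetical helper name:

```shell
# Copy SRC to DST only when their contents differ.
# Works on block devices (as root) and, for trying it out, on plain files.
sync_esp() {
    src=$1
    dst=$2
    if cmp -s "$src" "$dst"; then
        echo "ESP copies are identical, nothing to do"
    else
        dd if="$src" of="$dst" bs=1M &&
            echo "copied $src to $dst"
    fi
}

# Usage (as root):
#   sync_esp /dev/sda1 /dev/sdb1
```

Run this while nothing is writing to the source ESP, for the same reason you would not dd a file system that is being modified.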

We also needed to change /etc/grub.d/10_linux in different places than "The perfect Btrfs setup for a server" (which seems to be written for a BIOS/MBR system) indicates: Search for " ro " (two occurrences), and insert "rootflags=degraded" before it. One of these lines becomes

	linux	${rel_dirname}/${basename}.efi.signed root=${linux_root_device_thisversion} rootflags=degraded ro ${args}

In order for that to take effect, we had to

update-grub
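
The edit to 10_linux can also be done with sed instead of by hand. This is a sketch under the assumption that " ro " occurs only on the two linux lines, as it did in our version of the file; add_degraded_rootflag is a hypothetical helper name, and GNU sed (as on Debian) is assumed for the -i option. Check the result before running update-grub.

```shell
# Insert rootflags=degraded before every " ro " in the given file,
# unless the option is already present (so running it twice is harmless).
add_degraded_rootflag() {
    file=$1
    if grep -q 'rootflags=degraded' "$file"; then
        echo "already patched: $file"
        return 0
    fi
    sed -i 's/ ro / rootflags=degraded ro /g' "$file"
}

# Usage (as root); the backup is kept outside /etc/grub.d so that
# update-grub does not pick it up as another script:
#   cp -p /etc/grub.d/10_linux /root/10_linux.orig
#   add_degraded_rootflag /etc/grub.d/10_linux
#   grep -n 'rootflags=degraded' /etc/grub.d/10_linux
#   update-grub
```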

What to do on a failed disk

We disconnected one of the disks to simulate a disk failure (while the system was offline; doing it online would have been an interesting variant). Due to a bug in BTRFS, the file system degrades nicely on the first boot, but then becomes irreversibly read-only on the second boot. If you get there, the best option seems to be to copy the read-only file system to a fresh, writable file system (e.g., with tar or cpio). We actually went that far with one of the two drives we used, so the bug is still there in Linux 4.11.

You probably want to avoid these complications, and while you are still in your first degraded boot, you can. What we did (with the other disk) was to convert the file system back from RAID1 to a single profile and then remove the failed device (btrfs complains that it does not know the device, but still removes it).

btrfs balance start -v -mconvert=dup -dconvert=single /
btrfs device remove /dev/sdb3 /
# if the device node no longer exists, "btrfs device remove missing /" is the documented form
# now check that it has worked
btrfs device usage /
btrfs fi show
btrfs fi usage /
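
For a scripted check, one can look for the "Some devices missing" marker that btrfs filesystem show prints for a file system with an absent member. A minimal sketch, assuming that marker text; has_missing_device is a hypothetical helper name:

```shell
# Reads `btrfs filesystem show` output on stdin and succeeds if btrfs
# reports that a member device of the file system is missing.
has_missing_device() {
    grep -q 'Some devices missing'
}

# Usage (as root), after removing the failed device:
#   if btrfs filesystem show / | has_missing_device; then
#       echo "still degraded, do not reboot yet"
#   else
#       echo "no missing devices"
#   fi
```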

We then shut down the system, plugged in the replacement disk (actually the disk we had earlier ruined by booting degraded twice, after wiping its BTRFS partition), booted, and then did the usual dance to turn the now-single BTRFS into a RAID1 again:

btrfs device add /dev/sdb3 /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /

As a result, we had a RAID1 again.

If you wonder why we did not use btrfs replace: We would have had to connect the new disk before the second reboot, which is not always practical. With the method above, once we have converted the file system to the single profile, we can reboot as often as we like before getting the new drive online.

What about swap space?

For swap space, my recommendation is to use an md RAID1 (or, if you don't mind slowness, a swap file on the btrfs), or no swap at all. On one of our machines, we had swap space on non-RAIDed partitions (one partition on each drive). One drive disconnected, and pretty much all my processes failed the next time I tried to do something with them; apparently the backup program had pushed them out to swap space, and when they were needed again, their memory was gone.

However, we had not yet had this experience when we did our experiments with btrfs and RAID1, so on that server we did not use RAID1 for the swap partitions. When we rebooted with one of the disks disconnected, Debian 9 waited 90s for the second swap partition to appear, then continued booting. Nice graceful degradation (adding swap lazily would have been even better).
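
We did not set up the md RAID1 for swap on the server described here, so the following is only a sketch of how it could look. The partition names /dev/sda2 and /dev/sdb2 and the array name /dev/md0 are assumptions; the helper names run and make_swap_mirror are made up. By default the commands are only printed; they are executed when APPLY=1 is set.

```shell
# Print each command; run it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = 1 ]; then "$@"; fi
}

# Build a two-device md RAID1 and use it as swap.
# $1, $2: the two swap partitions (must not be in use as swap);
# $3: md device to create, default /dev/md0.
make_swap_mirror() {
    md=${3:-/dev/md0}
    run mdadm --create "$md" --level=1 --raid-devices=2 "$1" "$2" &&
    run mkswap "$md" &&
    run swapon "$md"
}

# Usage: review the printed commands first, then apply them as root:
#   make_swap_mirror /dev/sda2 /dev/sdb2
#   APPLY=1 make_swap_mirror /dev/sda2 /dev/sdb2
```

The swap entries in fstab then have to be changed to refer to the md device instead of the individual partitions.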

Anton Ertl