Building a distributed / redundant data storage backbone - part 1 of 2

I'm not a fan of losing data, so I always have some sort of redundant storage at my disposal. At present I possess over 10TB of irreplaceable data, code, photographs, etc. (and much more replaceable data - though replacing it would be an inconvenience), and housing that data is no easy task. Be it for backups, media storage, or even temp space, having a huge pool of "safe" storage is always a useful tool.

A bit of history first - we'll get into the cutting edge stuff, with more technical detail, in part 2.

My first adventure into this involved a number of SATA hard drives attached to a Mac Mini via USB 3.0. A Linux VM assembled these drives into a RAID 5 array and served their contents over SMB ('server message block' - Windows file sharing). Unfortunately this was in the early days of USB 3.0, so the pass-through adapter to the Linux virtual machine only supported USB 2.0 speeds, and the array maxed out at around 15MB/s sustained write. Not great, but at least it was safe.

Anyway, why RAID? RAID is an immensely valuable idea that allows the use of many independent hard or solid state drives for a single logical task. RAID can be used for speed gains beyond what an individual drive can handle, or for fault-tolerant storage where even the death of one or more drives causes no data loss - and in the best case, no interruption to services using the data. There are many different kinds of traditional RAID, with the most commonly used being:

  • RAID 0 - performance gains by striping data across multiple drives (no redundancy)
  • RAID 1 - redundancy by mirroring data on two or more drives
  • RAID 5 - redundancy by striping data plus parity across three or more drives, surviving the loss of any one drive
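
To make the parity idea concrete, here's a toy sketch in Python. It is not how any particular RAID implementation is written; the block contents and the `xor_blocks` helper are made up for illustration. The parity block is the XOR of the data blocks in a stripe, so any single missing block can be recomputed from the survivors.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per drive in a four-drive stripe.
d0, d1, d2 = b"some", b"data", b"here"

# The fourth drive holds the parity block for this stripe.
parity = xor_blocks(d0, d1, d2)

# Simulate losing the drive that held d1: rebuild it from what survives.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1
```

Real RAID 5 also rotates which drive holds the parity from stripe to stripe, which this sketch leaves out.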

These (and other) RAID levels can also be combined for a mix of the effects each level brings. For example, RAID 10 (1+0, a striped set of mirrors) is fairly common.
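
The trade-off between these levels is easiest to see as usable capacity. The sketch below is a rough calculator, assuming identical drives and, for RAID 10, two-way mirrors; the `usable_capacity` function and its arguments are hypothetical names, not part of any RAID tool.

```python
def usable_capacity(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity in TB for `drives` identical drives of `size_tb` each."""
    if level == "0":
        return drives * size_tb        # pure striping, nothing spent on redundancy
    if level == "1":
        return size_tb                 # every drive holds the same copy
    if level == "5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * size_tb  # one drive's worth of space goes to parity
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("this sketch assumes an even number of drives, four or more")
        return (drives // 2) * size_tb  # assumed two-way mirrors, striped together
    raise ValueError(f"unsupported level: {level}")


# Compare the levels for four 2TB drives.
for level in ("0", "1", "5", "10"):
    print(f"RAID {level}: {usable_capacity(level, 4, 2.0)} TB usable")
```

With four drives, RAID 5 keeps three drives' worth of space while RAID 10 keeps two, which is the price of the extra mirroring.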

Traditional RAID - or "hardware" RAID - typically involves a hardware expansion card dedicated to handling RAID computations and drives. The RAID card has a built-in processor, memory, and SATA ports to which the disks are attached. Up until recently, this sort of configuration was necessary for top performance - software RAID has been around for a while, but historically there was a performance gap. A risk of hardware RAID has always been the RAID card itself. RAID cards typically have an onboard memory module and a battery to keep the memory "alive". If either of these parts ceases to function - or any other part of the card - it's possible that all data on the RAID array can be lost! In my opinion, this isn't a viable option. Whether you have your data on a single drive or on a RAID array controlled by one RAID card, either way there is a single piece of hardware that can die and cause data loss. Not good.

Software RAID addresses this issue. It may have originally been created as a cheaper (and slower) alternative to hardware RAID, but I believe nowadays software options are just as valid and performant as hardware RAID. In a nutshell, software RAID does the same job as hardware RAID, but the computation and memory required are provided by the system's main CPU and RAM instead of a dedicated piece of hardware. Done right, these components can die or be swapped out with no fear of data loss.

In my opinion the most notable advances in software RAID tech are as follows:

  • Mdadm - often referred to as 'Linux software RAID' - was introduced in 2001 and quickly became the de facto software RAID tool.
  • ZFS - introduced in 2005 (and, from my perspective, catching on in the early 2010s) - was an entirely new beast.

While ZFS is impressive (and I'm using it today!), let's examine mdadm first so we can see how ZFS differs. Mdadm offers block-level RAID. In other words, it doesn't care which filesystem sits "on top" of it (and you do have to put one there). On Linux, mdadm consumes block devices and provides a virtual block device. So if the physical drives on your system are /dev/sda and /dev/sdb, mdadm can consume these and provide /dev/md0. From there, md0 can be treated just like a plain block device - create partition(s) and format them with a filesystem of choice.
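
As a rough illustration of that flow, the Python sketch below only builds and prints the commands; it never runs them, since doing so would wipe the member drives. The device names come from the example above, while the choice of a two-drive mirror (RAID 1), the ext4 filesystem, and the `mirror_setup_commands` helper are my own assumptions. For brevity it formats /dev/md0 directly and skips the partitioning step.

```python
import shlex


def mirror_setup_commands(members, array="/dev/md0", fs="ext4"):
    """Return the argv lists that would build a RAID 1 array and format it."""
    create = [
        "mdadm", "--create", array,
        "--level=1",                       # assumed: a simple mirror
        f"--raid-devices={len(members)}",
        *members,
    ]
    mkfs = [f"mkfs.{fs}", array]           # the filesystem goes on the virtual device
    return [create, mkfs]


# Print the commands for review instead of executing them.
for cmd in mirror_setup_commands(["/dev/sda", "/dev/sdb"]):
    print(shlex.join(cmd))
```

The point to notice is that the second command is handed /dev/md0 exactly as it would be handed a physical disk.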

My first and second storage arrays used mdadm. It was an easy choice, as tools that work with native block devices also work with mdadm's virtual block devices. I could partition the virtual device with common disk management tools like fdisk and use common filesystems like the ext family.

And I did exactly that - my first array consisted of ext4 on top of mdadm. And it worked! ...until I needed an array bigger than 16TB. Because of how ext4 was built at the time (and integrated into Linux operating systems), it couldn't handle filesystems larger than 16TB. Considering the ext family originated in the early 90s and focused on backwards compatibility, there wasn't a workaround. Back to the drawing board.

What's nice about Linux is that filesystems are interchangeable - if the system can mount a filesystem, every program on the system that depends on file access can use it, along with all the features a program expects a filesystem to support. A beauty of modern system design. An up-and-coming filesystem around the time I needed to expand was btrfs. With a maximum supported filesystem size of 16EB (exabytes!), or about a million times larger than ext4's limit, I was sure I'd never outgrow it. So I live-converted my array to btrfs and it was full steam ahead from there!
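
The "million times" figure is easy to sanity-check. The arithmetic below assumes the 16TB ceiling came from 32-bit block numbers combined with 4KiB blocks (my assumption about the setup, not something I measured), and it treats TB and EB as the binary units TiB and EiB.

```python
TIB = 2**40  # bytes in one tebibyte
EIB = 2**60  # bytes in one exbibyte

block_size = 4096   # assumed 4 KiB filesystem blocks
max_blocks = 2**32  # assumed 32-bit block numbers

ext4_limit = block_size * max_blocks  # largest addressable filesystem, in bytes
btrfs_limit = 16 * EIB                # the 16EB figure quoted above

assert ext4_limit == 16 * TIB

ratio = btrfs_limit // ext4_limit
print(f"btrfs limit is {ratio:,} times the ext4 limit")
```

The ratio works out to 2**20, so "a million" is a fair rounding.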

Great! So in early 2011 I had a storage backbone capable of supporting an astronomical amount of data, the flexibility to expand over time, and the safety of software RAID. What more could be desired? Well, the world of computing constantly improves itself, so I wasn't happy sticking with a single methodology. But more on that, and ZFS especially, in part 2!
