ZFS on Linux with Encryption Part 2: The Compiling

First off: Warning. I don’t know what the stability of this feature is. It’s been in the code for a couple of months, it hasn’t been widely used. I’ve been testing it, and so far it’s worked as expected. 

In exploring native encryption, I attempted to get it on Linux/ZFS using the instruction on this site: https://blog.heckel.xyz/2017/01/08/zfs-encryption-openzfs-zfs-on-linux/. While I’m sure they worked at the time, the code in the referenced non-standard repos has changed and I couldn’t get anything to compile correctly.

After trying for about a day, I realized (later than I care to admit) that I should have just tried the standard repos. They worked like a charm. The instructions below compiled and successfully installed ZFS on Linux with dataset encryption on both Ubuntu 17.10 and CentOS 7.4 in the November/December 2017 time frame.

Compiling ZFS with Native Encryption

The first step is to make sure a development environment is installed on your Linux system. Make sure you have compiler packages, etc. installed. Here’s a few packages for CentOS you’ll need (you’ll need similar packages/libraries for whatever platform you run).

  • openssl-devel
  • attr, libattr-devel
  • libblkid-devel
  • zlib-devel
  • libuuid-devel

The builds were pretty good at telling you what packages you needed if they were missing, so of course install any that are requested.

You’ll need to build the SPL code and the ZFS code.

First, build the SPL code.

git clone https://github.com/zfsonlinux/spl
cd spl
make install

Then the ZFS code:

git clone https://github.com/zfsonlinux/zfs
cd zfs
./configure --prefix=/usr  # <-- This puts the binaries in /usr/sbin instead of /usr/local/sbin
make install

If you try the zfs command right away, you’ll probably get something similar to the following:

/sbin/zfs: error while loading shared libraries: libnvpair.so.1: cannot open shared object file: No such file or directory

Running ldconfig usually fixes that.

You might need to modprobe zfs to get the modules loaded, especially if you end up rebooting. There’s of course ways to auto-load the modules depending on your distribution.

Creating the zpool

zpool create -o ashift=12 storage raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

The -o ashift=12 is important if you have 4K sector drives, which these 8 TB WD Reds are. If you don’t throw that option in, your performance will suffer, big-time. I found my pool performed about 25% of what it did when ashift=12 was selected.

I was doing copy tests with Samba and getting only 25-30 MB/s. Once I destroyed the zpool and used ashift=12 for a new zpool on the same drives, I was able to get ~120 MB/s, which is the practical limit for a 1 Gigabit link (1,000 Gbit / 8 = 125 MB/s). Local copies were faster. Figure this out ahead of time, because to set the ashift you have to do zpool destroy, which does what it sounds like it does: Destroys the pool (and data).

The zpool will be called “storage” (yes, original) so of course use whatever name you prefer. raidz2 uses a double-parity system, so out of 6 drives, I would get a pool with the space of 4 of them (roughly 32 TBs).

The rest are the devices themselves. You don’t need to partition the drives, ZFS does it automatically.

Encryption is done on a dataset by dataset basis, which is nice to be able to have some storage be encrypted and other parts not. To create an encrypted dataset, first enable the feature in the zpool.

zpool set feature@encryption=enabled storage

Then create a new dataset under the storage zpool using a passphrase (you can also use a keyfile, but I’m opting for a passphrase):

zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase storage/encrypted

Anything you put in /storage/encrypted/ will now be encrypted at rest.

When the system comes up, the zpool could be automatically imported (or you have to import it manually) but the /storage/encrypted/ dataset won’t be automatically added.

# zpool import storage
# zfs mount storage/encrypted -l
# Enter passphrase for 'storage/encrypted':

Once you enter the passphrase, the dataset is mounted.

ZFS and Linux and Encryption Part 1: Raining Hard Drives

(Skip to Part II to learn how to install ZFS with encryption on Linux)

Best Buy has been having a constant series of sales on WD Easy Store 8 TB drives. And it turns out, inside many of them (though not all) are WD Red NAS 5400 RPM drives. For $130-180 a piece, that’s significantly less than the regular price on Amazon/Newegg for these drives bare, which is around $250-$275.

(For updates on the sales, check out the subreddit DataHoarder.)


Over the course of several months, I ended up with 6 WD Red NAS 8 TB drives. Which is good, because my current RAID array is starting to show its age, and is also really, really full.

If you’re not familiar with the WD NAS Red’s, they’re drives specifically built to run 24/7. The regular WD Reds are 5400 RPM, so they’re a bit slower than a regular desktop drive (the Red Pro are 7200 RPM), but I don’t really care for my workload. For speed I use SSDs, and these drives for bulk storage. Plus, the slower speeds mean less heat and less power.

My current array is made of (5) 3 TB drives operating at RAID 5 for a total of about 12 TB usable. The drives are about 5 years old now, with one of them already throwing a few errors. It’s a good time to upgrade.

I’ve shucked the 8TB Reds (the process of removing the Red’s from their external case) placed the bare drives in a server case.

So now, what to do with them? I decided this was a good time to re-evaluate my storage strategy and compare my options.

My current setup is a pretty common one: It’s a Linux MD (multi-device) array with LVM (Linux Volume Manger) on top and encrypted with LUKS. It’s presented as a single 12 TB block device which has the ext4 file system on top of it.

It works relatively well, though it has a few disadvantages:

  • It takes a long time to build (days) and presumably a long time to rebuild if a drive fails and is replaced
  • It’s RAID 5, so if I lose a drive while it’s rebuilding from a previous fail, my data is toast. A common concern for RAID 5.

Here’s what I’d like:

  • Encryption: Between things like tax documents, customer data, and my Star Trek erotic fan fiction, I want data-at-rest encryption.
  • Double-parity. I don’t need the speed of RAID 10/0+1, I need space, so that means RAID5/6 or equivalent. But I don’t want to rely on just one drive, so double party (RAID 6 or equivalent).
  • Checksumming would be nice, but not necessary. I think the bit-rot argument is a little over-done, but I’m not opposed to it.

So that leaves me with ZFS (on FreeBSD or Linux) or Linux MD. I generally have a preference to stick with Linux, but if its something like FreeNAS, I’m not opposed to it.

Boyh ZFS and btrfs offer checksumming, however the RAID 5/6 parity implementation on btrfs has been deemed unsafe at this point. So if I want parity and checksumming (which I do), ZFS is my only option.

For checksumming to be of any real benefit the file system must control block devices directly. If you put them in a RAID as a single device and lay the checksumming filesystem on top of it, the only thing the checksumming can do is tell you that your files are fucked up. It can’t actually fix them.



Layered: File system on top of encryption on top of MD RAID array

The layered approach above is how my old array was done. It works fine, however it wouldn’t provide any checksumming benefit. Btrfs (or ZFS) would just have a single block device from its perspective, and couldn’t recover a bad checksum from another copy.

(Turns out you can have a single block device and recover from a bad checksum if you set ZFS to make more than one copy of the data, which of course takes more space)


ZFS encryption in FreeBSD and current ZFS on Linux: ZFS on top of encrypted block devices

ZFS encryption on FreeBSD and current ZFS on Linux is handled via a disk encryption layer, LUKS on Linux and Geli on FreeBSD. The entire drive is encrypted and the encrypted block devices are controlled by ZFS. You can do this with btrfs as well, but again the RAID5/6 problems makes it out of the question.


Native encryption with ZFS on Linux

New to ZFS on Linux is native encryption within the file system. You can, on a dataset by dataset basis, set encryption. It’s done natively in the file system, so there’s no need to run a separate LUKS instance.

It would be great it btrfs could do native encryption (and fix the RAID5/6 write hole).  In fact, the lack of native encryption has made Red Hat pull btrfs from RHEL.

Part II is how I got ZFS with native encryption working on my file server.

Do We Need Chassis Switches Anymore in the DC?

While Cisco Live this year was far more about the campus than the DC, Cisco did announce the Cisco Nexus 9364C, a spine-oriented switch which can run in both ACI mode and NX-OS mode. And it is a monster.

It’s (64) ports of 100 Gigabit. It’s from a single SoC (the Cisco S6400 SoC).

It provides 6.4 Tbps in 2RU, likely running below 700 watts (probably a lot less). I mean, holy shit.


Cisco Nexus 9364C: (64) ports of 100 Gigabit Ethernet.

And Cisco isn’t the only vendor with an upcoming 64 port 100 gigabit switch in a 2RU form factor. Broadcom’s Tomahawk II, successor to their 25/100 Gigabit datacenter SoC, also sports the ability to have (64) 100 Gigabit interfaces. I would expect the usual suspects to announce switches based on these soon (Arista, Cisco Nexus 3K, Juniper, etc.)

And another vendor Innovium, while far less established, is claiming to have a chip in the works that can do (128) 100 Gigabit interfaces. On a single SoC.

For modern data center fabric, which rely on leaf/spine Clos style topologies, do we even need chassis anymore?

For a while we’ve been reliant upon the Sith-rule on our core/aggregation: Always two. A core/aggregation layer is a traditional (or some might say legacy now) style of doing a network. Because of how spanning-tree, MC-LAG, etc., work, we were limited to two. This Core/Aggregation/Access topology is sometimes referred to as the “Christmas Tree” topology.


Traditional “Christmas Tree” Topology

Because we could only have two at the core and/or aggregation layer, it was important that these two devices be highly redundant. Chassis would allow redundancy in critical components, such as fabric modules, line cards, supervisor modules, power supplies, and more.

Fixed switches tend to not have nearly the same redundancies, and as such weren’t often a good choice for that layer. They’re fine for access, but for your host’s default gateways, you’d want a chassis.

Leaf/spine Clos topologies, which relies on Layer 3 and ECMP, and isn’t restricted the same way Layer 2 spanning-tree and MC-LAG is, is seeing a resurgence after having been banished from the DC because of vMotion.


Leaf/Spine Clos Topology


Modern data center fabrics utilize overlays like VXLAN to provide layer 2 adjacencies required by vMotion. And again we’re not limited to just two devices on the spine layer: You can have 2, 3, 4.. sometimes up to 16 or more depending on the fabric. They don’t have to be an even number, nor do they need to be a power of two now that most switches use a higher than 3-bit hash for ECMP (the 3-bit hash was the origin of the previous powers of 2 rule for LAG/ECMP).

Now we have an option: Do leaf/spine designs concentrate on larger, more port-dense chassis switches for the spine, or do we go with fixed 1, 2, or 4RU spines?

The benefit of a modular chassis is you can throw a lot more ports on them. They also tend to have highly redundant components, such as fans, power supplies, supervisor modules, fabric modules, etc. If any single component fails, the chassis is more likely keep on working.

They’re also upgradable. Generally you can swap out many of the components, allowing you to move from one network speed to the next generation, without replacing the entire chassis. For example, on the Nexus 9500, you can go from 10/40 Gigabit to 25/100 Gigabit by swapping out the line cards and fabric modules.

However, these upgrades are pretty expensive comparatively. In most cases, fixed spines would be far cheaper to swap out entirely compared to upgrading a modular chassis.

And redundancy can be provided by adding multiple spines. Even 2 spines gives some redundancy, but 3, 4, or more can provide better component redundancy than a chassis.

So chassis or fixed? I’m leaning more towards a larger number of fixed switches. It would be more cost effective in just about every scenario I can thing of, and still provides the same forwarding capacity of a more expensive chassis configuration.

So yeah, I’m liking the fixed spine route.

What do you think?



Fibre Channel of Things (FCoT)

The “Internet of Things” is well underway. There are of course the hilarious bad examples of the technology (follow @internetofshit for some choice picks), but there are many valid ways that IoT infrastructure can be extremely useful.  With the networked compute we can crank out for literally pennies and the data they can relay to process, IoT is here to stay.

Hacking a dishwasher is the new hacking a gibson

But there’s one thing that these dishwashers, cars, refrigerators, Alexa’s, etc., all lack: Access to decent storage.

The storage on many IoT devices is either terrible or nonexistent. Unreliable flash storage or no storage at all. That’s why the Fibre Channel T19 working group created a standard for FCoT (Fibre Channel of Things). This gives small devices access to real storage, powered by arrays not cheap and unreliable local flash storage.

The FCoT suite is a combination of VXSAN and FCIP. VXSAN provides the multi-tenancy and scale to fibre channel networks, and FCIP gives access to the VXSANs from a variety of FCaaS providers over the inferior IP networks (why IoT devices chose IP instead of FC for their primary connectivity, I’ll never know). Any IoT connected device can do a FLOGI to a FCaaS service and get access to a proper block storage. Currently both Amazon Web Services and Microsoft Azure offer FCoT/FCaaS services, with Google expected to announce support by the end of June 2017.

Why FCoT?

Your refrigerator probably doesn’t need access to block storage, but your car probably does. Why? Devices that are sending back telemetry (autonomous cars are said to produce 4 TB per day) need to put that data somewhere, and if that data is to be useful, that storage needs to be reliable. FCaaS provides this by exposing Fibre Channel primitives.

Tiered storage, battery backed-up RAM cache, MLC SSDs, 15K RPM drives, these are all things that FCoT can provide that you can’t get in a mass-produced chip with inexpensive consumer flash storage.

As the IoT plays out, it’s clear that FCoT will be increasingly necessary.


Video: Newbie Guide to Python and Network Automation

Why We Wear Seat Belts On Airplanes

This post is inspired by Matt Simmons‘ fantastic post on why we still have ashtrays on airplanes, despite smoking being banned over a decade ago. This time, I’m going to cover seat belts on airplanes. I’ve often heard people balking at the practice for being somewhat arbitrary and useless, much like balking at turning off electronic devices before takeoff. But while some rules in commercial aviation are a bit arbitrary, there is a very good reason for seat belts.


In addition to being a very, very frequent flier (I just hit 1 million miles on United), I’m also a licensed fixed wing pilot and skydiving instructor. Part of the training of any new skydiver is what we call the “pilot briefing”. And as part of that briefing we talk about the FAA rules for seat belts: They should be on for taxi, take-off, and landing. That’s true for commercial flights as well.

Some people balk at the idea of seat belts on commercial airliners. After all, if you fly into the side of a mountain, a seat belt isn’t going to help much. But they’re still important.


Your Seat Belt Is For Me, My Seat Belt Is For You

In a car, the primary purpose of a seat belt is to protect you from being ejected, and to keep you in one place so the car around you (and airbags) can absorb the impact of an impact. Another purpose, one that is often overlooked, is to keep you from smashing the ever loving shit out of someone who did wear their seat belt.

In skydiving, we have a term that encompasses the kinetic and potential energy contained within the leathery sacks of water and bones known as humans: Meat missiles. Unsecured cargo, including meat missiles, can bounce around the inside of airplanes if there’s a rough landing or turbulence. With all the energy and mass, we can do a lot of damage. That’s why flight attendants and pilots punctuate their “fasten you seat belt” speech with “for your safety and the safety of those around you“.

A lot of people don’t realize that if you don’t wear a seat belt, you’re endangering those around you as much as, or more so, than yourself. Your seat belt doesn’t do much good if a meat missile smashes into you. Check out the GIF below:

In the GIF, there’s some sort of impact and as a result the unsecured woman on the left smashes into the secured woman on the right. It’s hard to tell how bad they were hurt, though it could have been a lot worse having two heads smash into each other. The side airbag doesn’t do much good if one solid head hits another solid head. Had the woman on the left had her seat belt on it’s likely their injuries would be far less severe.

While incidents in commercial aviation are far more rare than cars, there can be rough landings and turbulence, both expected and unexpected, and even planes colliding while taxing. Those events can cause enough movement to send meat missiles flying, hence the importance of seat belts.

Commercial aviation is probably the safest method of travel, certainly safer than driving. But there is a good reason why we wear seat belts on airplanes.So buckle up, chumps.

Did VMware vSphere 6.0 Remove the Layer 2 Adjacency Requirement For vMotion? No.


I’ve seen this misconception a few times on message boards, reddit, and even comments on this blog: That Layer 2 adjacency is no longer required with vSphere 6.0, as VMware now supports Layer 3 vMotion. The (mis)perception is that you no longer need to stretch a Layer 2 domain between ESXi hosts.

That is incorrect. VMware did remove a Layer 2 adjacency requirement for the vMotion Network, but not for the VMs. Lemme explain.

It used to be (before vSphere 6.0) that you were required to have the VMkernel interfaces that performed vMotion on the same subnet. You weren’t supposed to go through a default gateway (though I think you could, it just wasn’t supported). So not only did your VM networks need to be stretched between hosts, but so did your VMkernel interfaces that performed the vMotion sending/receiving.

What vSphere added was a separate TCP/IP stack for vMotion networks, so you could have a specific default gateway for vMotion, allowing your vMotion VMkernel interfaces to be on different subnets.

This does not remove the requirement that the same Layer 2 network exist on the sending and receiving ESXi host. The IP of the VM needs to be the same, so the VM network you vMotion to needs to have the same default gateway (for outbound packets) and inbound routing (for inbound packets).

Inside of a data center this adjacency is typically done by simply making the same VLAN available (natively or now through VXLAN) on all the ESXi hosts in the cluster.

If it’s between datacenter, things tend to get a more complicated. As in dumpster fire. Here’s a presentation I recently did on the topic, and Ivan Pepelnjak has far more high-brow explanations of why it’s a bad idea.

You’ll need solutions like LISP (for inbound), FHRP filtering (for outbound), OTV (for stretching the VLAN), and a whole host of other solutions to handle all the other problems long distance vMotion can introduce.

Screen Shot 2016-05-24 at 12.20.33 PM.png

Where is your God now?!?!?

So when you hear that vSphere 6 no longer requires Layer 2 adjacency between ESXi hosts, that’s only for the vmkernel interfaces, not the VM networks. So yes, Virginia, you still need Layer 2 adjacency for vMotion. Even in vSphere 6.0.


Long Distance vMotion Is A Dumpster Fire

In this screencast, I go on a rant about why long-distance vMotion is a dumpster fire. Seriously, don’t do it.

Fibre Channel in the Cloud: FCaaS

Public cloud providers such as Amazon Web Services, Microsoft Azure, and Rackspace, as well as private cloud systems such as OpenStack, have dominated the computing landscape for the past several years. And once a joke of a marketing term (remember Larry Ellison’s super villain-monologue on the topic?), the cloud is now A Thing, with a definition and everything.

One technology that seemed like it was getting left behind in all these cloud games, however, was Fibre Channel. Ephemeral compute nodes, object storage, extreme scale, elastic provisioning — all of these were characteristics that were initially thought to be bad fits for Fibre Channel.


Sad Fibre Channel is Sad

As it turns out, Fibre Channel is right at home in the cloud.


Amazon Web Services has recently rolled out Fibre Channel as a Service (FCaaS), as have Rackspace, Digital Ocean, and Microsoft Azure.

All of those public cloud providers have some sort of block storage offerings, but they’re typically based on something like iSCSI or another back-end block protocol. Customers have been demanding the kind of block storage in the public cloud, where they can control zoning and zonesets, just like they do in their traditional data centers worlds.

The problem with that historically is that AWS and the others haven’t been able to provide this to customers because of the limitations of Fibre Channel at scale. I’ll explain.

Fibre Channel uses FC_IDs, which are like IP addresses, to send Fibre Channel frames around a given SAN. Here’s an FC_ID: 0x510121.

It’s a 24-bit number, typically written in hexadecimal notation. The first octet (two digits) is known as the domain ID. This is given to the switch, so that means there’s a limit of about 240 or so switches in a given fabric (some domain IDs are reserved). Plus, the two vendors of Fibre Channel switches (Brocade and Cisco) limit domain IDs to a maximum of 50 or so, so no more than 50 or so switches for a given fabric.

For a private data center with a single tenant, this isn’t a problem as a 50 switch Fibre Channel fabric is huge. But for Amazon, 50 switches is miniscule.

So enter VXSAN. The SNIA introduced VXSAN recently under the T18 working group, which provides an extension of typical Fibre Channel frame formats. Like VXLAN, VXSAN adds a higher degree of segmentation.

Cisco has VSANs of course, and Brocade has Virtual Fabrics. Neither are compatible with each other, and neither provide the additional scale required to handle massive cloud scale. VXSAN fixes both of those. VXSAN will work on a traditional Fibre Channel SAN from either Brocade or Cisco, without modification through use of the Open Virtual Fibre Channel Switch.

Wait, what?

That’s right, part of any VXSAN implementation is the Open Virtual Fibre Channel Switch (kind of a mouthful, even with the acronym OVFCS).

Similar to how VXLAN operates an overlay network on a traditional IP network as an underlay, VXSAN operates as an overlay SAN on top of a traditional Fibre Channel SAN.

Instead of VTEPs, OVFC switches terminate the VXSAN segments into virtualization hosts and VXSAN aware storage arrays (both EMC and NetApp have them in their latest software revs) to terminate the VXSAN-applied LUN to the a given virtual machine.


A given virtualization host has two virtual Fibre Channel switches (A/B), each connected to their own Fibre Channel interface (A/B).


The virtual Fibre Channel switches rely on upstream NPIV to get their connectivity, so they can run alongside the hypervisor’s traditional SCSI subsystem. In the example below, both virtual Fibre Channel switches do FLOGIs, as does the hypervisor.


The virtual machines, however, to a vFLOGI into the VXSAN segment, not into the traditional switching infrastructure. The upstream physical switches have no idea a FLOGI happened from the VM.


The VXSAN header, like VXLAN, has a 24-bit address space, providing 16 million segments, each with their own VXSAN fabric capable of having a full Fibre Channel fabric with up to 239 virtual Fibre Channel switches each. So while 239 Fibre Channel switches won’t work for Amazon, 3.8 billion will (16 million x 239).

You will have to enable Fibre Channel jumbo frames on your traditional Fibre Channel fabric, as the VXSAN header adds 62 bytes to the frame format.

VXSAN is designed to run on VXSAN-unaware switches, as it takes for new header formats to make it into silicon, but both Cisco and Brocade have said they plane to release VXSAN-aware switches by the end of the year.

VXSAN is built to be mulit-tenant, so customers from Amazon and others can do their own zoning. I got to play with a Beta of the FCaaS from AWS and I did just a quick configuration with a single VM and a virtual LUN.

First, you log into the A or B virtual Fibre Channel switches. There’s no password, you use the keys you’ve uploaded into Amazon.

Linux Foundation Open Virtual Fibre Channel Switch (Read the Apache 2.0 License for licensing details)
switch# config
switch#(config) zone Host1
switch(config-zone)# member pwwn 20:00:00:12:34:45:67:aa
switch(config-zone)# member pwwn 50:00:00:00:00:ab:cd:ef

I was able to push a zoneset and connected my instance to storage pretty quickly. All in all, it only took about 10 minutes to get it up and running.

OpenStack is prepating to include FCaaS and the Open Virtual Fibre Channel Switch in with the next release (Mikata) due out this month.

So check out FCaaS on Amazon, Azure, and the others. FCaaS should bring Fibre Channel into the cloud world.

Edit: Also, this is an April Fool’s joke. 5 years running.

Traditional versus Cloud Native Web Applications

Here’s a quick whiteboard session of the differences between traditional and cloud native web applications.