Wow: NVMe and PCIe Gen 4

September 16, 2019

Recently it’d come to my attention that my old PC rig wasn’t cutting it.

Intel i7 950
18 GB of RAM
X58 Asus-based motherboard
Sandisk 1 TB SSD
2 x 8 TB Shucked Best Buy Easystore Hard Drives in Windows Storage Spaces (see why I didn’t do parity storage because of the Microsoft shit-show)
1 Gigabit Intel NIC
NVidia GTX 980

Considering it was 10 years old, it was doing really well. I mean, I went from HDD to 500 GB SSD to 1 TB SSD, upped the RAM, and replaced the GPU at least once. But still, it was a 4-core system (8 threads) and it had performed admirably.

The Intel NIC was needed because the built-in ASUS Realtek NIC was a piece of crap, only able to push about 90 MB/s. The Intel NIC was able to push 120 MB/s (close to the theoretical max for 1 Gigabit which is 125 MB/s).

The thing that broke the camel’s back, however, was video. Specifically 4K video. I’ve been doing video edits and so forth in 1080p, but moving to 4K and the power of Premiere Pro (as opposed to iMovie) was just killing my system. 1080p was a challenge, and 4K made it keel over.

I tend to get obsessive about new tech purchases. My first flat screen TV purchase in 2006 was the result of about a month of in-depth research. I pore over specs and reviews for everything from parachutes (btw, did you know I’m a skydiver?) to RAM.

Eventually, here’s the system I settled on:

Ryzen 7 3700x CPU (8 cores/16 threads, 3.6 GHz boost to 4.4 GHz)

ASRock Steel Legend: https://www.newegg.com/p/N82E16813157894?Item=N82E16813157894

ASRock Radeon 5700XT: https://www.newegg.com/asrock-radeon-rx-5700-xt-rx-5700-xt-challenger-d-8g-oc/p/N82E16814930020?Item=N82E16814930020

32 GB of Corsair Vengeance RAM https://www.newegg.com/corsair-32gb-288-pin-ddr4-sdram/p/N82E16820236454?Item=N82E16820236454

1 TB NVMe M.2 https://www.amazon.com/gp/product/B07TBBB9BQ/ref=ppx_yo_dt_b_asin_title_o01_s00?ie=UTF8&psc=1

Case: https://www.newegg.com/matte-white-black-nzxt-h-series-h510i-atx-mid-tower/p/N82E16811146320?Item=N82E16811146320

Power Supply: https://www.newegg.com/corsair-rm-series-rm850-cp-9020196-na-850w/p/N82E16817139248?Item=N82E16817139248

AMD came out of nowhere and launched third-generation Ryzen (the 3000 series), which took AMD from a budget has-been to a major contender in the desktop world. Plus, they were the first to come out with PCIe Gen 4.0, which gives each PCIe lane roughly 2 GB/s of bandwidth. M.2 drives can connect to 4 lanes, giving a possible throughput of about 8 GB/s.

Compare that with SATA 3, at 600 MB/s, and that’s quite a difference. SATA is fine for spinning rust, but it’s clear NVMe is the only way to unlock SSD storage’s potential.
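To put the two numbers side by side, here is a minimal sketch of the arithmetic. It uses the round 2 GB/s-per-lane figure from above; pcie_bandwidth_mbs is a made-up helper name, not part of any real tool.

```shell
#!/bin/sh
# pcie_bandwidth_mbs <MB/s per lane> <lanes>
# Hypothetical helper: multiplies the per-lane rate by the lane count.
pcie_bandwidth_mbs() {
  echo $(( $1 * $2 ))
}

gen4_x4=$(pcie_bandwidth_mbs 2000 4)   # PCIe 4.0, four lanes, round numbers
sata3=600                              # SATA 3 figure quoted in the text

echo "PCIe 4.0 x4: ${gen4_x4} MB/s"
echo "SATA 3:      ${sata3} MB/s"
echo "Ratio:       $(( gen4_x4 / sata3 ))x (integer division)"
```

These are link ceilings, not drive speeds; the drive itself is rated lower than the link.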

When I built the system, I initially installed Linux (CentOS 7.6, to be exact) just to run a few benchmarks. I was primarily interested in the NVMe drive and the throughput I could expect. The drive advertises 5 GB/s reads and 4.3 GB/s writes.

Using dd if=/dev/zero of=testfile and using various blocksizes and counts to write a 100 GB file, I was able to get about 2.8 GB/s writes. Not quite what the drive had promised in terms of writes, but much better than the 120. I was able to get about 3.2 GB/s reads.
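For reference, the dd test described above can be wrapped up roughly like this. The seq_write_test name and the example path are made up for illustration, and conv=fsync is an addition of this sketch (it may or may not match the exact flags used for the numbers above).

```shell
#!/bin/sh
# seq_write_test <target file> <block size in bytes> <block count>
# Hypothetical wrapper around a dd sequential-write test.
seq_write_test() {
  # conv=fsync makes dd flush the output file to disk before it reports
  # its timing, so the page cache does not inflate the result.
  dd if=/dev/zero of="$1" bs="$2" count="$3" conv=fsync
}

# Example (writes 100 GiB in 1 MiB blocks; make sure the target has room):
# seq_write_test /path/to/nvme/testfile 1048576 102400
```

dd prints the elapsed time and throughput on stderr when it finishes.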

For various reasons (including that while Linux is a fantastic OS in lots of regards, it still sucks on the desktop, especially for my particular needs) I loaded up Windows 10. CrystalDiskMark is a good free benchmark and I was able to test my new NVMe drive there.

I ran it, thinking I’d get the same results from Linux. Nope!

I got pretty much what the drive promised.

As a comparison, here’s how my old SATA SSD fared:

About 10x performance. Here’s a couple of takeaways:

PCIe 4 does matter for storage throughput. Would I actually notice in my day-to-day operations the difference between PCIe 3 and PCIe 4? Probably not. But I’m working with 4K video, and some people are already working with 6K and even 8K video; that’s not too far down the line for me.

SATA is dead for SSD storage. The new drives are more than capable of utterly overwhelming SATA 3 (600 MB/s, LOL). Right now, SATA is sufficient for HDDs, but as platters get bigger, sequential reads will continue to climb.

I don’t doubt that Linux can do the same; it’s just that my methodology failed me. The dd command from /dev/zero had never failed to be the best way to test write speeds for HDDs and SATA SSDs, but now I need to find another method for Linux (or perhaps there is some type of bottleneck in Linux).

TL;DR

New PCIe 4 NVMe SSDs are super fast and can be had for a relatively low amount of money ($180 USD for 1 TB). They’re insanely fast.

I need a new way to benchmark Linux storage.

Filed under Uncategorized

For ESXi: Realtek NICs Are Awful And Don’t Use Them

May 24, 2019

OK, this isn’t really a controversial opinion. This is more of a guide for those who run into these problems when trying to set up their first whitebox/homelab systems for ESXi.

So it goes something like this: You’ve got an old desktop, gaming rig, or workstation. You decide you’ll retire it to your home data center (or basement, or laundry room) as a hypervisor. ESXi by itself (no vSphere controller) is free, and here’s how to download and get the license key.

For most desktop/workstation type of hardware, you can install ESXi from the general ESXi installer except for one aspect: Many of these types of systems use Realtek, Marvell, or other desktop/consumer grade NICs, and there’s not an ESXi driver for these. And for good reasons: They suck.

So you have the choice: Try to use a special custom ISO installer with the Realtek/Marvell/etc. driver loaded, or buy a different NIC. In most of IT, there’s usually more than one right answer, and a heaping dose of “it depends”. However, for this particular question (Realtek or buy another NIC) there’s only one right answer: Buy another NIC.

Realtek NICs suck. They don’t perform well, they’re a pain to work with for ESXi, so just buy a NIC. The other desktop NICs don’t fare much better. If it’s not recognized by ESXi, it’s a pretty good bet it’s shit.

You can get a one or two port Intel Pro 1000 NIC on eBay for $20-30 USD. These NICs work great. I’ve even replaced the Realtek NIC on my Windows 10 Pro workstation and went from 700 Mbps to fully saturating a gigabit NIC for file transfers. (Make sure they’re Intel Server NICs, the Pro NICs, and not the desktop NICs.)

For $20-30 additional, you can install ESXi on just about any desktop or workstation hardware with the standard ESXi installer. I’m sure there are edge cases, but for me desktop/workstation plus Intel Pro NIC has worked fine.

Certification Exam Questions That I Hate

March 12, 2019

In my 11-year career as an IT instructor, I’ve had to pass a lot of certification exams. In many cases not on the first try. Sometimes for fair reasons, and sometimes, it feels, for unfair reasons. Recently I had to take the venerable Cisco CCNA R&S exam again. For various reasons I’d allowed it to expire, and hadn’t taken many exams for a while. But recently I needed to re-certify with it, which reminded me of the whole process.

Having taken so many exams (50+ in the past 11 years) I’ve developed some opinions on the style and content of exams.

In particular, I’ve identified some types of questions I utterly loathe for their lack of aptitude measurement, uselessness, and overall jackassery. Plus, a couple of styles that I like.

These criticisms apply to all certification exams, from various vendors, and aren’t even limited to IT.

To Certify, Or Not To Certify

The question of the usefulness of certification is not new.

On one hand, you have a need to weed out the know-its from the know-it-nots, a way to effectively measure a person’s aptitude in a given subject. A certification exam, in its purest form, is meant to probe the knowledge of the applicant.

On the other hand, you have an army of test-dumping dullards, passing exams and unable to explain even basic concepts. That results in a cat-and-mouse game between the exam creators and the dump sites.

And mixed in, you have a barrage of badly formed questions that are more appropriate to your local pub’s trivia night than to a professional aptitude measurement.

So in this article I’m going to discuss the type of questions I despise. Not just because they’re hard, but because I can’t see how they accurately or fairly judge a person’s aptitude.

Note: I made all of these questions up. As far as I know, they do not appear on any certification exam from any vendor. This is not a test-dump.

Pedantic Trivia

The story goes that Albert Einstein was once asked how many feet are in a mile. His response was this: “I don’t know, why should I fill my brain with facts I can find in two minutes in any standard reference book?”

I really relate to Einstein here (we’re practically twinsies). So many exam questions I’ve sat through were pure pedantic trivia. The knowledge of the answer had no bearing on the aptitude of the applicant.

Here’s an example, similar to ones I recall on various exams:

What is the order of ink cartridges in your printer? Choose one.

A: Black, Magenta, Cyan, Yellow

B: Yellow, Cyan, Magenta, Black

C: Magenta, Cyan, Black, Yellow

Assuming you have a printer with color cartridges, can you remember the order they go in? Do you care? Does it matter? Chances are there’s a diagram to tell you where to put them.

Some facts are so obscure they’re not worth knowing. That’s why reference sources are there.

I can even make the argument about certain details about regularly used aspects of your job. Take VRRP for example. For network administrators, VRRP and similar are a way to have two or more routers available to answer to a single IP address, increasing availability. This is a fundamental networking concept, one that any network administrator should know.

VRRP uses a concept known as a vMAC. This is a MAC address that sits with the floating IP address, together making a virtual router that can move between physical routers.

So far, everything about what I’ve described about VRRP (and much more that I haven’t) would be fair game for test questions. But a question that I think is useless is the following:

The vMAC for VRRP is (where XX is the virtual router ID):

A: 00:01:5A:01:00:XX

B: 00:00:5A:01:00:XX

C: 00:01:5E:00:FF:XX

D: 00:00:5E:00:01:XX

I’m willing to bet that if you ask 10 good CCIEs what the vMAC address of a VRRP router is, none would be able to recite it. Knowledge of this address has no bearing on your ability to administer a network. How VRRP works is important to understand, but this minutia is useless.

I have two theories where these questions come from.

Theory 1: I’ve written test questions (for chapter review, I don’t think I’ve written actual certification questions) and I know it’s difficult to come up with good questions. Test banks are often in the hundreds, and it can be a slog to make enough. Trivia questions are easy to come up with and easy to verify.

Theory 2: Test dumpers. In the cat and mouse game between test writers and test dumpers, vendors might feel the need to up the difficulty level because pass rates get too high (which I think only hurts the honest people).

Exact Commands

Another one I really despise is when a question asks you for the exact command to do something. For example:

Which command will send the contents of one directory to a remote server using SSH?

A: tar -cvf - directory | ssh root@192.168.10.10 "cd /home/user/; tar -xvf -"

B: tar -xvf - directory | ssh root@192.168.10.10 "cd /home/user/; tar -xvf -"

C: tar -cvf - directory > ssh root@192.168.10.10 "cd /home/user/; tar -cvf -"

D: ssh root@192.168.10.10 "cd /home/user/ tar -xvf -" > tar -xvf directory

For common tasks, such as deleting files, that’s probably fair game (though not terribly useful). Most CLIs (IOS, Bash, PowerShell) have tab completion, help, etc., so that any command syntax can be looked up. Complex pipes like the one above are the kind I use with some regularity, but I often have to look them up.
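For anyone who wants to poke at that kind of pipe without a remote server, here is a minimal local sketch of the same pattern: one tar writes an archive to stdout, and a second tar unpacks it from stdin in another directory. copy_tree is a made-up helper name, and the subshell stands in for the ssh session.

```shell
#!/bin/sh
# copy_tree <parent dir> <directory name> <destination dir>
# Hypothetical local stand-in for the tar-over-SSH pipe.
copy_tree() {
  # Left side: create (-c) an archive of the directory and write it to stdout (-f -).
  # Right side: change into the destination and extract (-x) from stdin (-f -).
  tar -C "$1" -cf - "$2" | ( cd "$3" && tar -xf - )
}

# Example: copy_tree /home/user videos /mnt/backup
```

Over SSH, the right-hand side becomes the quoted remote command, which is the part that is easy to get subtly wrong from memory.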

The Unclear Questions

I see these in certification tests all the time. It’ll be a question like the following:

What are some of the benefits of a pleasant, warm, sunny day? (Choose Three)

A: Vitamin D from sunlight
B: Ability to have a picnic in a park
C: No need for adverse weather clothing
D: Generally improves most people’s disposition

Look at those answers. You could make an argument for any of the four, though the question is looking for three. They’re all pretty correct. Reasonable people, even intelligent, experienced people, can disagree on what the correct answer is.

Questions I Do Like

I try not to complain about something if I don’t have something positive to contribute. So here’s my contribution: These are test questions that I think are more than fair. If I don’t know the answers to these types of questions, I deserve, in every sense of fairness, to get the question wrong.

Scenario Questions

A scenario question is something like this: “Given X, what would happen”.

For example, if a BPDU was received on a PortFast-enabled interface, what would happen?

If a host with an IP netmask combo of 192.168.1.10/24 was to try to communicate with a host configured on the same Layer 2 segment with an IP address of 192.168.1.119/25, would they be able to communicate?

I like those types of questions because they test your understanding of how things work. That’s far more important for determining competency I think.
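The netmask question is a good one precisely because you can reason it out. Here is a rough sketch of the check each host effectively makes ("using my own netmask, is that other address on my network?"). The helper names are made up, and the sketch ignores ARP, routing tables, and everything else a real host does.

```shell
#!/bin/sh
# Hypothetical helpers sketching the "is that address on my subnet?" check.

# ip_to_int <dotted quad>: pack the four octets into one integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# network_of <ip> <prefix length>: mask off the host bits.
network_of() (
  mask=$(( (0xFFFFFFFF << (32 - $2)) & 0xFFFFFFFF ))
  echo $(( $(ip_to_int "$1") & mask ))
)

# on_link <my ip> <my prefix length> <other ip>: true if, using MY netmask,
# the other address falls inside my network.
on_link() {
  [ "$(network_of "$1" "$2")" -eq "$(network_of "$3" "$2")" ]
}

on_link 192.168.1.10 24 192.168.1.119 && echo "the /24 host sees .119 as local"
on_link 192.168.1.119 25 192.168.1.10 && echo "the /25 host sees .10 as local"
```

In this sketch both checks succeed, because .10 and .119 both sit in the lower half of the /24. Move the first host to something like 192.168.1.200 and the /25 host would no longer consider it local.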

There are some network basics that might seem like trivia but are important to know. For example:

What is the order of a TCP handshake?

A: ACK, SYN/ACK, SYN

B: SYN, SYN/ACK, ACK

C: SYN, ACK/SYN, SYN

D: ACK, ACK/SYN, SYN

This question is fundamental to the operations of networks, and I would hope any respectable network engineer would know this. This would be important for TCP dump analysis, and other fundamental troubleshooting.

Conclusion

If you write test questions, ask yourself: Would the best people doing what this question tests get this answer right? Is it overly pedantic? Is there a clear answer?

This was mostly written as a frustration piece. But I think I’m not alone in this frustration.

A Discussion On Storage Overhead

January 27, 2019

Let’s talk about transmission overhead.

For various types of communications protocols, ranging from Ethernet to Fibre Channel to SATA to PCIe, there are typically additional bits that are transmitted to help with error correction, error detection, and/or clock sync. These additional bits eat up some of the bandwidth, and are referred to generally as just “the overhead”.

1 Gigabit Ethernet, 8 Gigabit Fibre Channel, and SATA I, II, and III all use 8/10 encoding, which means that for every eight bits of data, an additional two bits are sent.

The difference is who pays for those extra bits. With Ethernet, Ethernet pays. With Fibre Channel and SATA, the user pays.

1 Gigabit Ethernet has a raw transmit rate of 1 gigabit per second. However, the actual transmission rate (baud, the rate at which raw 1s and 0s are transmitted) for Gigabit Ethernet is 1.25 gigabaud. This is to make up for the 8/10 overhead.

SATA and Fibre Channel, however, do not up the baud rate to accommodate the 8/10 overhead. As such, even though 1,000 Megabits / 8 bits per byte = 125 MB/s, Gigabit Fibre Channel only provides 100 MB/s; 25 MB/s is eaten up by the extra 2 bits in the encoding. The same is true for SATA. SATA III is capable of transmitting at 6 Gigabits per second, which is 750 MB/s. However, 150 MB/s of that is eaten up by the extra 2 bits, so SATA III can transmit 600 MB/s instead.
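The arithmetic is the same in every case, so here is a small sketch of it. payload_mbs is a made-up helper: give it the rate actually on the wire in Mbit/s, and it returns the usable MB/s after 8/10 encoding.

```shell
#!/bin/sh
# payload_mbs <line rate in Mbit/s>
# Hypothetical helper: 8 of every 10 transmitted bits are data,
# and there are 8 bits per byte.
payload_mbs() {
  echo $(( $1 * 8 / 10 / 8 ))
}

echo "SATA I:           $(payload_mbs 1500) MB/s"
echo "SATA II:          $(payload_mbs 3000) MB/s"
echo "SATA III:         $(payload_mbs 6000) MB/s"
# Gigabit Ethernet signals at 1.25 gigabaud, so the user still gets 125 MB/s.
echo "Gigabit Ethernet: $(payload_mbs 1250) MB/s"
```

Feeding in 1250 for Gigabit Ethernet is the "Ethernet pays" case: the line runs faster so the advertised rate survives the encoding.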

PAM 4

There’s a new type of raw data transmission hitting the networking world called PAM 4. Right now it’s used in 400 Gigabit Ethernet. 400 Gigabit Ethernet is 4 channels of 50 Gigabit links. You’ll probably notice the math on that doesn’t check out: 4 x 50 = 200, not 400. That’s where PAM 4 comes in: The signal rate is still 50 gigabaud, but instead of the signal switching between two possible values (0, 1), it switches between 4 possible values (0, 1, 2, 3). Thus, each clock cycle can represent 2 bits of data instead of 1 bit of data, doubling the transmission rate.
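Here is that math as a quick sketch, using the round numbers above. Both function names are made up for illustration; the only real idea is that the number of bits per symbol is log2 of the number of signal levels.

```shell
#!/bin/sh
# bits_per_symbol <signal levels>
# Hypothetical helper: integer log2 by repeated halving
# (2 levels -> 1 bit, 4 levels -> 2 bits).
bits_per_symbol() {
  levels=$1
  bits=0
  while [ "$levels" -gt 1 ]; do
    levels=$(( levels / 2 ))
    bits=$(( bits + 1 ))
  done
  echo "$bits"
}

# link_rate_gbps <lanes> <gigabaud per lane> <signal levels>
link_rate_gbps() {
  echo $(( $1 * $2 * $(bits_per_symbol "$3") ))
}

echo "4 lanes, 50 gigabaud, 2 levels: $(link_rate_gbps 4 50 2) Gbit/s"
echo "4 lanes, 50 gigabaud, 4 levels: $(link_rate_gbps 4 50 4) Gbit/s"
```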

Higher Level Protocol Overhead

For networking storage on Ethernet, there’s also additional overhead for IP, TCP/UDP, and possibly others (VXLAN for example). In my next article, I’ll talk about why they don’t really matter that much.

A Primer for Home NAS Storage Speed Units and Abbreviations

January 26, 2019

One of the most common points of confusion I see with regard to storage is how speed is measured.

In tech, there are some cultural conventions about which units speeds are discussed in.

In the networking world, we measure bits per second
In the storage and server world, we measure speed in bytes per second

Of course they both say the same thing, just in different units. You could measure bytes per second in the networking world and bits per second in the server/storage world, but it’s not the “native” method and could add to confusion.

For NAS, we have a bit of a conundrum in that we’re talking about both worlds. So it’s important to communicate effectively which method you’re using to measure speed: bits or bytes.

Generally speaking, if you want to talk about Bytes, you capitalize the B. If you want to talk about bits, the b is lower case. I.e. 100 MB/s (100 Megabytes per second) and 100 Mbit or Mb (100 Megabit per second).

This is important: because there are 8 bits in a byte, the difference in speed is pretty stark depending on whether you’re talking about bits per second or bytes per second. Examples:

200 Mb/s is written to mean 200 Megabits per second
200 MB/s is written to mean 200 Megabytes per second

Again, the speed difference is pretty stark:

200 Mb/s (Megabits per second, about 1/5th of the total rate available on Gigabit Ethernet) = 25 Megabytes per second
200 MB/s (Megabytes per second, almost double what a Gigabit Ethernet link could send) = 1.6 Gigabits/second

200 Mb/s easily fits in a Gigabit Ethernet link. 200 MB/s is more than a Gigabit Ethernet link could handle.
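If you want to sanity-check a number, the conversion is just a multiply or divide by 8. A minimal sketch (the two function names are made up; shell arithmetic is integer-only, so odd values get rounded down):

```shell
#!/bin/sh
# Hypothetical helpers: convert between megabits and megabytes per second.
mbit_to_mbyte() { echo $(( $1 / 8 )); }   # Mb/s -> MB/s
mbyte_to_mbit() { echo $(( $1 * 8 )); }   # MB/s -> Mb/s

echo "200 Mb/s    = $(mbit_to_mbyte 200) MB/s"
echo "200 MB/s    = $(mbyte_to_mbit 200) Mb/s"
echo "1,000 Mb/s  = $(mbit_to_mbyte 1000) MB/s (Gigabit Ethernet)"
echo "10,000 Mb/s = $(mbit_to_mbyte 10000) MB/s (10 Gigabit Ethernet)"
```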

Abbreviations

It’s generally acceptable to write bits per second as Xb, Xbit, Xbit/s, and Xbps, where X is the multiplier prefix (Mega, Giga, Tera, etc.)

The following are examples of 1.21 Gigabits per second :

1.21 Gbps
1.21 Gb/s
1.21 Gbit/s

It’s generally acceptable to write bytes per second as XB, XByte, XByte/s, and XBps, where X is the multiplier prefix (Mega, Giga, Tera, etc.)

The following are examples of 1.21 Gigabytes per second:

1.21 GBps (less common)
1.21 GB/s
1.21 GByte/s

A Gigabit Ethernet interface can theoretically handle 125 MB/s (1,000 Mbit / 8 bits per byte = 125). Depending on your NIC, horsepower, and systems, you may or may not be able to reach that. But that’s the theoretical limit for Gigabit Ethernet.

10 Gigabit Ethernet (10GE) can theoretically handle 1250 MB/s (10,000 Mbit / 8 bits per byte).

Binary Multipliers

There’s also KiB (kibibyte) and Kib (kibibit), where kibi is a 1024 multiplier, and not 1,000. MiB (mebibyte), GiB (gibibyte), and TiB (tebibyte) are 1024², 1024³, and 1024⁴, respectively.

The idea is to be native to the binary numbers, rather than multiples of 10 (decimal).

We don’t tend to use those measurements in network or storage transmit/receive rates, but it’s showing up more and more in raw storage measurements.
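This is why a drive sold as "8 TB" shows up as a smaller number in tools that count in binary units. A minimal sketch of the conversion (tb_to_tib_x100 is a made-up helper; it works in hundredths because shell arithmetic has no decimals):

```shell
#!/bin/sh
# Decimal vs. binary multipliers, using shell integer arithmetic.
TB=$(( 1000 * 1000 * 1000 * 1000 ))    # terabyte: 1000^4 bytes
TIB=$(( 1024 * 1024 * 1024 * 1024 ))   # tebibyte: 1024^4 bytes

# tb_to_tib_x100 <size in TB>
# Hypothetical helper: returns the size in TiB multiplied by 100, so two
# decimal places survive integer division (truncated, not rounded).
tb_to_tib_x100 () {
  echo $(( $1 * TB * 100 / TIB ))
}

x=$(tb_to_tib_x100 8)
printf '8 TB = %d.%02d TiB\n' $(( x / 100 )) $(( x % 100 ))
```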

Overhead

SATA I, II, and III are 1.5, 3, and 6 Gigabits/second respectively. They push 150, 300, and 600 MB/s respectively. You’ll probably note that math doesn’t check out: 6 Gigabits/second divided by 8 bits in a byte is 750 MB/s, not 600 MB/s, so where did the extra 150 MB/s go? I’ll cover that in the next article.

Microsoft Storage Spaces Is Hot Garbage For Parity Storage

December 17, 2018

I love parity storage. Whether it’s traditional RAID 5/6, erasure coding, raidz/raidz2, whatever. It gives you redundancy on your data without requiring double the drives that mirroring or mirroring+striping would require.

The drawback is write performance is not as good as mirroring+striping, but for my purposes (lots of video files, cold storage, etc.) parity is perfect.

In my primary storage array, I use double redundancy on my parity, so effectively N+2. I can lose any 2 drives without losing any data.

I had a simple Storage Spaces mirror on my Windows 10 Pro desktop which consisted of (2) 5 TB drives using ReFS. This had four problems:

It was getting close to full
The drives were getting old
ReFS isn’t supported anymore on Windows 10 Pro (need Windows 10 Pro for Workstations)
Dropbox (which I use extensively) is dropping support for ReFS-based file systems.

ReFS had some nice features such as checksumming (though for data checksumming, you had to turn it on), but given the type of data I store on it, the checksumming isn’t that important (longer-lived data is stored either on Dropbox and/or my ZFS array). I do require Dropbox, so back to NTFS it is.

I deal with a lot of large files (video, cold-storage VM virtual disks, ISOs, etc.) and parity storage is great for that. For boot volumes, OS, applications, and other latency-sensitive operations, it’s SSD or NVMe all the way. But the bulk of my storage requirements is, well, bulk storage.

I had a few more drives from the Best Buy Easystore sales (8 TB drive, related to the WD Reds, for about $129 during their most recent sale), so I decided to use three of them and create myself a RAID 5 array. (I know there are objections to RAID 5 these days in favor of RAID 6. While I agree with some of them, they’re not applicable to this workload, so RAID 5 is fine.)

So I’ve got 3 WD Easystore shucked drives. Cool. I’ll create a RAID 5 array.

[Screenshot: the RAID-5 option is grayed out]

Shit. Notice how the RAID-5 section is grayed out? Yeah, somewhere along the line Windows removed the ability to create RAID 5 volumes in their non-server operating systems. Instead Microsoft’s solution is to use the newer Storage Spaces. OK, fine. I’ll use storage spaces. There’s a parity option, so like RAID 5, I can do N+1 (or like RAID 6, N+2, etc.).

I set up a parity storage space (the UI is pretty easy) and gave it a quick test. At first, it started sending at 270 MB/s, then it dropped off a cliff to… 32 MB/s.

That’s it. 32 MB/s. What. The. Eff. I’ve got SD cards that can write faster. My guess is that some OS caching was allowing it to copy at 270 MB/s (the hard drives aren’t capable of 270 MB/s). But the hard drives ARE capable of far more than 32 MB/s. Tom’s Hardware found the Reds capable of 200 MB/s sequential writes. I was able to get 180 MB/s with some file copies on a raw NTFS formatted drive, which is in line with Tom’s Hardware’s conclusion.

Now, I don’t need a whole lot of write performance for this volume. And I pretty much only need it for occasional sequential reads and writes. But 32 MB/s is not enough.

I know what some of you are thinking. “Well Duh, RAID 5/parity is slower for writes because of the XOR calculations”.

I know from experience on similar (and probably slower) drives, that RAID 5 is not that slow, even on spinning disks. The XOR calculations are barely a blip in the processor for even halfway modern systems. I’ve got a Linux MD RAID system, with 5 drives and I can get ~400 MB/s of writes (from a simple dd write test).

While it’s true RAID 5 writes are slower than, say, RAID 10, they’re not that slow. I set up a RAID 5 array on a Windows Server 2016 machine (more on that later) using the exact same drives, and it was able to push 113 MB/s.

It might have been able to do more, but it was limited by the bottleneck of the Ethernet connection (about 125 MB/s) and the built-in Dell NIC. I didn’t have an SSD to install Windows Server 2016 on and had to use an HDD that was slower than the drives the RAID 5 array was built with, so that’s the best I could do. Still, even if that was the maximum, I’ll be perfectly happy with 113 MB/s for sequential writes.

So here’s where I got crafty. The reason I had a Windows 2016 server was that I thought if I created a RAID 5 volume in Windows 2016 (which you can) I could simply import the volume into Windows 10 Pro.

Unfortunately, after a few attempts, I determined that that won’t work.

The volume shows failed and the individual drives show failed as well.

So now I’m stuck with a couple of options:

Fake RAID
Drive mirroring
Parity but suck it up and deal with 32 MB/s
Parity and buy a pair of small SSDs to act as cache to speed up writes
Buy a Hardware RAID Card

Fake Hardware RAID

Early on in my IT career, I’d been fooled by fake RAID. Fake RAID is the feature that many motherboards and inexpensive SATA cards offer: You can set up RAID (0, 1, 5 typically) in the motherboard BIOS.

But here’s the thing: It’s not a dedicated RAID card. The RAID operations are done by the general CPU. It has all the disadvantages of hardware RAID (difficult to troubleshoot, more fragile configurations, very difficult to migrate) and none of the advantages (hardware RAID offloads operations to a dedicated CPU on the RAID card, which fake RAID doesn’t have).

For me, it’s more important to have portability of the drives (just pull disks out of one system and into another). So fake RAID is out.

Drive Mirroring

Having tested drive mirroring performance, it’s definitely a better performing option.

Parity with Sucky Performance

I could just suck it up and deal with 32 MB/s. But I’m not going to. I don’t need SSD/NVMe speeds, but I need something faster than 32 MB/s. I’m often dealing with multi-gigabyte files, and 32 MB/s is a significant hindrance to that.

Parity with SSD Cache

About $50 would get me two 120 GB SSDs. As long as I wasn’t doing a massive copy beyond 120 GB of data, I should get great performance. For my given workload of bulk storage (infrequent reads/writes, mostly sequential in nature) this should be fine. The initial copy of my old mirrored array is going to take a while, but that’s OK.

The trick with an SSD cache is that you have to use PowerShell in order to configure it. The Windows 10 GUI doesn’t allow it.

After some fiddling, I was able to get a Storage Space going with SSD cache.

And… the performance was worse than with the drives by themselves. Testing the drives individually, I found that the SSDs had worse sequential performance than the spinning rust. I’d assumed the SSDs would do better, a silly assumption now that I think about it. At least I’m out only $50, and I can probably re-purpose them for something else.

The performance for random I/O is probably better, but that’s not what my workload is on these drives. My primary need is sequential performance for this volume.

Buy A Hardware RAID Card

I don’t like hardware RAID cards. They’re expensive, the software to manage them tends to be really awful, and they make portability of drives a problem. With software RAID, I can pull drives out of one system and put them into another, and voila, the volume is there. That can be done with a hardware RAID card, but it’s trickier.

The performance benefit that they provide is just about gone too, given how fast modern CPUs are and how many cores they have, compared to the relatively slow CPUs on hardware RAID cards (typically less than a GHz, and only one or two cores).

Conclusion

So in the end, I’m going with a mirrored pair of 8 TB drives, and I have two more drives I can add when I want to bring the volume to 16 TB.

Thoughts On Why Storage Spaces Parity Is Such Hot Fucking Garbage

There’s a pervasive thought in IT that parity storage is very slow unless you have a dedicated RAID card. While probably true at one time, much like the jumbo frame myth, it’s no longer true. A halfway modern CPU is capable of dozens of gigabytes per second of RAID 5/6 or whatever parity/erasure coding. If you’re just doing a couple hundred megabytes per second, it’s barely a blip on the CPU.
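To show how little work single parity actually is, here is a toy sketch of the XOR math behind RAID 5 style parity. It uses one byte per "drive" and made-up function names; real implementations work on whole stripes and rotate where the parity lives, and none of this reflects how Storage Spaces is actually implemented.

```shell
#!/bin/sh
# Toy single-parity example: two data "blocks" (one byte each) plus parity.

# xor_parity <block a> <block b>: parity is just the XOR of the data blocks.
xor_parity() {
  echo $(( $1 ^ $2 ))
}

# rebuild_block <parity> <surviving block>: XOR the survivors to get the
# missing block back.
rebuild_block() {
  echo $(( $1 ^ $2 ))
}

d1=$(( 0xA5 ))
d2=$(( 0x3C ))
parity=$(xor_parity "$d1" "$d2")

# Pretend the drive holding d1 died.
rebuilt=$(rebuild_block "$parity" "$d2")
echo "original d1: $d1, rebuilt d1: $rebuilt"
```

One XOR per block written and one per block rebuilt is why the CPU cost is so small.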

It’s the reason huge honking storage arrays (EMC, Dell, NetApp, VMware VSAN etc.) don’t do RAID cards. They just (for the most part) throw x86 cores at it through either scale-up or scale-out controllers.

So why does Storage Space parity suck so bad? I’m not sure. It’s got to be an implementation problem. It’s definitely not a CPU bottleneck. It’s a shame too, because it’s very easy to manage and more flexible than traditional software RAID.

(way)TL;DR

Tried parity in storage spaces. It sucked bigtime. Tried other shit, didn’t work. Just went with mirrored.

ZFS on Linux with Encryption Part 2: The Compiling

December 17, 2017

First off: Warning. I don’t know what the stability of this feature is. It’s been in the code for a couple of months, but it hasn’t been widely used. I’ve been testing it, and so far it’s worked as expected.

In exploring native encryption, I attempted to get it on Linux/ZFS using the instructions on this site: https://blog.heckel.xyz/2017/01/08/zfs-encryption-openzfs-zfs-on-linux/. While I’m sure they worked at the time, the code in the referenced non-standard repos has changed and I couldn’t get anything to compile correctly.

After trying for about a day, I realized (later than I care to admit) that I should have just tried the standard repos. They worked like a charm. The instructions below compiled and successfully installed ZFS on Linux with dataset encryption on both Ubuntu 17.10 and CentOS 7.4 in the November/December 2017 time frame.

Compiling ZFS with Native Encryption

The first step is to make sure a development environment is installed on your Linux system. Make sure you have compiler packages, etc. installed. Here are a few packages for CentOS you’ll need (you’ll need similar packages/libraries for whatever platform you run).

openssl-devel
attr, libattr-devel
libblkid-devel
zlib-devel
libuuid-devel

The builds were pretty good at telling you what packages you needed if they were missing, so of course install any that are requested.

You’ll need to build the SPL code and the ZFS code.

First, build the SPL code.

git clone https://github.com/zfsonlinux/spl
cd spl
./autogen.sh
./configure
make
make install

Then the ZFS code:

git clone https://github.com/zfsonlinux/zfs
cd zfs
./autogen.sh
./configure --prefix=/usr  # <-- This puts the binaries in /usr/sbin instead of /usr/local/sbin
make
make install

If you try the zfs command right away, you’ll probably get something similar to the following:

/sbin/zfs: error while loading shared libraries: libnvpair.so.1: cannot open shared object file: No such file or directory

Running ldconfig usually fixes that.

You might need to modprobe zfs to get the modules loaded, especially if you end up rebooting. There are, of course, ways to auto-load the modules depending on your distribution.
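As one example of auto-loading, here is a sketch that assumes a systemd-based distribution, where files in /etc/modules-load.d/ list modules to load at boot. The enable_module_autoload function is made up for illustration, and I haven’t verified this against every distribution mentioned here, so treat it as a starting point.

```shell
#!/bin/sh
# enable_module_autoload <module name> [config directory]
# Hypothetical helper. Assumes systemd-modules-load reads *.conf files from
# /etc/modules-load.d at boot and loads each module named inside them.
enable_module_autoload() {
  dir=${2:-/etc/modules-load.d}
  # One module name per line; the file name itself is arbitrary.
  printf '%s\n' "$1" > "$dir/$1.conf"
}

# As root:
# enable_module_autoload zfs
```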

Creating the zpool

zpool create -o ashift=12 storage raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

The -o ashift=12 is important if you have 4K sector drives, which these 8 TB WD Reds are. If you don’t throw that option in, your performance will suffer, big-time. Without it, I found my pool performed at about 25% of what it did when ashift=12 was selected.

I was doing copy tests with Samba and getting only 25-30 MB/s. Once I destroyed the zpool and used ashift=12 for a new zpool on the same drives, I was able to get ~120 MB/s, which is the practical limit for a 1 Gigabit link (1,000 Mbit / 8 = 125 MB/s). Local copies were faster. Figure this out ahead of time, because to set the ashift you have to do zpool destroy, which does what it sounds like it does: Destroys the pool (and data).
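Where does 12 come from? ashift is the base-2 logarithm of the sector size, so 4096-byte sectors give 12. A minimal sketch of that calculation (ashift_for_sector_size is a made-up helper, and the sysfs path in the usage comment assumes a Linux system and a drive that reports its physical sector size honestly):

```shell
#!/bin/sh
# ashift_for_sector_size <sector size in bytes>
# Hypothetical helper: integer log2 by repeated halving
# (512 -> 9, 4096 -> 12).
ashift_for_sector_size() {
  size=$1
  shift_bits=0
  while [ "$size" -gt 1 ]; do
    size=$(( size / 2 ))
    shift_bits=$(( shift_bits + 1 ))
  done
  echo "$shift_bits"
}

echo "512-byte sectors:  ashift=$(ashift_for_sector_size 512)"
echo "4096-byte sectors: ashift=$(ashift_for_sector_size 4096)"

# Possible usage on Linux (device name is an example):
# ashift_for_sector_size "$(cat /sys/block/sdb/queue/physical_block_size)"
```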

The zpool will be called “storage” (yes, original) so of course use whatever name you prefer. raidz2 uses a double-parity system, so out of 6 drives, I would get a pool with the space of 4 of them (roughly 32 TBs).
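The capacity math is simple enough to sketch. usable_tb is a made-up helper, and the result is a rough figure: it ignores filesystem overhead and the TB vs. TiB difference.

```shell
#!/bin/sh
# usable_tb <number of drives> <parity drives> <drive size in TB>
# Hypothetical helper: parity costs you that many drives' worth of space.
usable_tb() {
  echo $(( ($1 - $2) * $3 ))
}

echo "raidz2, 6 x 8 TB: $(usable_tb 6 2 8) TB usable"
echo "RAID 5, 3 x 8 TB: $(usable_tb 3 1 8) TB usable"
```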

The rest are the devices themselves. You don’t need to partition the drives; ZFS does it automatically.

Encryption is done on a dataset-by-dataset basis, which is nice because some storage can be encrypted while other parts are not. To create an encrypted dataset, first enable the feature in the zpool.

zpool set feature@encryption=enabled storage

Then create a new dataset under the storage zpool using a passphrase (you can also use a keyfile, but I’m opting for a passphrase):

zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase storage/encrypted

Anything you put in /storage/encrypted/ will now be encrypted at rest.

When the system comes up, the zpool may be imported automatically (or you may have to import it manually), but the storage/encrypted dataset won’t be mounted automatically.

# zpool import storage
# zfs mount -l storage/encrypted
Enter passphrase for 'storage/encrypted':

Once you enter the passphrase, the dataset is mounted.
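If you want to script those two steps, something like the following Python sketch would do it. It only wraps the same zpool import and zfs mount -l commands shown above; the helper names and the dry-run default are my own assumptions, and the passphrase prompt still comes from zfs itself:

```python
import subprocess


def unlock_commands(pool, dataset):
    """Build the argv lists for importing the pool and mounting the encrypted dataset."""
    return [
        ["zpool", "import", pool],
        ["zfs", "mount", "-l", dataset],  # -l loads the key (prompts for the passphrase)
    ]


def unlock(pool, dataset, dry_run=True):
    """Print the commands, or run them (as root) when dry_run is False."""
    for argv in unlock_commands(pool, dataset):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


unlock("storage", "storage/encrypted")  # dry run: just prints the two commands
```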


ZFS and Linux and Encryption Part 1: Raining Hard Drives

December 17, 2017 Leave a comment

(Skip to Part II to learn how to install ZFS with encryption on Linux)

Best Buy has been having a constant series of sales on WD Easystore 8 TB drives. And it turns out, inside many of them (though not all) are WD Red NAS 5400 RPM drives. At $130-180 apiece, that’s significantly less than the regular price on Amazon/Newegg for these drives bare, which is around $250-$275.

(For updates on the sales, check out the subreddit DataHoarder.)


Over the course of several months, I ended up with 6 WD Red NAS 8 TB drives. Which is good, because my current RAID array is starting to show its age, and is also really, really full.

If you’re not familiar with the WD Red NAS drives, they’re specifically built to run 24/7. The regular WD Reds are 5400 RPM, so they’re a bit slower than a regular desktop drive (the Red Pros are 7200 RPM), but that doesn’t really matter for my workload. I use SSDs for speed and these drives for bulk storage. Plus, the slower speeds mean less heat and less power.

My current array is made of (5) 3 TB drives operating at RAID 5 for a total of about 12 TB usable. The drives are about 5 years old now, with one of them already throwing a few errors. It’s a good time to upgrade.

I’ve shucked the 8 TB Reds (shucking is the process of removing the drives from their external cases) and placed the bare drives in a server case.

So now, what to do with them? I decided this was a good time to re-evaluate my storage strategy and compare my options.

My current setup is a pretty common one: It’s a Linux MD (multi-device) array with LVM (Logical Volume Manager) on top and encrypted with LUKS. It’s presented as a single 12 TB block device which has the ext4 file system on top of it.

It works relatively well, though it has a few disadvantages:

It takes a long time to build (days) and presumably a long time to rebuild if a drive fails and is replaced
It’s RAID 5, so if I lose a drive while it’s rebuilding from a previous failure, my data is toast. A common concern for RAID 5.

Here’s what I’d like:

Encryption: Between things like tax documents, customer data, and my Star Trek erotic fan fiction, I want data-at-rest encryption.
Double-parity. I don’t need the speed of RAID 10/0+1, I need space, so that means RAID 5/6 or equivalent. But I don’t want to rely on just one parity drive, so double parity (RAID 6 or equivalent).
Checksumming would be nice, but not necessary. I think the bit-rot argument is a little over-done, but I’m not opposed to it.

So that leaves me with ZFS (on FreeBSD or Linux) or Linux MD. I generally prefer to stick with Linux, but if it’s something like FreeNAS, I’m not opposed to it.

Both ZFS and btrfs offer checksumming; however, the RAID 5/6 parity implementation on btrfs has been deemed unsafe at this point. So if I want parity and checksumming (which I do), ZFS is my only option.

For checksumming to be of any real benefit the file system must control block devices directly. If you put them in a RAID as a single device and lay the checksumming filesystem on top of it, the only thing the checksumming can do is tell you that your files are fucked up. It can’t actually fix them.

Layered: File system on top of encryption on top of MD RAID array

The layered approach above is how my old array was done. It works fine, however it wouldn’t provide any checksumming benefit. Btrfs (or ZFS) would just have a single block device from its perspective, and couldn’t recover a bad checksum from another copy.

(Turns out you can have a single block device and recover from a bad checksum if you set ZFS to make more than one copy of the data, which of course takes more space)
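To make the detect-versus-repair distinction concrete, here is a toy Python model. It is not how ZFS or btrfs actually store or verify blocks; SHA-256 is just a stand-in checksum and read_block is a hypothetical helper. With one copy, a bad checksum can only be reported. With a second copy, the good one can be returned and used to heal the bad one:

```python
import hashlib


def checksum(data):
    """Checksum of a block (SHA-256 used purely as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


def read_block(copies, expected):
    """Return a copy that matches the expected checksum, healing bad copies in place.

    Raises IOError when no copy verifies: corruption is detected but not repairable.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("checksum mismatch and no good copy to repair from")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # overwrite the corrupted copy with the good one
    return good


original = b"tax documents"
expected = checksum(original)

single = [b"tax documentz"]  # the only copy is corrupted
try:
    read_block(single, expected)
except IOError as err:
    print("single copy:", err)

mirrored = [b"tax documentz", original]  # one bad copy, one good copy
print(read_block(mirrored, expected))
print(mirrored[0] == original)  # the bad copy was healed
```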

ZFS encryption in FreeBSD and current ZFS on Linux: ZFS on top of encrypted block devices

ZFS encryption on FreeBSD and current ZFS on Linux is handled via a disk encryption layer, LUKS on Linux and Geli on FreeBSD. The entire drive is encrypted and the encrypted block devices are controlled by ZFS. You can do this with btrfs as well, but again the RAID5/6 problems make it out of the question.

Native encryption with ZFS on Linux

New to ZFS on Linux is native encryption within the file system. You can, on a dataset by dataset basis, set encryption. It’s done natively in the file system, so there’s no need to run a separate LUKS instance.

It would be great if btrfs could do native encryption (and fix the RAID5/6 write hole). In fact, the lack of native encryption has made Red Hat pull btrfs from RHEL.

Part II is how I got ZFS with native encryption working on my file server.


Do We Need Chassis Switches Anymore in the DC?

July 5, 2017 9 Comments

While Cisco Live this year was far more about the campus than the DC, Cisco did announce the Cisco Nexus 9364C, a spine-oriented switch which can run in both ACI mode and NX-OS mode. And it is a monster.

It’s (64) ports of 100 Gigabit. It’s from a single SoC (the Cisco S6400 SoC).

It provides 6.4 Tbps in 2RU, likely running below 700 watts (probably a lot less). I mean, holy shit.


Cisco Nexus 9364C: (64) ports of 100 Gigabit Ethernet.

And Cisco isn’t the only vendor with an upcoming 64 port 100 gigabit switch in a 2RU form factor. Broadcom’s Tomahawk II, successor to their 25/100 Gigabit datacenter SoC, also sports the ability to have (64) 100 Gigabit interfaces. I would expect the usual suspects to announce switches based on these soon (Arista, Cisco Nexus 3K, Juniper, etc.)

And another vendor, Innovium, while far less established, is claiming to have a chip in the works that can do (128) 100 Gigabit interfaces. On a single SoC.

For modern data center fabrics, which rely on leaf/spine Clos-style topologies, do we even need chassis anymore?

For a while we’ve been reliant upon the Sith-rule on our core/aggregation: Always two. A core/aggregation layer is a traditional (or some might say legacy now) style of doing a network. Because of how spanning-tree, MC-LAG, etc., work, we were limited to two. This Core/Aggregation/Access topology is sometimes referred to as the “Christmas Tree” topology.


Traditional “Christmas Tree” Topology

Because we could only have two at the core and/or aggregation layer, it was important that these two devices be highly redundant. Chassis would allow redundancy in critical components, such as fabric modules, line cards, supervisor modules, power supplies, and more.

Fixed switches tend to not have nearly the same redundancies, and as such weren’t often a good choice for that layer. They’re fine for access, but for your hosts’ default gateways, you’d want a chassis.

Leaf/spine Clos topologies, which rely on Layer 3 and ECMP and aren’t restricted the same way Layer 2 spanning-tree and MC-LAG are, are seeing a resurgence after having been banished from the DC because of vMotion.


Leaf/Spine Clos Topology

Modern data center fabrics utilize overlays like VXLAN to provide the layer 2 adjacencies required by vMotion. And again we’re not limited to just two devices on the spine layer: You can have 2, 3, 4... sometimes up to 16 or more depending on the fabric. They don’t have to be an even number, nor do they need to be a power of two now that most switches use a higher than 3-bit hash for ECMP (the 3-bit hash was the origin of the previous powers of 2 rule for LAG/ECMP).
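The powers-of-two rule falls out of simple bucket arithmetic. The Python sketch below assumes hash buckets are handed out to next hops round-robin (bucket number modulo the number of next hops), which is a simplification and not how any particular ASIC is documented to work; bucket_shares is a made-up name:

```python
from collections import Counter


def bucket_shares(hash_bits, next_hops):
    """Fraction of hash buckets that land on each next hop (round-robin assignment)."""
    buckets = 2 ** hash_bits
    counts = Counter(bucket % next_hops for bucket in range(buckets))
    return [counts[hop] / buckets for hop in range(next_hops)]


print(bucket_shares(3, 4))  # 3-bit hash, 4 links: every link gets an equal share
print(bucket_shares(3, 3))  # 3-bit hash, 3 links: one link gets noticeably less
print(bucket_shares(8, 3))  # 8-bit hash, 3 links: the imbalance nearly disappears
```

With only 8 buckets, 3 links split them 3/3/2, so one link carries a quarter of the buckets while the others carry three eighths each. With 256 buckets the split is 86/85/85, close enough to even that the link count no longer has to be a power of two.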

Now we have an option: Do leaf/spine designs concentrate on larger, more port-dense chassis switches for the spine, or do we go with fixed 1, 2, or 4RU spines?

The benefit of a modular chassis is you can throw a lot more ports on them. They also tend to have highly redundant components, such as fans, power supplies, supervisor modules, fabric modules, etc. If any single component fails, the chassis is more likely to keep on working.

They’re also upgradable. Generally you can swap out many of the components, allowing you to move from one network speed to the next generation, without replacing the entire chassis. For example, on the Nexus 9500, you can go from 10/40 Gigabit to 25/100 Gigabit by swapping out the line cards and fabric modules.

However, these upgrades are pretty expensive comparatively. In most cases, fixed spines would be far cheaper to swap out entirely compared to upgrading a modular chassis.

And redundancy can be provided by adding multiple spines. Even 2 spines gives some redundancy, but 3, 4, or more can provide better component redundancy than a chassis.
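A quick way to see the effect of spine count on failure impact, as a Python sketch. It assumes identical spines with traffic spread evenly across them by ECMP, and the function name is only illustrative:

```python
def capacity_after_failures(spines, failed=1):
    """Fraction of spine-layer capacity left after some spines fail."""
    if not 0 <= failed <= spines:
        raise ValueError("failed must be between 0 and the number of spines")
    return (spines - failed) / spines


for n in (2, 3, 4, 8):
    # Losing one spine out of n leaves (n - 1) / n of the fabric capacity.
    print(n, capacity_after_failures(n))
```

Under those assumptions, losing one of two spines halves the fabric, while losing one of four leaves three quarters of it.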

So chassis or fixed? I’m leaning more towards a larger number of fixed switches. It would be more cost effective in just about every scenario I can think of, and still provides the same forwarding capacity as a more expensive chassis configuration.

So yeah, I’m liking the fixed spine route.

What do you think?


Fibre Channel of Things (FCoT)

April 1, 2017 1 Comment

The “Internet of Things” is well underway. There are of course the hilariously bad examples of the technology (follow @internetofshit for some choice picks), but there are many valid ways that IoT infrastructure can be extremely useful. With the networked compute we can crank out for literally pennies and the data they can relay to process, IoT is here to stay.

Miele put a web server in a dishwasher for… reasons https://t.co/AZBccRD2fN

— Internet of Shit (@internetofshit) March 27, 2017

Hacking a dishwasher is the new hacking a gibson

But there’s one thing that these dishwashers, cars, refrigerators, Alexas, etc., all lack: Access to decent storage.

The storage on many IoT devices is either terrible or nonexistent. Unreliable flash storage or no storage at all. That’s why the Fibre Channel T19 working group created a standard for FCoT (Fibre Channel of Things). This gives small devices access to real storage, powered by arrays, not cheap and unreliable local flash storage.

The FCoT suite is a combination of VXSAN and FCIP. VXSAN provides the multi-tenancy and scale to fibre channel networks, and FCIP gives access to the VXSANs from a variety of FCaaS providers over the inferior IP networks (why IoT devices chose IP instead of FC for their primary connectivity, I’ll never know). Any IoT connected device can do a FLOGI to a FCaaS service and get access to a proper block storage. Currently both Amazon Web Services and Microsoft Azure offer FCoT/FCaaS services, with Google expected to announce support by the end of June 2017.

Why FCoT?

Your refrigerator probably doesn’t need access to block storage, but your car probably does. Why? Devices that are sending back telemetry (autonomous cars are said to produce 4 TB per day) need to put that data somewhere, and if that data is to be useful, that storage needs to be reliable. FCaaS provides this by exposing Fibre Channel primitives.

Tiered storage, battery backed-up RAM cache, MLC SSDs, 15K RPM drives, these are all things that FCoT can provide that you can’t get in a mass-produced chip with inexpensive consumer flash storage.

As the IoT plays out, it’s clear that FCoT will be increasingly necessary.

