Do We Need 7200 RPM Drives?

Right now, all of my personal computers (yeah, I have a lot) boot from SSD. I have a MacBook Pro, a MacBook Air, and a Windows 7 workstation, all booting from SSD. And the ESXi host I have will soon have an SSD datastore.

And let me reiterate what I’ve said before: I will never have a computer that boots from spinning rust again. The difference between a computer with an SSD and a computer with an HDD is astounding. You can take even a 3-year-old laptop, put an SSD in it, and for the most part it feels way faster than the latest 17-inch beast running on an HDD.

Yeah yeah yeah, SSD from your cold, dead hands

So why are SSDs so bad-ass? Is it the transfer speeds? No, it’s the IOPS. The transfer speeds of SSDs are a couple of times better than an HDD’s, but the IOPS are orders of magnitude better. And for desktop operating systems (as well as databases), IOPS are where it’s at. Check out this graph (bottom of page) comparing an SSD to several HDDs, some of which run at 15,000 RPM.

As awesome and unicorny as that is, SSD storage still comes at a premium. Even with the spike in prices caused by the tragic flooding in Thailand, SSDs are still significantly more expensive per GB than HDDs. So it doesn’t make sense to make all of our storage SSD. There’s still a need for inexpensive, slow bulk storage, and that’s where HDDs shine.

But now that we have SSDs for speed, 7200 RPM is overkill for our other needs. I just checked my iTunes directory, and it’s 250 GB of data. There’s nothing that MP3 files, HD video files, backups, etc. need in terms of performance that would necessitate a 7200 RPM drive. A 5400 RPM drive will do just fine. You might notice a difference while copying files, but it won’t be that great compared to a 7200 RPM drive. Neither is in any position to flood a SATA2 connection, let alone SATA3.

Even the 5400 RPM drives inside those USB portable hard drives are more than fast enough to flood USB 2.0.

And this got me thinking: How useful are 7200 RPM drives anymore? I remember taking a pair of hard drives back to Fry’s because I realized they were 5400 RPM drives (I wasn’t paying attention). Now, I don’t care about RPMs. Any speed will do for my needs.

Hard drives have been the albatross of computer performance for a while now. This is particularly true for desktop operating systems: They eat up IOPS like candy. A spinning disk is hobbled by the spindle. In data centers you can get around this by adding more and more spindles into some type of array, thereby increasing IOPS.

Enterprise storage is another matter. It’s not likely enterprise SANs will give up spinning rust any time soon. Personally, I’m a huge fan of companies like Pure Storage and SolidFire that have all-SSD solutions. The IOPS you can get from these all-flash arrays is astounding.

The Problem

One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.

It was a theme because, as vendor after vendor gave a presentation, they essentially said the same thing when describing the problem they were going to solve. For us delegates/bloggers, it quickly went from a problem to “The Problem”. We’d heard it so often that during the (5th?) iteration we all started laughing like a group of Beavis and Butt-Heads during a vendor’s presentation, and we had to apologize profusely (it wasn’t their fault, after all).

Huh huhuhuhuhuh… he said “scalability issues”

In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.

Hello my name is Simon, and I like to do draw-wrings

But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.

So what is “The Problem”? It’s a multi-faceted, interrelated set of issues:

Virtualization Has Messed Things Up, Big Time

The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact it’s had on the network is arguably orders of magnitude greater. Virtualization wants big, flat networks, just when we’d gotten to the point where we could route Layer 3 as fast as we could switch Layer 2, and could finally keep our Layer 2 domains small.

And it’s not just virtualization in general; much of the impact comes from the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to make that work. Add to that the vSwitch sitting inside the hypervisor, with its limited functionality (and who the hell manages it anyway? Server team? Network team?).

4000 VLANs Ain’t Enough

If you’re a single enterprise running your own network, chances are 4,000-odd VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, 4,000-odd VLANs quickly become a problem. There’s a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 VLANs (16 million or so).
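Back-of-the-envelope, the ID-space math works out like this (a quick Python sketch; a handful of IDs are reserved, so the usable counts are slightly lower):

```python
# 802.1Q VLAN IDs are a 12-bit field; a VXLAN VNI is 24 bits.
vlan_ids = 2 ** 12       # 4096 (a couple are reserved, so ~4094 usable)
vxlan_vnis = 2 ** 24     # 16,777,216

print(vlan_ids)                 # 4096
print(vxlan_vnis)               # 16777216
print(vxlan_vnis // vlan_ids)   # 4096 -- hence "4096 times 4096"
```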

Spanning Tree Sucks

One of my first introductions to networking was accidentally causing a bridging loop on a 10 megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I’d accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself by plugging in and unplugging the cable to watch the utilization LED fluctuate (that is, until the network admin stormed in and asked what the hell was going on with his network).

And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.

OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.

Layer 2 Switching Isn’t Scaling

The current method by which MAC addresses are learned in modern switches causes two problems: only one viable path can be active at a time (the only way to prevent loops is to block all but one path), and large Layer 2 networks involve so many MAC addresses that the forwarding tables simply don’t scale.

From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another tuple (another recurring word) for a frame/packet/segment traveling from one end of your datacenter to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.

There’s A Lot of Shit To Configure

Automation is coming, and in a big way. Whether it’s a centralized controller environment or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcome, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).

Data center automation is a little bit like the Gom Jabbar. They tried and failed you ask? They tried and died.

“What’s in the box?”

“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”

Ethernet Rules Everything Around Me

It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”

But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.

Oof. Storage.

Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture references in my blog posts. Some companies are pro-iSCSI, others pro FC/NFS, and some, like Greg Ferro, have the purest of all hate: he hates SCSI.

“Yo iSCSI, I’m really happy for you and imma let you finish, but Fibre Channel is the best storage protocol of all time”

So that’s “The Problem”. And for the most part, the articles on Network Field Day, and the solutions the vendors propose, will be framed around The Problem.

Fsck It, We’ll Do It All With SSDs!

At Tech Field Day 8, we saw presentations from two vendors with all-flash SAN offerings, taking on a storage problem that’s been brewing in data centers for a while now: the skewed performance/capacity scale.

While storage capacity has been increasing exponentially, storage performance hasn’t caught up nearly that fast. In fact, performance has been mostly stagnant, especially in the area where it counts: Latency and IOPS (I/O Operations Per Second).

It’s all about the IOPS, baby

In modern data centers, capacity isn’t so much of an issue with storage. Neither is the traditional throughput metric, such as megabytes per second. What really counts is IOPS and latency/seek time. Don’t get me wrong, some data center applications certainly have capacity requirements, as well as potential throughput requirements, but for the most part these are easily met by today’s technology.

IOPS and latency are super critical for virtual desktops (and desktops in general) and databases. If your computer is sluggish, it’s probably not a lack of RAM or CPU; by and large it’s a lack of IOPS.

There are a few tricks that storage administrators and vendors have up their sleeve to increase IOPS and drop latency.

In a RAID array, you can scale IOPS linearly by just throwing more disks at the array. If you have a drive that does 100 IOPS, add a second drive in a RAID 0 stripe and you’ve got double the IOPS. Add a third and you’ve got 300 IOPS (and of course you’ll add more drives for redundancy).
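As a rough sketch of that scaling (rule-of-thumb numbers only; real-world results depend on the read/write mix and the RAID level’s write penalty):

```python
def estimated_iops(drives, iops_per_drive=100, read_pct=0.7, write_penalty=1):
    """Rule-of-thumb IOPS estimate for an array of identical drives.

    write_penalty: 1 for RAID 0 (striping), 2 for RAID 1/10,
    4 for RAID 5, 6 for RAID 6 -- conventional rough figures.
    """
    raw = drives * iops_per_drive
    write_pct = 1.0 - read_pct
    # Reads can be serviced by any drive; writes cost extra back-end I/Os.
    return raw / (read_pct + write_pct * write_penalty)

print(estimated_iops(1))   # 100.0 -- one drive
print(estimated_iops(2))   # 200.0 -- add a spindle, double the IOPS
print(estimated_iops(3))   # 300.0
```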

Another trick that storage administrators have up their sleeve is the technique known as “short stroking”, where only a portion of the drive is used. On a spinning platter, the outside is moving the fastest, giving the best performance. If you only format that outer portion, the physical drive head doesn’t have to travel as far. This can reduce seek time substantially.
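Here’s a toy model of why that helps (purely illustrative; it ignores that seek time isn’t linear in distance and that the outer tracks also hold more data per revolution):

```python
import random

def avg_seek_fraction(stroke_fraction, samples=100_000):
    """Average head travel, as a fraction of a full stroke, for uniformly
    random requests confined to the outer `stroke_fraction` of the platter."""
    total = 0.0
    for _ in range(samples):
        a = random.uniform(0, stroke_fraction)
        b = random.uniform(0, stroke_fraction)
        total += abs(a - b)
    return total / samples

print(round(avg_seek_fraction(1.0), 3))   # ~0.333 -- whole disk in use
print(round(avg_seek_fraction(0.2), 3))   # ~0.067 -- short-stroked to 20%
```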

Tiered storage can help with both latency and IOPS, where a combination of NVRAM, SSDs, and hard drives is used. “Hot” data is accessed from a high-speed RAM cache, “warm” data sits on a bank of SSDs, and “cold” data is stored on cheaper SAS or (increasingly) consumer SATA drives.
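Conceptually, the placement logic is something like this (a grossly simplified sketch with invented thresholds; real arrays track heat per block and migrate data in the background):

```python
def pick_tier(accesses_last_hour):
    """Toy tier-placement decision: hot data to RAM/NVRAM cache,
    warm data to SSD, cold data to cheap bulk SATA."""
    if accesses_last_hour > 1000:      # invented threshold
        return "NVRAM/RAM cache"
    elif accesses_last_hour > 10:      # invented threshold
        return "SSD tier"
    return "SATA bulk tier"

print(pick_tier(5000))   # NVRAM/RAM cache
print(pick_tier(50))     # SSD tier
print(pick_tier(1))      # SATA bulk tier
```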

And still our demand for IOPS is insatiable, and in some cases the tricks aren’t keeping up. Short stroking only goes so far, and cache misses can really hurt performance in tiered storage. While IOPS scale linearly, getting the IOPS we need can mean racks full of spinning rust while using only a tenth of the actual capacity. That’s a lot of wasted space and wasted power.

And want to hear a depressing fact?

A high-end enterprise SAS 15,000 RPM drive (which spins faster than most jet engines) gives you about 150 IOPS (depending on the workload, of course). A good consumer-grade SSD from Newegg gives you around 85,000 IOPS. That means you would need almost 600 spinning drives to equal the performance of one consumer-grade SSD.
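The arithmetic behind that figure, using the rough numbers above:

```python
hdd_iops = 150       # high-end 15K RPM SAS drive, roughly
ssd_iops = 85_000    # decent consumer SSD, roughly

print(ssd_iops / hdd_iops)   # ~567 spinning drives to match one SSD
```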

That’s enough to cause anyone to have a Bill O’Reilly moment.

600 drives? Fuck it, we’ll do it with all flash!

No one is going to put their entire database or virtual desktop infrastructure on a single flash drive of course. And that’s where vendors like Pure Storage and SolidFire come into play. (You can see Pure Storage’s presentation at Tech Field Day 8 here. SolidFire’s can be seen here.)

The overall premise of the we’ll-do-it-all-in-flash play is that you can take a lot of consumer-grade flash drives, use the shitload of IOPS that they bring, and combine it with a lot of storage controller CPU power for deduplication and compression. With that combination, they can offer an all-flash array at roughly the same price per gig as traditional arrays composed of spinning rust (disk drives).

How many IOPS are we talking about? SolidFire’s SF3010 claims 50,000 IOPS per 1 RU node. That would replace over 300 traditional drives, which I don’t think you can fit in 1 RU. Pure Storage claims 300,000 IOPS in 8 RU of space; with a traditional array you’d need around 2,000 drives, also unlikely to fit in 8 RU. Also, imagine the power savings, with only 250 watts needed for SolidFire’s node and 1,300 watts for the Pure Storage cluster. And both allow you to scale up by adding more nodes.
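Using the vendor-claimed numbers above, plus an assumed ~10 watts per spinning enterprise drive, the watts-per-IOPS comparison is stark:

```python
# Vendor-claimed figures from the paragraph above -- treat as marketing numbers.
arrays = {
    "SolidFire SF3010 (1 RU)": {"iops": 50_000, "watts": 250},
    "Pure Storage (8 RU)":     {"iops": 300_000, "watts": 1300},
}
# A 15K SAS drive: ~150 IOPS and (assumption) roughly 10 W while spinning.
arrays["Single 15K SAS drive"] = {"iops": 150, "watts": 10}

for name, a in arrays.items():
    print(name, round(a["watts"] / a["iops"] * 1000, 1), "watts per 1,000 IOPS")
# SolidFire: ~5 W, Pure Storage: ~4.3 W, 15K SAS: ~66.7 W per 1,000 IOPS
```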

You wire them into your SAN the traditional ways, as well. The Pure Storage solution has options for 10 Gbit iSCSI and 8 Gbit Fibre Channel, while the SolidFire solution is iSCSI only. (Sadly, neither support FCoE or FCoTR.)

For organizations that are doing virtual desktops or databases, an all-flash storage array with the power savings and monster IOPS must look more tantalizing than a starship full of green-skinned girls does to Captain Kirk.

There is a bit of a controversy in that many of the all-flash vendors quote capacity numbers with deduplication and compression taken into account. At the same time, if the performance is better than spinning rust even with the compression/dedupe overhead, then who cares?

So SSD it is. And as anyone who has an SSD in their laptop or desktop will tell you, that shit is choice. Seriously, I get all Charlton Heston about my SSD.

You’ll pry this SSD from my cold, dead hands

It’s not all roses and unicorn-powered SSDs. There are two issues with the all-flash solutions thus far. One is that these vendors don’t have a name like NetApp, EMC, or Fujitsu, so there’s a bit of a trust issue there. The other is that many people have negative preconceptions about flash, such as high failure rates (due to a series of bad firmware releases from vendors) and the limited write cycles of the memory cells (true, but mitigatable). Pure Storage claims to have never had a drive fail on them (Amy called them flash drive whisperers).

Still though, check them (and any other all-SSD vendor) out. This is clearly the future in terms of high performance storage where IOPS is needed. Spinning rust will probably rule the capacity play for a while, but you have to imagine its days are numbered.

FCoE: I’m not Dead! Arista: You’ll Be Stone Dead in a Moment!

I was at Arista on Friday for Tech Field Day 8, and when FCoE was brought up (always a good way to get a lively discussion going), Andre Pech from Arista (who did a fantastic job as a presenter) brought up an article written by Douglas Gourlay, another Arista employee, entitled “Why FCoE is Dead, But Not Buried Yet”.

FCoE: “I feel happy!”

It’s an interesting article, because much of the player-hating seems to be directed at TRILL, not FCoE, and as J Metz has said time and time again, you don’t need TRILL to do FCoE if you do FCoE the way Cisco does (by using Fibre Channel Forwarders in each FCoE switch). Arista, not having any Fibre Channel skills, can’t do it this way. If they were to do FCoE, Arista (like Juniper) would need to do it the sparse-mode/FIP-snooping way, which would need a non-STP way of handling multi-pathing such as TRILL or SPB.

Jayshree Ullal, the CEO of Arista, hated on TRILL and spoke highly of VXLAN and NVGRE (Arista is on the standards body for both). I think part of that is that, like Cisco, not all of their switches will be able to support TRILL, since TRILL requires new Ethernet silicon.

Even the CEO of Arista acknowledged that FCoE works great at the edge, where you plug a server with an FCoE CNA into an FCoE switch, and the traffic is sent along to native Ethernet and native Fibre Channel networks from there (what I call single-hop or no-hop FCoE). This doesn’t require any additional FCoE infrastructure in your environment, just the edge switch. The Cisco UCS Fabric Interconnects are a great example of this no-hop architecture.

I don’t think FCoE is quite dead, but I have to imagine it’s not going as well as vendors like Cisco had hoped. At least, it hasn’t been the success that some vendors imagined. And I think there are two major contributors to FCoE’s failure to launch, and both of them are more Layer 8 than Layer 2.

Old Man of the Data Center

Reason number one is also the reason why we won’t see TRILL/Fabric Path deployed widely: It’s this guy:

Don’t let him trap you into hearing him tell stories about being a FDDI bridge, whatever FDDI is

The Catalyst 6500 series switch. This is “The Old Man of the Data Center”. And he’s everywhere. The switch is a bit long in the tooth, and although capacity is much higher on the Nexus 7000s (and even the 5000s in some cases), the Catalyst 6500 still has a huge install base.

And it won’t ever do FCoE.

And it (probably) won’t ever do TRILL/Fabric Path (spanning-tree fo-evah!)

The 6500s aren’t getting replaced in significant numbers from what I can see. Especially with the release of the Sup 2T supervisor, the 6500s aren’t going anywhere anytime soon. I can only speculate as to why Cisco keeps pushing the 6500 so hard, but I think it comes down to two things: the sheer size of that install base, and the fact that the Nexus 7000 isn’t a full-on replacement.

The Nexus 7000 has no service modules, limited routing capability (it only recently gained the ability to do MPLS), and a form factor that’s much larger than the 6500’s (although the 7009 just hit the streets with a very similar form factor, which begs the question: why didn’t Cisco release the 7009 first?).

Premature FCoE

So reason number two? I think Cisco jumped the gun. They’ve been pushing FCoE for a while, but they weren’t quite ready. It wasn’t until July 2011 that Cisco released NX-OS 5.2, which is what’s required to do multi-hop FCoE on the Nexus 7000s and MDS 9000s. They’ve had the ability to do multi-hop FCoE on the Nexus 5000s for a bit longer, but not much. Yet they’ve been talking about multi-hop for longer than it was possible to actually implement it. Cisco has had a multi-hop FCoE reference architecture posted on their website since March 2011, showing a beautifully designed multi-hop FCoE network with 5000s, 7000s, and MDS 9000s that for months wasn’t possible to implement. Even today, if you wanted to implement multi-hop FCoE with Cisco gear (or anyone else’s), you’d be a very, very early adopter.

So no, I don’t think FCoE is dead. No-hop FCoE is certainly successful (even Arista’s CEO acknowledged as much), and I don’t think even multi-hop FCoE is dead, but it certainly hasn’t caught on (yet). Will multi-hop FCoE catch on? I’m not sure. We’ll have to see.

It’s All About The IOPS, Baby

“It’s all about the benjamins, baby” – Puff Daddy

One of the new terms network administrators are starting to wrap our brains around is IOPS, or I/O Operations Per Second. Traditionally, IOPS have been of the utmost importance to database administrators, but they haven’t shown up on our radar as network admins.

So what is an IOP? It’s a transaction with the disk: how long does it take a storage client to send a command for a storage operation (a read or a write) and receive the data or confirmation that the data has been written? Applications like databases, desktops, and virtual desktops are fairly sensitive to IOPS. You know your laptop is suffering from deficient IOPS when you hear the hard drive grinding away because you opened too many VMs or too many apps simultaneously.

Traditional hard drives (spinning rust, as Ivan Pepelnjak likes to call them) have varying degrees of IOPS. Your average 7200 RPM SATA drive can perform roughly 80 IOPS. A really fast 15,000 RPM SAS drive can do around 180 IOPS (figures are from Wikipedia, and need citation). Of course, it all depends on how you measure an I/O operation (reads, writes, how big the read or write is), but those are general figures to keep in mind. One I/O operation may take place on one part of a disk, and the next may take place on a completely different part.

That may not sound like a lot, and you’re right, it isn’t. For databases and virtual desktops, we tend to need lots of IOPS. IOPS scale almost linearly as you add more disks. Mo spindles, mo IOPS. That’s one of the primary purposes of SANs: to provide access to lots and lots of disks. Some servers will let you cram up to 16 drives in them, although most have space for 4 or 5. A storage array on a SAN can provide access to a LUN comprised of hundreds or even thousands of disks.

Single hard drives are pretty good at sequential reading and writing, which is different from IOPS. If you have a DVR at home, it depends mostly on sequential reading and writing. A 7200 RPM drive can do about 50-60 megabytes per second of sequential reads, less for writes. And that’s the absolute best-case scenario: a single file located in a contiguous space on the platter (so the drive head doesn’t have to bounce around the platter getting bits of data here and there). If the file is discontiguous, or there are lots of smaller files in various areas of the platter, the numbers go down significantly.

What does that mean to network admins? A single 7200 RPM SATA drive can push about 400-500 megabits per second onto your network.
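That’s just the sequential figure converted into network units (a quick sanity check that ignores protocol overhead):

```python
seq_mb_per_sec = 55          # rough best-case sequential read, 7200 RPM SATA
print(seq_mb_per_sec * 8)    # 440 megabits/sec -- call it 400-500 Mbit/s
```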

Bandwidth

In networking, we tend to think in terms of bandwidth.  The more the better.  We also think of things like latency and congestion, but bandwidth is usually on our minds. In storage, it’s not so much bandwidth as IOPS.

Let’s say I tell you that I have two different storage arrays for you to choose from. One array is accessible via a 1 Gbit end-to-end FC connection. The other is accessible via an 8 Gbit end-to-end FC connection. You know nothing of how the arrays are configured, or how many drives are in them. They both present a LUN with a 2 terabyte capacity.

Which one do you choose?

The networkers in us would tend to go for the 8 Gbit FC connected array. After all, 8 Gbit FC is 8 times faster than 1 Gbit FC. There’s no arguing that.

But is it the right choice? Maybe not.

Now you learn a little more about the arrays. The LUN in the 1 Gbit storage array is comprised of 20 drives in a RAID 10 setup, at 100 IOPS per drive. That’s about 1,000 total IOPS.

The LUN in the 8 Gbit FC connected array is comprised of two 2-terabyte drives at 100 IOPS per drive in a RAID 1 (mirrored) configuration. That’s 100 IOPS.

So while the 8 Gbit array has 8 times the bandwidth, the 1 Gbit array has 10 times the IOPS. Also, those two drives aren’t going to be able to push the 800 megabytes per second that 8 Gbit FC would provide. They’d be lucky to push 80 megabytes per second.
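Putting rough numbers on the comparison (a sketch only; the ~55 MB/s per drive figure and the mirroring simplifications are assumptions):

```python
def summarize(name, drives, iops_per_drive, link_gbit):
    # Write IOPS: every write lands on two drives in a mirrored set.
    write_iops = drives * iops_per_drive / 2
    # Sequential throughput: ~55 MB/s per drive, counting only one side of
    # each mirror, capped by the FC link (1 Gbit FC ~ 100 MB/s usable).
    disk_mb_s = drives * 55 / 2
    link_mb_s = link_gbit * 100
    print(f"{name}: ~{write_iops:.0f} write IOPS, "
          f"~{min(disk_mb_s, link_mb_s):.0f} MB/s usable throughput")

summarize("20-drive RAID 10 behind 1 Gbit FC", 20, 100, 1)
summarize("2-drive RAID 1 behind 8 Gbit FC", 2, 100, 8)
# The "slow" 1 Gbit array wins on IOPS by 10x and even pushes more MB/s.
```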

If you’re using those LUNs for databases or virtualization (especially desktop virtualization), you’re going to want IOPS.

Operating systems love IOPS, especially desktop operating systems. Chances are the slowest part of your computer is your spinning rust. When your computer slows to a crawl, you’ll typically hear the hard drive grinding away.

Solid State Drives

SSDs have changed the game in terms of performance. While your highest-performing enterprise spinning disk can do about 180 IOPS, the SSD you get from Best Buy can do over 5,000. (By the way, I highly recommend your next laptop have an SSD.)

There are hybrid solutions that use combinations of SATA drives, SSDs, and battery-backed RAM to provide a good mix of IOPS and economy. There are a lot of other interesting solutions, too, for providing access to lots of IOPS.