I, For One, Welcome Our New OpenFlow Overlords

When I first signed up for Networking Field Day 2 (The Electric Boogaloo), I really had no idea what OpenFlow was. I’d read a few articles, listened to a few podcasts, but still only had a vague idea of what it was. People I respect highly, like Greg Ferro of Packet Pushers, were into it, so it had my attention. But still, not much of a clue what it was. I attended the OpenFlow Symposium, which preceded the activities of Networking Field Day 2, and came away with even less of an idea of what it was.

Then I saw NEC (really? NEC?) do a demonstration. And my mind was blown.

Side note: Let this be a lesson to all vendors. Everything works great in a PowerPoint presentation. It also conveys very little about what a product actually does. Live demonstrations are what get grumpy network admins (and we’re all grumpy) giddy like schoolgirls at a Justin Bieber concert. You should have seen Ivan Pepelnjak.

I’m not sure if I got all my assumptions right about OpenFlow, so feel free to point out if I got something completely bone-headedly wrong. But from what I could gather, OpenFlow could potentially do a lot of things:

  • Replace traditional Layer 2 MAC learning and propagation mechanisms
  • Replace traditional Layer 3 protocols
  • Make policy-based routing (routing based on TCP/UDP port) something useful, instead of the one-off, pain-in-the-ass, OK-just-this-one-time creature it is now
  • Create “traceroute on steroids”

Switching (Layer 2)

Switching is, well, rather stupid. At least learning MAC addresses and their locations is. To forward frames, switches need to learn which ports lead to the various MAC addresses. Right now the only way they learn that is by listening to the cacophony of hosts broadcasting and spewing frames. And when one switch learns a MAC address, it’s not like it tells the others. No, in switching, every switch is on its own for learning. In a single Layer 2 domain, every switch needs to learn where to find every MAC address on its own.

Probably the three biggest consequences of this method are as follows (there’s a toy sketch of the flood-and-learn logic right after the list):

  • No loop avoidance. The only way to prevent loops is to prevent redundant paths (i.e. spanning-tree protocol)
  • Every switch in a Layer 2 domain needs to know every frickin’ MAC address. The larger the Layer 2 domain, the more MAC addresses need to be learned. Suddenly, a CAM table size of 8,000 MAC addresses doesn’t seem quite enough.
  • Flooding like whoa. What happens when a switch gets a frame for a destination it doesn’t have a CAM entry for? FLOOD IT OUT ALL PORTS BUT THE RECEIVING PORT. It’s the all-caps typing of the network world.
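
To make that flood-and-learn behavior concrete, here’s a toy sketch in Python. It’s purely illustrative (no switch ASIC works like a Python dict): the CAM table maps MAC addresses to ports, and any destination the switch hasn’t learned yet gets flooded out every port except the one the frame arrived on.

```python
# Toy flood-and-learn switch. Purely illustrative; no real ASIC works this way.
class ToySwitch:
    def __init__(self, ports):
        self.ports = set(ports)   # physical ports, e.g. {1, 2, 3, 4}
        self.cam = {}             # CAM table: MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        # Learn: remember which port the source MAC lives behind.
        self.cam[src_mac] = in_port

        # Forward: if we already know the destination, send out that one port...
        if dst_mac in self.cam:
            return [self.cam[dst_mac]]

        # ...otherwise flood out every port except the one the frame came in on.
        return sorted(self.ports - {in_port})


sw = ToySwitch(ports=[1, 2, 3, 4])
print(sw.receive(1, "aa:aa", "bb:bb"))  # unknown destination -> flood to [2, 3, 4]
print(sw.receive(2, "bb:bb", "aa:aa"))  # bb:bb learned on port 2; aa:aa known -> [1]
print(sw.receive(1, "aa:aa", "bb:bb"))  # now a known unicast -> [2]
```
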
For a while in the early 2000s we could get away with all this. Multi-layer switches (switches that do Layer 3 routing as well) got fast enough to route as fast as they could switch, so we could easily keep our Layer 2 domains small and just route everything.

That is, until VMware came and screwed it all up. Now we had to have Layer 2 domains much larger than we’d planned for. 4,000 entry CAM tables quickly became cramped.

MAC learning would be more centralized with OpenFlow. ARP would still be there at the edge, so a server would still think it was talking to a regular switched network. But an OpenFlow controller could determine which switches need to know which MAC addresses are where, so every switch doesn’t need to learn everything. (There’s a rough sketch of that idea below.)

And no spanning-tree. Loops are prevented by the OpenFlow controller itself, so there’s no need for spanning-tree (although you can certainly run spanning-tree at the edge to communicate with legacy segments).
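
Here’s a rough sketch of that controller-centric idea. To be clear, this is my mental model, not NEC’s actual implementation: the controller holds a made-up topology, computes a loop-free path between two hosts, and pushes a MAC-match entry to only the switches along that path. The rest of the domain never has to learn (or flood for) those addresses, and since the controller only installs loop-free paths, there’s nothing for spanning-tree to protect against.

```python
from collections import deque

# Made-up topology: switch -> {neighbor: egress port toward that neighbor}.
TOPO = {
    "s1": {"s2": 10, "s3": 11},
    "s2": {"s1": 10, "s3": 11, "s4": 12},
    "s3": {"s1": 10, "s2": 11, "s4": 12},
    "s4": {"s2": 10, "s3": 11},
}
# Made-up host locations: MAC -> (edge switch, edge port).
HOSTS = {"aa:aa": ("s1", 1), "bb:bb": ("s4", 1)}

def shortest_path(topo, src_sw, dst_sw):
    """Plain breadth-first search; a real controller would be fancier."""
    queue, seen = deque([[src_sw]]), {src_sw}
    while queue:
        path = queue.popleft()
        if path[-1] == dst_sw:
            return path
        for neighbor in topo[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

def flows_for(src_mac, dst_mac):
    """Return the (switch, match, out_port) entries a controller might push."""
    src_sw, _ = HOSTS[src_mac]
    dst_sw, dst_port = HOSTS[dst_mac]
    path = shortest_path(TOPO, src_sw, dst_sw)
    entries = []
    for i, sw in enumerate(path):
        out_port = TOPO[sw][path[i + 1]] if i + 1 < len(path) else dst_port
        entries.append((sw, {"dl_dst": dst_mac}, out_port))
    return entries

# Only the switches on the s1 -> s4 path get an entry; s3 never learns,
# floods, or cares where bb:bb lives.
for sw, match, port in flows_for("aa:aa", "bb:bb"):
    print(sw, match, "-> port", port)
```
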

Routing (Layer 3)

Routing isn’t quite as stupid as switching. There are a number of good protocols out there that scale pretty well, but they do require configuration on each device. Routing is dynamic in that it can do multi-pathing (where traditional Layer 2 can’t), and it can recover from dead links without taking down the network for several (dozens of) seconds. But it doesn’t quite allow for centralized control, and it has limited dynamic ability. For instance, there’s no mechanism to say “oh, hey, for right now why don’t we just move all the packets from this source over that path instead” in an efficient way. Sure, you can inject some host routes to fake it, but to do it efficiently it’s got to come from some sort of centralized controller.

Flow Routing (Layer 4)

So why stop at Layer 3? Why not route based on TCP/UDP header information? It can be done with policy-based routing (PBR) today, but it’s not something that can be communicated from router to router (OSPF cares not how you want to direct a TCP port 80 flow versus a TCP port 443 flow). There is also WCCP, the Web Cache Communication Protocol, which today is mostly used not for web caches but for WAN optimization controllers, like Cisco’s WAAS, or Cisco’s sworn enemy, Riverbed (seriously, just say the word ‘Riverbed’ at a Cisco office).

Sure it’s watery and tastes like piss, but at least it’s not policy-based routing

A switch with modern silicon can look at Layer 3 and Layer 4 headers as easily as it can look at Layer 2 headers. It’s all just bits in the flow, man. OpenFlow takes advantage of this and creates, for lack of a cooler term, a Layer 2/3/4 overlord.

I, for one, welcome our new OpenFlow overlords

TCAMs, shared memory, or whatever you want to call the forwarding tables in your multi-layer switches, can be programmed at will by an OpenFlow overlord, instead of being populated by the lame-ass Layer 2, Layer 3, and sometimes Layer 4 mechanisms on a switch-by-switch basis.
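
To show what a “Layer 2/3/4 overlord” gets to play with, here’s a hedged sketch of a flow table: each entry is just a match on whatever header fields you care about plus an action. The field names are loosely modeled on OpenFlow 1.0’s match fields, but the table, ports, and policy below are entirely made up.

```python
# Illustrative flow table: a priority-ordered list of (match, action) pairs.
# Field names loosely follow OpenFlow 1.0 (dl_* = Ethernet, nw_* = IP, tp_* = TCP/UDP),
# but the ports and policy are entirely made up.
FLOW_TABLE = [
    # Redirect plain web traffic to a cache / WAN optimizer.
    ({"nw_proto": 6, "tp_dst": 80},   "output:cache_port"),
    # Encrypted web traffic goes straight out the WAN link.
    ({"nw_proto": 6, "tp_dst": 443},  "output:wan_port"),
    # Anything else headed for this server's MAC takes the normal path.
    ({"dl_dst": "aa:bb:cc:00:00:01"}, "output:core_port"),
]

def lookup(packet):
    """First matching entry wins; a miss would get punted to the controller."""
    for match, action in FLOW_TABLE:
        if all(packet.get(field) == value for field, value in match.items()):
            return action
    return "send-to-controller"

pkt = {"dl_dst": "aa:bb:cc:00:00:01", "nw_proto": 6, "tp_dst": 80}
print(lookup(pkt))  # -> output:cache_port (the L4 rule wins over the plain L2 rule)
```

The point being that a TCP-port-80 match and a plain MAC match live in the same table, evaluated in order, across every switch the controller touches, which is exactly what PBR never gave us network-wide.
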

Since we can direct traffic based on flows throughout a multi-switch network, there’s lots of interesting things we can do with respect to load balancers, firewalls, IPS, caches, etc. Pretty interesting stuff.

Flow View (or Traceroute on Steroids)

I think one of the coolest demonstrations from NEC was when they showed the flow maps. They could punch up any source and destination address (IP or MAC) and get a graphical representation of the flow (and which devices it went through) on the screen. The benefits of that are obvious. Server admins complain about slowness? Trace the flow and check the interfaces on all the transit devices. That’s something that might take quite a while in a regular route/switch network, but can be done in a few seconds with an OpenFlow controller.

An OpenFlow Controller Tracks a Flow
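
Here’s a rough sketch of why this is cheap for a controller: since the controller installed the flow entries in the first place, it already knows which switches a given source/destination pair traverses, so “tracing” the flow is basically a lookup plus a stats poll. All the names and numbers below are invented.

```python
# Hypothetical controller state: which switches it programmed for each flow.
INSTALLED_FLOWS = {
    ("aa:aa", "bb:bb"): ["s1", "s2", "s4"],
}

# Hypothetical per-port counters the controller polls from its switches.
PORT_STATS = {
    ("s1", 10): {"tx_util_pct": 12, "errors": 0},
    ("s2", 12): {"tx_util_pct": 97, "errors": 1043},  # there's your "slowness"
    ("s4", 1):  {"tx_util_pct": 8,  "errors": 0},
}

def trace(src_mac, dst_mac):
    path = INSTALLED_FLOWS.get((src_mac, dst_mac))
    if path is None:
        return "no flow installed between those hosts"
    lines = [f"{sw} port {port}: {stats}"
             for (sw, port), stats in PORT_STATS.items() if sw in path]
    return "\n".join(lines)

print(trace("aa:aa", "bb:bb"))
```
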

To some extent, there are other technologies that can take care of some of these issues. For instance, TRILL and SPB take a good whack at the Layer 2 bullshit. Juniper’s QFabric does a lot of the ain’t-nothin’-but-a-tuple thang and switches based on Layer 2/3 information. But in terms of potential, I think OpenFlow has them all beat.

Don’t get too excited right now, though, as NEC is the only vendor with a working implementation of an OpenFlow controller, and other vendors are still working on theirs. Stanford apparently has OpenFlow up and running in their environment, but it’s all still in the early stages.

Will OpenFlow become the future? Possibly, quite possibly. But even if what we now call OpenFlow isn’t victorious, something like it will be. There’s no denying that this approach, or something similar, is a much better way to handle traffic engineering in the future than our current approach. I’ve only scratched the surface of what can be done with this type of network design. There’s also a lot that can be gained in terms of virtualization (an OpenFlow vSwitch?) as well as applications telling the network what to do. Cool stuff.

Note: As a delegate/blogger, my travel and accommodations were covered by Gestalt IT, whom vendors paid to have spots during Networking Field Day. Vendors pay Gestalt IT to present, so while my travel (hotel, airfare, meals) was covered indirectly by the vendors, no other remuneration (save for the occasional tchotchke) was received from any of the vendors, directly or indirectly, or from Gestalt IT. Vendors were not promised, nor did they ask for, any of us to write about them, or to write about them positively. In fact, we sometimes say their products are shit (when, to be honest, sometimes they are, although this one wasn’t). My time was unpaid.

The Problem

One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.

It was a theme because, as vendor after vendor gave a presentation, they essentially said the same thing when describing the problem they were going to solve. For us delegates/bloggers, it quickly went from the problem to “The Problem”. We’d heard it so often that during the (5th?) iteration of the same problem we all started laughing like a group of Beavis and Butt-Heads during a vendor’s presentation, and we had to apologize profusely (it wasn’t their fault, after all).

Huh huhuhuhuhuh… he said “scalability issues”

In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.

Hello my name is Simon, and I like to do draw-wrings

But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.

So what is “The Problem”? It’s a multi-faceted and interrelated set of issues:

Virtualization Has Messed Things Up, Big Time

The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact it’s had on the network is arguably orders of magnitude greater. Virtualization wants big, flat networks, just when we’d gotten to the point where we could route Layer 3 as fast as we could switch Layer 2 and keep our Layer 2 domains nice and small.

And it’s not just virtualization in general; much of the impact comes from the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to make that happen. Add to that the vSwitch sitting inside the hypervisor, and the limited functionality of that switch (and who the hell manages it anyway? Server team? Network team?).

4000 VLANs Ain’t Enough

If you’re a single enterprise running your own network, chances are 4000+ VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, 4000+ VLANs quickly becomes a problem. There is a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 VLANs (16 million or so).
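
In case the “16 million or so” sounds hand-wavy, here’s the back-of-the-envelope math: an 802.1Q tag carries a 12-bit VLAN ID, QinQ stacks two of those tags, and VXLAN uses a 24-bit segment ID, which works out to the same number.

```python
# 802.1Q carries a 12-bit VLAN ID; QinQ stacks an outer and an inner tag;
# VXLAN's VNI is a 24-bit field.
plain_vlans = 2 ** 12
qinq_segments = plain_vlans * plain_vlans
vxlan_vnis = 2 ** 24

print(plain_vlans)     # 4096
print(qinq_segments)   # 16777216
print(vxlan_vnis)      # 16777216 -- the "16 million or so"
```
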

Spanning Tree Sucks

One of my first introductions to networking was accidentally causing a bridging loop on a 10 megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I’d accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself by plugging in and unplugging the port to watch the utilization LED fluctuate (that is, until the network admin stormed in and asked what the hell was going on with his network).

And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.

OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.

Layer 2 Switching Isn’t Scaling

The current method by which MAC addresses are learned in modern switches causes two problems: only one viable path can be active at a time (the only way to prevent loops is to prevent multiple paths by blocking ports), and large Layer 2 networks involve so many MAC addresses that the tables just don’t scale.

From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another tuple (another recurring word) for a frame/packet/segment traveling from one end of your datacenter to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.

There’s A Lot of Shit To Configure

Automation is coming, and in a big way. Whether it’s a centralized controller environment or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcome, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).

Data center automation is a little bit like the Gom Jabbar. They tried and failed, you ask? They tried and died.

“What’s in the box?”

“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”

Ethernet Rules Everything Around Me

It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”

But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.

Oof. Storage.

Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture references in my blog posts. Some companies are pro-iSCSI, others pro-FC/NFS, and some, like Greg Ferro, have the purest of all hate: he hates SCSI.

“Yo iSCSI, I’m really happy for you and imma let you finish, but Fibre Channel is the best storage protocol of all time”

So that’s “The Problem”. And for the most part, the articles on Networking Field Day, and the solutions the vendors propose will be framed around The Problem.

Your Momma Is So Proprietary

Let’s talk about a very sensitive subject for both networking admins and networking vendors: The subject of proprietary technologies.

The word proprietary in most cases has a very negative connotation. Most network designers would prefer that everything be based on open standards, like OSPF and (shudder) Spanning Tree. After all, IP and Ethernet are open standards, and those, along with many other open standard technologies, make the Internet and the industry what they are today. But at the same time, we can be a bit hypocritical, in that we also tend to want awesome features that are often on the proprietary side.

Conversely, most network vendors would love to come up with the Colonel’s Secret Recipe that makes their stuff so awesome that no sane engineer would dare use anything else. But they also like to say they’re open, in order to allay fears a customer might have of being “locked in”. So when vendors go after customers, you’ll hear “open” a lot. When vendors go after each other, you hear “proprietary” thrown about as an epithet. And when a vendor is accused of being proprietary, they often lash out into an epic battle of “your momma is so proprietary”.

Proprietary Bad!

So last week there was a discussion on Twitter between former Cisco employee and new Dell Force10 employee Brad Hedlund (@bradhedlund), and former Cisco employee and new Juniper employee Christopher Hoff (@beaker). (By the way, they are both people I admire and respect.)

I believe they were talking about the different approaches their respective companies are taking to solve the evolving needs of modern data centers. Juniper’s solution is QFabric, while Dell Force10 is going the NVGRE/VXLAN/OpenFlow route. Brad cited QFabric as proprietary, and Christopher Hoff countered that Cisco’s FEX is also proprietary. And while true, something about that bothered me a bit.

QFabric and FEX are both proprietary, but the scope of the proprietary-ness is very different. With QFabric, you can build a huge network fabric, without worrying about spanning-tree, and have one control plane for a whole mesh of switches. With FEX, you can plug what looks like a switch into a Nexus 5000 or Nexus 7000, and that switch behaves like a line card on the 5000/7000. FEX affects the next hop. QFabric can affect your entire data center.

FEX is pretty limited, and honestly I think it’s fairly inconsequential in terms of its proprietariness. You can use FEX, or just hang another Cisco switch off a 5K/7K, like a Nexus 3000 (with its merchant silicon), or even an Arista or Juniper box. Even if you use FEX, the effect is limited to one switch hop away. How concerned would a designer be about the effect of proprietary FEX? Not very; it has little effect beyond that one hop.

The effect of QFabric, however, is potentially far more wide ranging.

That’s no moon, that’s a data center fabric

From the Packet Pushers episode on QFabric (episode 51), Abner Germanow talks about 500 10 Gigabit Ethernet ports being the point where QFabric makes sense, which is a pretty large investment. If you figure roughly $2,000 a port, that makes it a $1,000,000 decision. If you order enough FEX/Nexus switches you can spend that much too, but you can go step by step and back out if you want.

With the proprietary versus open debate, it’s quite understandable that Juniper is very sensitive to the word “proprietary”. However, it’s tough to classify QFabric as anything but, as Ivan Pepelnjak says, “completely proprietary”.

Right now there are several open standards, such as TRILL, SPB, OpenFlow, VXLAN, FCoE, NVGRE and others, looking to solve many of the same data center problems that QFabric looks to solve. And from the looks of it, Juniper has been rather dismissive of some of the open fabric standard technologies, such as the much-discussed “Why TRILL Won’t Work For The Data Center” argument (requires registration, fuck you TechTarget). Juniper is also taking a wait-and-see approach to VXLAN.

Even so, I don’t think Juniper should care if people call it proprietary. Yes, it’s proprietary. And yes, the effect of this proprietary-ness is huge compared to Cisco’s FEX because it affects more of the data center. But that’s a good thing.

Right now these open standards are mostly brand-spanking new, and no one is bat-shit crazy enough to build a multi-vendor fabric based on them.

OK, maybe there is someone crazy enough to build a multi-vendor Ethernet fabric

So QFabric has the advantage there, since even open standards are likely to be vendor-locked for now. And QFabric is a bit more mature than most of the new standards, in that it’s at least implemented and released. (Despite the terrible, and I mean just awful, PR move bashing Juniper. Seriously, Cisco, that shit reeks of sophomoric desperation. I feel cheap even linking it.)

What we do have to consider, however, is that in time the interoperability and maturity situation will be different, as it is for mature open standards today. It’s very common to have multi-vendor 802.1Q, OSPF, IS-IS, BGP, and spanning-tree deployments without thinking twice about it. There will likely be a day when whatever new standards we’re dealing with now succeed and evolve to the point where we wouldn’t think twice about building, say, a TRILL fabric with multiple vendors, like we do now with spanning-tree.

So QFabric is proprietary, and it is not going to play well with others. That doesn’t discount it as a solution, but it is a serious consideration, more so than something like proprietary FEX. Proprietary has its advantages and disadvantages, and the effect can be substantial or inconsequential; all are factors to consider. I won’t even hazard a guess at this point as to how it’s going to play out, but like a good Twitter battle, I’m going to enjoy watching.

FCoE: I’m not Dead! Arista: You’ll Be Stone Dead in a Moment!

I was at Arista on Friday for Tech Field Day 8, and when FCoE came up (always a good way to get a lively discussion going), Andre Pech from Arista (who did a fantastic job as a presenter) brought up an article written by Douglas Gourlay, another Arista employee, entitled “Why FCoE is Dead, But Not Buried Yet“.

FCoE: “I feel happy!”

It’s an interesting article, because much of the player-hating seems to be directed at TRILL, not FCoE, and as J Metz has said time and time again, you don’t need TRILL to do FCoE if you do FCoE the way Cisco does (by using Fibre Channel Forwarders in each FCoE switch). Arista, not having any Fibre Channel skills, can’t do it this way. If they were to do FCoE, Arista (like Juniper) would need to do it the sparse-mode/FIP-snooping way, which would need a non-STP way of handling multi-pathing, such as TRILL or SPB.

Jayshree Ullal, the CEO of Arista, hated on TRILL and spoke highly of VXLAN and NVGRE (Arista is on the standards body for both). I think part of that is that, like Cisco, not all of their switches will be able to support TRILL, since TRILL requires new Ethernet silicon.

Even the CEO of Arista acknowledged that FCoE works great at the edge, where you plug a server with an FCoE CNA into an FCoE switch, and the traffic is sent along to native Ethernet and native Fibre Channel networks from there (what I call single-hop or no-hop FCoE). This doesn’t require any additional FCoE infrastructure in your environment, just the edge switch. The Cisco UCS Fabric Interconnects are a great example of this no-hop architecture.

I don’t think FCoE is quite dead, but I have to imagine it’s not going as well as vendors like Cisco had hoped. At least, it’s not been the success that some vendors imagined. And I think there are two major contributors to FCoE’s failure to launch, and both of them are more Layer 8 than Layer 2.

Old Man of the Data Center

Reason number one is also the reason why we won’t see TRILL/Fabric Path deployed widely: It’s this guy:

Don’t let him trap you into hearing him tell stories about being a FDDI bridge, whatever FDDI is

The Catalyst 6500 series switch. This is “The Old Man of the Data Center”. And he’s everywhere. The switch is a bit long in the tooth, and although capacity is much higher on the Nexus 7000s (and even the 5000s in some cases), the Catalyst 6500 still has a huge install base.

And it won’t ever do FCoE.

And it (probably) won’t ever do TRILL/Fabric Path (spanning-tree fo-evah!)

The 6500s aren’t getting replaced in significant numbers from what I can see. Especially with the release of the Sup 2T supervisor for the 6500, the 6500s aren’t going anywhere anytime soon. I can only speculate as to why Cisco keeps pushing the 6500 so hard, but I think it comes down to two reasons.

Another reason customers haven’t replaced the 6500s is that the Nexus 7000 isn’t a full-on replacement: it has no service modules, limited routing capability (it only recently got the ability to do MPLS), and a form factor that’s much larger than the 6500’s (although the 7009 just hit the streets with a form factor very similar to the 6500’s, which begs the question: why didn’t Cisco release the 7009 first?).

Premature FCoE

So reason number two? I think Cisco jumped the gun. They’ve been pushing FCoE for a while, but they weren’t quite ready. It wasn’t until July 2011 that Cisco released NX-OS 5.2, which is what’s required to do multi-hop FCoE on the Nexus 7000s and MDS 9000s. They’ve had the ability to do multi-hop FCoE on the Nexus 5000s for a bit longer, but not much. Yet they’ve been talking about multi-hop for longer than it was possible to actually implement it. Cisco has had a multi-hop FCoE reference architecture posted on their website since March 2011, showing a beautifully designed multi-hop FCoE network with 5000s, 7000s, and MDS 9000s that for months wasn’t possible to implement. Even today, if you wanted to implement multi-hop FCoE with Cisco gear (or anyone else’s), you’d be a very, very early adopter.

So no, I don’t think FCoE is dead. No-hop FCoE is certainly successful (even Arista’s CEO acknowledged as much), and I don’t think even multi-hop FCoE is dead, but it certainly hasn’t caught on (yet). Will multi-hop FCoE catch on? I’m not sure. We’ll have to see.

Fibre Channel and Ethernet: The Odd Couple

Fibre Channel? Meet Ethernet. Ethernet? Meet Fibre Channel. Hilarity ensues.

The entire thesis of this blog is that the traditional data center silos are collapsing. We are witnessing the rapid convergence of networking, storage, virtualization, server administration, security, and who knows what else. It’s becoming more and more difficult to be “just a networking/server/storage/etc person”.

One of the byproducts of this is the often hilarious fallout from conflicting interests, philosophies, and mentalities. And perhaps the greatest friction comes from the conflict of storage and network administrators. They are the odd couple of the data center.

Storage and Networking: The Odd Couple

Ethernet is the messy roommate. Ethernet just throws its shit all over the place, dirty clothes never end up in the hamper, and I think you can figure out Ethernet’s policy on dish washing. It’s disorganized and loses stuff all the time. Overflow a receive buffer? No problem. Hey, Ethernet, why’d you drop that frame? Oh, I dunno, because WRED, that’s why.

WRED is the Yosemite Sam of Networking

But Ethernet is also really flexible, and compared to Fibre Channel (and virtually all other networking technologies) inexpensive. Ethernet can afford to be messy, because it either relies on higher-layer protocols to handle dropped frames (TCP) or it just doesn’t care (UDP).

Fibre Channel, on the other hand, is the anal-retentive network: A place for everything, and everything in its place. Fibre Channel never loses anything, and keeps track of it all.

There now, we’re just going to put this frame right here in this reserved buffer space.

The overall philosophies are vastly different between the two. Ethernet (and the TCP/IP stack on top of it) is meant to be flexible, mostly reliable, and lossy. You’ll probably get your Layer 2 frames and Layer 3 packets from one host to another, but there’s no guarantee. Fibre Channel is meant to be inflexible (compared with Ethernet), absolutely reliable, and lossless.

Fibre Channel and Ethernet have very different philosophies when it comes to building out a network. For instance, in Ethernet networks, we cross-connect the hell out of everything. Network administrators haven’t met two switches they didn’t want to cross-connect.


Did I miss a way to cross-connect? Because I totally have more cables

It’s just one big cloud to Ethernet administrators. For Fibre Channel administrators, one “SAN” is an abomination. There are always two air-gap-separated, completely separate fabrics.

The greatest SAN diagram ever created

The Fibre Channel host at the bottom is connected into two separate, Gandalf-separated, non-overlapping Fibre Channel fabrics. This gives the host two independent paths to the same storage array for full redundancy. You’ll note that the Fibre Channel switches on both sides have two links from switch to switch within the same fabric. Guess what? They’re both active. Multi-pathing in Fibre Channel is allowed through the use of the FSPF (Fabric Shortest Path First) protocol. Fibre Channel switch to Fibre Channel switch is what we would consider, in the Ethernet world, Layer 3 routed. It’s enough to give one multi-path envy.

One of the common ways (although by no means the only way) that an Ethernet frame can meet an unfortunate demise is through tail drop or WRED on a receive buffer. As a buffer in Ethernet gets full, WRED or a similar technology will typically start to randomly drop frames. The closer the buffer gets to full, the faster frames are randomly dropped. WRED prevents tail drop, which is bad for TCP, by dropping frames before the buffer fills completely.

Essentially, an Ethernet buffer is a bit like Thunderdome: Many frames enter, not all frames leave. With Ethernet, if you tried to do full line rate of two 10 Gbit links through a single 10 Gbit choke point, half the frames would be dropped.
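
Here’s a toy sketch of the WRED curve. The thresholds and drop probability are made up, and real WRED works on a weighted average queue depth with per-class profiles, but the shape is the point: nothing gets dropped below a minimum threshold, the drop probability ramps up between min and max, and past max it’s tail drop for everybody.

```python
def wred_drop_probability(queue_depth, min_th=40, max_th=100, max_p=0.10):
    """Toy WRED curve. Thresholds are invented, and real WRED uses a weighted
    average queue depth rather than the instantaneous depth used here."""
    if queue_depth < min_th:
        return 0.0            # plenty of room: never drop
    if queue_depth >= max_th:
        return 1.0            # buffer effectively full: tail drop everything
    # Linear ramp from 0 up to max_p between the two thresholds.
    return max_p * (queue_depth - min_th) / (max_th - min_th)

for depth in (20, 50, 80, 99, 120):
    print(depth, round(wred_drop_probability(depth), 3))
```
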

To a Fibre Channel administrator, this is barbaric. Fibre Channel is much more civilized, with its use of buffer-to-buffer (B2B) credits. Before a Fibre Channel frame is sent from one port to another, the sending port reserves space in the receiving port’s buffer. A Fibre Channel frame won’t get sent unless there’s guaranteed space at the receiving end. This ensures that no matter how much you oversubscribe a port, no frames will get lost. Also, when a Fibre Channel frame meets another Fibre Channel frame in a buffer, it asks for the Grey Poupon.

With Fibre Channel, if you tried to push two 8 Gbit links through a single 8 Gbit choke point, no frames would be lost, and each 8 Gbit port would end up throttled back to roughly 4 Gbit through the use of B2B credits.
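
A toy model of that credit handshake, with an invented credit count: the sender starts with however many credits the receiver advertised at login, burns one per frame, and simply stops when it hits zero until the receiver drains its buffer and hands credits back (R_RDY). Nothing is dropped; the sender just gets throttled, which is where the 8-Gbit-down-to-4-Gbit behavior comes from.

```python
class ToyFCLink:
    """Toy buffer-to-buffer credit model; the credit count is invented."""
    def __init__(self, credits=2):
        self.credits = credits      # advertised by the receiver at fabric login
        self.receiver_buffer = []

    def send_frame(self, frame):
        if self.credits == 0:
            return False            # sender must wait; nothing gets dropped
        self.credits -= 1           # one credit burned per frame sent
        self.receiver_buffer.append(frame)
        return True

    def receiver_drains_one(self):
        if self.receiver_buffer:
            self.receiver_buffer.pop(0)
            self.credits += 1       # R_RDY hands a credit back to the sender

link = ToyFCLink(credits=2)
print([link.send_frame(f) for f in ("A", "B", "C")])  # [True, True, False]
link.receiver_drains_one()                            # buffer drains, credit returned
print(link.send_frame("C"))                           # True: throttled, never lost
```
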

Why is Fibre Channel so anal retentive? Because SCSI, that’s why. SCSI is the protocol most enterprise servers use to communicate with storage. (I mean, there’s also SATA, but SCSI makes fun of SATA behind SATA’s back.) Fibre Channel runs the Fibre Channel Protocol, which encapsulates SCSI commands onto Fibre Channel frames (as odd as it sounds, Fibre Channel and Fibre Channel Protocol are two distinct technologies). FCP is essentially SCSI over Fibre Channel.

SCSI doesn’t take kindly to dropped commands. It’s a bit of a misconception that SCSI can’t tolerate a lost command. It can, it just takes a long time to recover (relatively speaking). I’ve seen plenty of SCSI errors, and they’ll slow a system down to a crawl. So it’s best not to lose any SCSI commands.

The Converged Clusterfu… Network

We used to have separate storage and networking environments. Now we’re seeing an explosion of convergence: Putting data and storage onto the same (Ethernet) wire.

Ethernet is the obvious choice, because it’s the most popular networking technology. Port for port, Ethernet is the most inexpensive, most flexible, most widely deployed networking technology around. It has slain the FDDI dragon and the Token Ring revolution, and now it has its sights set on the Fibre Channel Jabberwocky.

The current two competing technologies for this convergence are iSCSI and FCoE. SCSI doesn’t tolerate failure to deliver a SCSI command very well, so both iSCSI and FCoE have ways to guarantee delivery. With iSCSI, delivery is guaranteed because iSCSI runs on TCP, the reliable Layer 4 protocol. If a lower-level frame or packet carrying a TCP segment gets lost, no big deal: TCP uses sequence numbers, which are like FedEx tracking numbers, and can re-send a lost segment. So go ahead, WRED, do your worst.

FCoE provides losslessness through priority flow control (PFC), which is similar in spirit to B2B credits in Fibre Channel. Instead of reserving space in the receiving buffer, PFC keeps track of how full a particular buffer is, the one dedicated to FCoE traffic. If that FCoE buffer gets close to full, the receiving Ethernet port sends a PAUSE MAC control frame to the sending port, and the sending port stops. This is done hop by hop, so end-to-end FCoE traffic is guaranteed to arrive without dropped frames. For this to work, though, the Ethernet switches need to speak PFC, which isn’t part of the regular Ethernet standard, and is instead part of the DCB (Data Center Bridging) set of standards.
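
A sketch of the PFC idea, with invented thresholds (real PFC works per priority class, with high and low watermarks tuned to link speed and cable length): cross the high-water mark and the receiver tells its neighbor to pause that priority; drain below the low-water mark and traffic resumes.

```python
class ToyPFCQueue:
    """Toy per-priority PFC watermarks; the threshold numbers are invented."""
    def __init__(self, pause_at=80, resume_at=40):
        self.depth = 0
        self.pause_at = pause_at
        self.resume_at = resume_at
        self.paused = False

    def enqueue(self, frames):
        self.depth += frames
        if not self.paused and self.depth >= self.pause_at:
            self.paused = True
            print("-> send per-priority PAUSE for the FCoE class")

    def dequeue(self, frames):
        self.depth = max(0, self.depth - frames)
        if self.paused and self.depth <= self.resume_at:
            self.paused = False
            print("-> send PAUSE with zero quanta (resume)")

q = ToyPFCQueue()
q.enqueue(85)   # crosses the high-water mark: tell the sender to stop
q.dequeue(50)   # drains below the low-water mark: let it send again
```
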

Hilarity Ensues

Like the shields of the Enterprise, converged networking is in a state of flux. Network administrators and storage administrators are not very happy with the result. Network administrators don’t want storage traffic (and its silly demands for losslessness) on their data networks. Storage administrators are appalled by Ethernet and its devil-may-care attitude towards frames. They’re also not terribly fond of iSCSI, and only grudgingly accepting of FCoE. But convergence is happening, whether they like it or not.

Personally, I’m not invested in any particular technology. I’m a bit more pro-iSCSI than pro-FCoE, but I’m warming to the latter (and certainly curious about it).

But given how dyed-in-the-wool some network administrators and storage administrators are, the biggest problems in convergence won’t be the technology, but the Layer 8 issues it generates. My take is that it’s time to think like a data center administrator, and not a storage or network administrator. However, that will take time. Until then, hilarity ensues.

Multi-Path Ethernet: The Flying Cars of the Data Center

Update 8/23/11: I’ve added a bit of info on Brocade VCS

In the movie Ghostbusters, Dr. Egon Spengler gave a dire warning to the other Ghostbusters: “Don’t cross the streams.”

That’s a bridging loop waiting to happen

In Ethernet, we have a similar warning: “Don’t ever let there be more than one way to get anywhere”. Ethernet is too stupid to handle a situation where a single source MAC address can get to a destination MAC address by more than one path.

One of the reasons for this is that the Ethernet frame format lacks a Layer 2 version of IP’s TTL. TTLs are decremented at each hop, so if an IP packet does find itself hitting the same place over and over again, eventually the TTL will hit zero and the packet will get dropped. Annoying, but the network isn’t going to be flooded with an ever-increasing barrage of lost packets. With Ethernet and no TTL, a frame can be forwarded in a loop indefinitely. It’s not total protonic reversal, but it’s still bad. (Ever cause a bridging loop? I have, it’s a hoot.)
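
A quick toy illustration of the difference: give a packet a TTL and a forwarding loop kills it after a handful of hops; take the TTL away, like Ethernet does, and the same loop forwards the frame until something melts (and that’s before the flooding starts multiplying copies).

```python
def forward_through_loop(ttl=None, max_hops=1_000_000):
    """Bounce a packet around a forwarding loop until it dies or we give up."""
    hops = 0
    while hops < max_hops:
        if ttl is not None:
            ttl -= 1              # routers decrement TTL at every hop
            if ttl == 0:
                return f"dropped after {hops + 1} hops (TTL expired)"
        hops += 1
    return f"still looping after {max_hops} hops (and floods keep multiplying copies)"

print(forward_through_loop(ttl=64))  # IP: dropped after 64 hops
print(forward_through_loop())        # Ethernet: loops until something melts
```
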

We tend to build redundant networks because it’s a good idea to have, you know, redundancy. And redundancy means multiple paths, which violates Egon’s golden rule. A bit of a conundrum.

Of course, we’ve been doing multiple paths without bridging loops for years. The primary solution for the past 21 years has been the spanning-tree protocol. (Fancy that, spanning-tree is legal to drink in the US.)

If spanning tree is drinking, it’ll be buying its own drinks, because no one likes spanning tree, thanks to (but not limited to) these annoying attributes:

  • Links are active/standby
  • Topology changes can cause network-wide connectivity outages for 60+ seconds
  • Even rapid spanning-tree causes network outages for several seconds
  • There are several dozen ways to mess up and royally screw up your network (root bridge priority, timer values too low/too high, etc.)

This best expresses our collective feeling for spanning-tree protocol

Some network architectures avoid STP altogether by making every pair of switches its own isolated Layer 2 network, with Layer 3 routing between the pairs. Using a routing protocol like OSPF and fast enough MLS (multi-layer switches), you can build a completely meshed network with plenty of multi-pathing.

Radia Perlman is the mother of STP, although it appears she didn’t quite intend it to end up being used the way it was. She came up with a replacement called TRILL. The IEEE apparently brushed her off, so she went to the IETF. The IEEE then said “wait a minute” and came up with their own 802.1aq (Shortest Path Bridging, or SPB). You can take a look at an interesting TRILL/SPB smackdown at NANOG here.

IETF/IEEE is the biggest beef since west coast/east coast

Cisco, the largest network vendor, is going the TRILL route, but since TRILL isn’t done yet, they came up with a pre-standard implementation they call Fabric Path. Juniper has come up with their own multi-path technology, not remotely based on a standard, called QFabric.

So why the need for multi-path Ethernet?

For one, STP sucks. A lot. No one likes it. It would sit alone at the lunch table, and not because the other kids are mean, but because STP is a total jerk and we’re sick of its shit. We could go with Layer 3, but that won’t work with virtualization.

On top of that, virtualization has really kicked the need for multi-path up another notch by adding a lot of east/west traffic from virtual machine live migrations (vMotion). If I have to traverse the core every time I do a vMotion from one access switch to another, that’s going to be a very busy core. The core becomes a choke point, whereas multi-path lets us build more of a mesh.

Also we’re putting a lot more than data on these networks; we’re adding storage too (iSCSI/FCoE/NFS). So we’re greatly increasing the demand for bandwidth, and it doesn’t make sense to have 10 Gbit links sitting idle.

Another advantage is convergence time. If you go to page 4 of the Fabric Path review at Network World, they found that re-routing around a path failure took 162 milliseconds, a helluva lot quicker than even rapid spanning tree.

So multi-path is great, yada yada. The trick is, you can’t really implement it yet.

It looks like SPB as a protocol is done, but it’s mostly been metro Ethernet vendors that have adopted it (it works for both metro and data center Ethernet). TRILL hasn’t been finalized as far as I can tell, although it’s supposed to be real soon now(tm).

Fabric Path from Cisco is shipping, however it only runs right now on the Nexus 7000 series, which hasn’t exactly taken data centers by storm (Cisco’s choice of selling a huge 7010 and an even bigger 7018 is curious). The most popular data center switch is still the venerable “old man of the data center”, the Catalyst 6500, and it probably won’t ever do Fabric Path or TRILL.

The 6500 could *potentially* do SPB from what I can tell (SPB doesn’t change the Ethernet frame format in a way that requires new ASICs), but that would require significant development in IOS, which isn’t exactly easy to develop for (one of the reasons why Cisco is moving to NX-OS). The 6500 is also holding back a lot of other data center technologies, like FCoE.

QFabric from Juniper is apparently running in some customer locations, but it’s not released to the general public and isn’t likely to be any time soon. QFabric is also proprietary, and not in the “pre-standard” proprietary way that can be changed to interoperate with other vendors at a later date, but in the “not a chance in hell will you mesh in another vendor’s product” way. As far as I can tell, at least.

Brocade has VCS, which apparently implements a TRILL-like multipath setup, although instead of the IS-IS protocol that TRILL uses, VCS uses FSPF (the routing protocol that Fibre Channel uses). It makes sense that Brocade would use FSPF, since they’re a company with heavy FC chops that obtained its Ethernet chops through the acquisition of Foundry. It looks like it only runs on two of their ToR switches, however.

So it looks like we’re stuck with single-path Ethernet networks for the foreseeable future, and using all sorts of tricks/hacks to get around Ethernet’s limitations, like EtherChannel, as well as VSS/vPC and other multi-chassis aggregation. However, those don’t really let us do a “full mesh” network like we’re dreaming of.

So it seems like for now, multi-path Ethernet is like flying cars: We should have had them by now, but we don’t.