The Problem

One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.

It was a theme because, as vendor after vendor presented, they essentially said the same thing when describing the problem they were about to solve. For us delegates/bloggers, it quickly went from the problem to “The Problem”. We heard it so often that during the (5th?) iteration we all started laughing like a group of Beavis and Butt-Heads in the middle of a vendor’s presentation, and we had to apologize profusely (it wasn’t their fault, after all).

Huh huhuhuhuhuh… he said “scalability issues”

In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.

Hello my name is Simon, and I like to do draw-wrings

But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.

So what is “The Problem”? It’s a multi-faceted and interrelated set of issues:

Virtualization Has Messed Things Up, Big Time

The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact it’s had on the network is arguably orders of magnitude greater. Virtualization wants big, flat Layer 2 networks, just when we’d gotten to the point where we could route Layer 3 as fast as we could switch Layer 2, and could finally keep our Layer 2 domains small.

And it’s not just virtualization in general; much of the impact comes from the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to stretch Layer 2 wherever they might land. Add to that the vSwitch sitting inside the hypervisor, and the limited functionality of that switch (and who the hell manages it anyway? The server team? The network team?).

4000 VLANs Ain’t Enough

If you’re a single enterprise running your own network, chances are 4000-odd VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, that 4000-VLAN ceiling quickly becomes a problem. There’s a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 segments (16 million or so).
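
To put rough numbers on that multiplier (a back-of-the-envelope sketch, nothing vendor-specific): an 802.1Q tag carries a 12-bit VLAN ID, QinQ stacks two of those tags, and VXLAN uses a 24-bit segment ID, so both of the latter land at roughly 16 million segments.

    # Back-of-the-envelope math on segment ID spaces (illustrative only)
    VLAN_ID_BITS = 12    # 802.1Q tag: 12-bit VLAN ID
    VNI_BITS = 24        # VXLAN Network Identifier: 24 bits

    vlans = 2 ** VLAN_ID_BITS    # 4,096 (4,094 usable in practice)
    qinq = vlans * vlans         # 802.1ad double tag: 16,777,216
    vxlan = 2 ** VNI_BITS        # 16,777,216

    print(f"802.1Q: {vlans:,}  QinQ: {qinq:,}  VXLAN: {vxlan:,}")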

Spanning Tree Sucks

One of my first introductions to networking was accidentally causing a bridging loop on a 10-megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I’d accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself by plugging in and unplugging the port to watch the utilization LED fluctuate (that is, until the network admin stormed in and asked what the hell was going on with his network).

And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.

OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.

Layer 2 Switching Isn’t Scaling

The way MAC addresses are learned in modern switches causes two problems: only one viable path can be active at a time (the only way to prevent loops is to prevent multiple paths by blocking ports), and a large Layer 2 network carries so many MAC addresses that every switch has to hold state for all of them, which doesn’t scale.

From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another field in the tuple (another recurring word) for a frame/packet/segment traveling from one end of your data center to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.
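
As a toy illustration of that edge learning (my own sketch, not any vendor’s implementation, and the switch names are invented): the core forwards on switch identifiers, and only the edge keeps a table mapping host MACs to the switch they sit behind.

    # Toy sketch of edge MAC learning: the fabric core routes on switch
    # IDs, and only edge switches map host MACs to an egress switch.
    mac_table = {}  # MAC address -> edge switch the host sits behind

    def learn(mac: str, edge_switch: str) -> None:
        # Learned from the source MAC of frames arriving at the edge
        mac_table[mac] = edge_switch

    def forward(dst_mac: str) -> str:
        # Known MAC: "route" straight to the owning edge switch.
        # Unknown MAC: flood, just like a classic learning bridge.
        return mac_table.get(dst_mac, "flood-to-all-edges")

    learn("00:1b:21:aa:bb:cc", "edge-switch-7")
    print(forward("00:1b:21:aa:bb:cc"))  # edge-switch-7
    print(forward("00:1b:21:de:ad:00"))  # flood-to-all-edges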

There’s A Lot of Shit To Configure

Automation is coming, and in a big way. Whether it’s a centralized controller environment or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcome, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).

Data center automation is a little bit like the Gom Jabbar. They tried and failed, you ask? They tried and died.

“What’s in the box?”

“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”
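
To make “a lot of shit to configure” concrete, here’s a minimal sketch of the kind of toil automation is chasing: stamping the same VLAN onto a rack’s worth of switches (hostnames and config syntax are invented for illustration).

    # Minimal sketch of configuration toil (hostnames/syntax invented)
    switches = [f"leaf-{n:02d}" for n in range(1, 25)]  # 24 top-of-rack switches

    def vlan_config(vlan_id: int, name: str) -> str:
        return f"vlan {vlan_id}\n name {name}\n"

    for switch in switches:
        # In real life a controller or script would push this over SSH,
        # NETCONF, or an API; here we just print what would be sent.
        print(f"=== {switch} ===")
        print(vlan_config(42, "web-tier"))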

Ethernet Rules Everything Around Me

It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”

But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.

Oof. Storage.

Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture references in my blog posts. Some companies are pro-iSCSI, others pro-FC or pro-NFS, and some, like Greg Ferro, have the purest of all hate: he hates SCSI.

“Yo iSCSI, I’m really happy for you and imma let you finish, but Fibre Channel is the best storage protocol of all time”

So that’s “The Problem”. And for the most part, the articles on Networking Field Day and the solutions the vendors propose will be framed around The Problem.

Responses to The Problem

  1. Ethan Banks says:

    You busted out a Dune reference. #winning

  2. Ben Johnson says:

    Interesting post; entertainingly written as usual 🙂 So, at least for now, we have to hold cap in hand and admit it IS the network? It feels like heresy but, if virtualization=good (broadly speaking) and storage/networking convergence is the way of the [near] future, then is it the network that’s digging in its heels a bit in terms of progress?

    • tonybourke says:

      A very interesting thought (and possibly another blog post entirely). My initial thoughts are yes, some of it is in fact the network. The network is at a bit of a disadvantage because certain new technologies require new silicon in order to scale. With servers, it’s usually new code on an OS. With switches and routers, software doesn’t cut it for the most part. If a new type of Ethernet frame is required (as it is with TRILL), then new silicon needs to be spun.

      But very interesting thought.

  3. Tony,

    If Greg “hates on SCSI”, that’s like hating on Ethernet. Both are survivors in their ecological niche for similar reasons. Learn to embrace them I say. 😉

    “Whatever replaces SCSI will be called SCSI.” – The Tao of Storage

  4. Anshuman Jain says:

    VXLAN solves the problem of scaling VLANs. It also solves the issue of isolation in a multi-tenant environment.

    Why, then, is VXLAN not adopted?
    What are the issues with VXLAN that are making companies so desperate to look for alternative solutions like NVGRE and LISP?

  5. tonybourke says:

    VXLAN is an interesting idea, but it doesn’t solve anything right now, as it’s a brand-new standard, and only one product (Open vSwitch?) even supports it. VMware’s own vSwitch doesn’t, and neither does Cisco’s Nexus 1000V.

    Also, it’s purely a software solution right now. Physical devices can’t terminate VXLAN tunnels, and won’t for a while. The silicon needs to be spun to do that, and that doesn’t happen overnight.

    So could it be “The Solution” to “The Problem”? Perhaps. Time will tell, but it’s not solving The Problem right now.

    LISP solves a different “Problem”, in that an IP address is both an identifier and a locator. We can’t have the same IP address in two places, and an IP address is typically stuck to a geographical area. BGP doesn’t allow us to do host routes (/32) because the BGP tables would grow far too large.
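
    A toy sketch of that identifier/locator split (addresses invented; this is the concept, not LISP’s actual wire format): the host’s identifier never changes, and moving the host just updates which routable locator the mapping system hands back.

        # Toy identifier/locator split (invented addresses, not LISP's wire format)
        eid_to_rloc = {"10.1.1.5": "203.0.113.1"}   # endpoint ID -> routable locator

        def lookup(eid: str) -> str:
            return eid_to_rloc[eid]                 # stand-in for a map-resolver query

        print(lookup("10.1.1.5"))                   # 203.0.113.1 (original site)
        eid_to_rloc["10.1.1.5"] = "198.51.100.7"    # host moves; its EID is unchanged
        print(lookup("10.1.1.5"))                   # 198.51.100.7 (new site)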

    NVGRE? The benefit there is that switch hardware should be able to handle it in silicon, since it’s GRE-based. The drawback versus VXLAN is that since NVGRE encapsulates in GRE rather than in UDP as VXLAN does, ECMP won’t work as well, because we can’t load balance flows based on UDP port.
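
    To show why that UDP port matters for ECMP (a toy hash, not any switch’s real algorithm): VXLAN derives the outer UDP source port from a hash of the inner flow, so different inner flows between the same two tunnel endpoints can land on different links, while GRE gives the hash nothing beyond the IP pair.

        # Toy ECMP illustration (not any real switch's hash algorithm)
        import zlib

        def ecmp_link(src_ip, dst_ip, src_port, dst_port, links=4):
            key = f"{src_ip} {dst_ip} {src_port} {dst_port}".encode()
            return zlib.crc32(key) % links

        # VXLAN: the outer UDP source port varies per inner flow, spreading load
        for sport in (49152, 53211, 61007):
            print("vxlan flow ->", ecmp_link("10.0.0.1", "10.0.0.2", sport, 4789))

        # GRE: no L4 header, so the hash only ever sees the one IP pair
        print("gre flows  ->", ecmp_link("10.0.0.1", "10.0.0.2", 0, 0))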

  6. Ben O'Rourke says:

    Talking about QinQ and not using an Xzibit meme?

    http://memegenerator.net/instance/11222390

  7. Anshuman Jain says:

    The implementation of VXLAN is lacking right now, but does the concept in itself have a shortcoming, or does it give birth to a new issue?
