OpenFlow/SDN Won’t Scale?

I got in a conversation today on Twitter about SDN/SDF (software defined forwarding), a new term I totally made up to describe the programmatic, centralized control of forwarding tables on switches and multi-layer switches. The comment was made that OpenFlow in particular won't scale, which reminded me of an article by Doug Gourlay of Arista talking about scalability issues with OpenFlow.

The argument Doug Gourlay made is essentially that OpenFlow can't keep up with the number of new flows in a network (check out points 2 and 3). In a given data center, there would be tens of thousands (or millions, or tens of millions) of individual flows running through the network at any given moment. And by flows, I mean keeping track of stateful TCP connections or UDP pseudo-flows. The connection rate would also be pretty high if you're talking dozens or hundreds of VMs, all taking in new connections.

My answer is that yeah, if you’re going to try to put the state of every TCP connection and UDP flow into the network operating system and into the forwarding tables of the devices, that’s not going to scale. I totally agree.

But why would you do that?

Why wouldn’t you, instead of keeping track of every flow, do destination-based forwarding table entries, which is how forwarding tables on switches are currently programmed? The network operating system/controller would learn (or be told about) the specific VMs and other devices within the data center. It could learn this through conversational learning, flooding, static entries (configured by hand), automated static entries (where an API programs it in such as connectivity through vCenter), or externally through traditional MAC flooding and routing protocols.
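To make the idea concrete, here's a rough Python sketch of a controller that keeps one destination-based entry per host rather than per-flow state. All the names and values here are made up for illustration; this isn't any real controller's API.

```python
# Sketch: a controller holds one forwarding entry per host (destination),
# not one per TCP/UDP flow. Table size scales with hosts, not connections.

class Controller:
    def __init__(self):
        # dest MAC -> (switch, port): one entry per VM/host
        self.table = {}

    def learn_host(self, mac, switch, port):
        """Populated via conversational learning, static config, or an
        external inventory API (e.g. something like vCenter)."""
        self.table[mac] = (switch, port)

    def entries_for(self, switch):
        """The locally attached entries to push down to a given switch."""
        return {mac: port for mac, (sw, port) in self.table.items()
                if sw == switch}

ctrl = Controller()
ctrl.learn_host("00:50:56:aa:bb:01", "leaf1", 12)
ctrl.learn_host("00:50:56:aa:bb:02", "leaf2", 7)
print(ctrl.entries_for("leaf1"))  # {'00:50:56:aa:bb:01': 12}
```

The point of the sketch: a thousand VMs means roughly a thousand entries, no matter how many millions of flows those VMs open.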

In that case, the rate of change in the forwarding tables would be relatively low, and not much different than how switches are currently programmed with Layer 3 routes and Layer 2 adjacencies via traditional methods. This would actually be likely to shrink the size of the forwarding tables compared with traditional Ethernet/IP forwarding, as the controller could intelligently prune the forwarding tables of the switches rather than flood and learn every MAC address on every switch that has a given VLAN configured (similar to what TRILL/FabricPath/VCS does).

We don't track every TCP/UDP flow in a data center with traditional networking, and I can't think of any value-add to keeping track of every flow in a given network, even if you could. So why would OpenFlow or any other type of SDF be any different? We'd have roughly the same size tables, and we'd have the added benefit of being able to include Layer 4, VXLAN, NVGRE, or even VLAN headers in forwarding decisions.

I honestly don't know if keeping track of every flow was the original concept with OpenFlow (I can't imagine it would be, but there are a lot of gaps in my OpenFlow knowledge), but it seems an API that programs a forwarding table could be built without keeping track of every gosh darn flow.

Software Defined Forwarding: What's In A Name?

I've got about four or five articles on SDN/ACI and networking in my drafts folder, and there's something that's been bothering me. There's a concept, and it doesn't have a name (at least not one that I'm aware of, though it's possible there is one). In networking, we're constantly flooded with a barrage of new names. Sometimes we even give a name to something that already has a name. But naming things is important. Like a programming pointer, the name of something is a pointer to the specific part of our brain that contains the understanding of a given concept.

Software Defined Forwarding

I decided SDF needed a name because I was having trouble describing a concept, a very important one, in reference to SDN. And here's the concept, with four key aspects:

  1. Forwarding traffic in a way that is different than traditional MAC learning/IP forwarding.
  2. Forwarding traffic based on more than the usual Layer 2 and Layer 3 headers (such as VXLAN headers, TCP and/or UDP headers).
  3. Programming these forwarding rules via a centralized controller (which would be necessary if you’re going to throw out all the traditional forwarding rules and configure this in any reasonable amount of time).
  4. In the case of Layer 2 adjacencies, multipathing is inherent (either through overlays or something like TRILL).
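To illustrate point 2, here's a toy sketch in Python of a forwarding rule that can match on Layer 4 and VXLAN fields in addition to the usual Layer 2/3 headers. This is purely illustrative (OpenFlow-ish in spirit), not any vendor's actual rule format.

```python
# Toy multi-layer forwarding rule: a None field is a wildcard.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    dst_mac: Optional[str] = None
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None   # Layer 4 (TCP/UDP) port
    vni: Optional[int] = None        # VXLAN Network Identifier
    out_port: int = 0

    def matches(self, pkt: dict) -> bool:
        return all(getattr(self, f) is None or pkt.get(f) == getattr(self, f)
                   for f in ("dst_mac", "dst_ip", "dst_port", "vni"))

rules = [
    Rule(dst_ip="10.0.0.5", dst_port=443, out_port=3),  # steer HTTPS one way
    Rule(dst_ip="10.0.0.5", out_port=1),                # everything else direct
]

pkt = {"dst_ip": "10.0.0.5", "dst_port": 443}
out = next(r.out_port for r in rules if r.matches(pkt))
print(out)  # 3
```

Rule order matters here (first match wins), which is the same judgment call a controller has to make when it programs real hardware tables.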

Throwing out the traditional rules of forwarding is what OpenFlow does, and OpenFlow got a lot of the SDN ball rolling. However, SDN has gone far beyond just OpenFlow, so it's not fair to call it just OpenFlow anymore. And SDN is now far too broad a term to be specific to forwarding rules, as it encompasses a wide range of topics, from forwarding to automation/orchestration, APIs, network function virtualization (NFV), stack integration (compute, storage), etc. If you take a look at the shipping SDN product listing, many of the products don't have anything to do with actually forwarding traffic. Also, FabricPath, TRILL, VCS, and SPB all do Ethernet forwarding in a non-standard way, but that's different. So herein lies, I think, the need to define a new term.

And make no mistake, SDF (or whatever we end up calling it) is going to be an important factor in what pushes SDN from hype to implementation.

Forwarding is simply how we get a packet from here to there. In traditional networking, forwarding rules are separated by layer. Layer 2 Ethernet forwarding works in one particular way (and not a particularly intelligent way); IP routing works in another way. Layer 4 (TCP/UDP) gets a little tricky, as switches and routers typically don't forward traffic based on Layer 4 headers. You can do something like policy-based routing, but it's a bit cumbersome to set up. You also have load balancers handling some Layer 4-7 features, but that's handled in a separate manner.

So traditional forwarding for Layer 2 and Layer 3 hasn't changed much. For instance, take the example of a server plugging into a switch and powering up. Its IP address and MAC address haven't been seen on the network, so the network is unaware of both its Layer 2 and Layer 3 addresses. The server is plugged into a switch port set for the right VLAN. The server wants to send a packet out to the internet, so it sends an ARP request (WHO HAS) for its default gateway. The router (or more likely, SVI) responds with its MAC address, or the floating MAC (VMAC) of a first-hop redundancy protocol such as VRRP or HSRP. Every switch attached to that VLAN sees the ARP and the ARP response, and every switch's Layer 2 forwarding table learns which port that particular MAC address can be found on.
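Here's a toy model in Python of that flood-and-learn behavior, simplified to a single switch with one MAC table (all MACs and port numbers are made up):

```python
# Toy flood-and-learn switch: learn the source MAC on every frame,
# flood when the destination is unknown, unicast once it's learned.

class Switch:
    def __init__(self, n_ports):
        self.mac_table = {}      # MAC -> port
        self.n_ports = n_ports

    def receive(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port        # learn on every frame
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]     # known unicast
        # unknown/broadcast: flood out every port except the ingress
        return [p for p in range(self.n_ports) if p != in_port]

sw = Switch(4)
print(sw.receive("AA", "FF:FF", 0))  # broadcast ARP: flooded to [1, 2, 3]
print(sw.receive("BB", "AA", 2))     # ARP reply: unicast back to [0]
```

Every switch on the VLAN repeats this independently, which is exactly why unknown-unicast and broadcast traffic gets everywhere in a traditional Layer 2 domain.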


In several of the SDF implementations that I’m familiar with, such as Cisco’s ACI, NEC’s OpenFlow controller, and to a certain degree Juniper’s QFabric, things get a little weird.


Forwarding in SDF/SDN Gets a Little Weird

In SDF (or at least in many implementations of it; they differ in how they accomplish this), the local ToR/leaf switch is what answers the server's ARP. Another server, on the same subnet and Layer 2 segment (or, increasingly, the same tenant), ARPs on a different leaf switch.


In the diagram above, note that both servers have the same default gateway. The two servers are Layer 2 adjacent, existing on the same network segment (same VLAN or VXLAN). Both ARP for their default gateways, and both receive a response from their local leaf switch with an identical MAC address. Neither server would actually see the other's ARP, because there's no need to flood that ARP traffic (or most other BUM traffic) beyond the port the server is connected to. BUM traffic goes to the local leaf, and the leaf can learn and forward in a more intelligent manner. The other leaf nodes don't need to see the flooded traffic, and a centralized controller can program the forwarding tables of the leafs and spines accordingly.
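A rough Python sketch of how that local ARP answering might work. This is my assumed model based on the description above, not any vendor's implementation; the gateway IP and virtual MAC are made up.

```python
# Sketch: every leaf answers gateway ARPs locally with the same anycast
# virtual MAC, so the ARP never needs to flood beyond the local leaf.

GATEWAY_MAC = "00:00:de:ad:be:ef"   # identical virtual MAC on every leaf

class Leaf:
    def __init__(self, gateway_ip):
        self.gateway_ip = gateway_ip

    def handle_arp(self, target_ip):
        if target_ip == self.gateway_ip:
            return GATEWAY_MAC       # answered locally; nothing flooded
        return None                  # would fall back to a controller lookup

leaf1, leaf2 = Leaf("10.1.1.1"), Leaf("10.1.1.1")
# Servers on different leafs get the identical MAC from their *local* leaf:
print(leaf1.handle_arp("10.1.1.1") == leaf2.handle_arp("10.1.1.1"))  # True
```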

In most cases, the packet's forwarding decision, based on a combination of Layer 2, 3, 4, and possibly more (such as VXLAN), is made at the leaf. If the packet crosses a Layer 3 boundary, the TTL gets decremented. The forwarding is typically determined at the leaf, and some sort of label or header is applied that contains the destination port.

You have this somewhat with a full Layer 3 leaf/spine mesh, in that the local leaf is what answers the ARP. However, in a Layer 3 mesh, hosts connected to different leaf switches are on different network segments, and the leaf is the default gateway (without getting weird). In some applications, such as Hadoop, that's fine. But for virtualization there's (unfortunately) a huge need for Layer 2 adjacencies. For now, at least.

Another benefit of SDF is the ability to intelligently steer traffic through various network devices, known as service chaining. This is done without changing the default gateways of the servers, bridging VLANs, proxy ARP, or other current methodologies. Since SDF throws out the rulebook in terms of forwarding, traffic steering becomes a much simpler matter. Cisco's ACI does this, as do Cisco vPath and VMware's NSX.
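Here's a highly simplified Python sketch of the service-chaining idea: a controller-programmed policy is just an ordered list of services that matching traffic must pass through. The chain contents and the port-80 match are hypothetical examples.

```python
# Sketch: a policy chain for web traffic. Matching packets get steered
# through each service in order; everything else goes straight through.

chain = ["firewall", "load-balancer"]   # hypothetical policy

def path_for(pkt):
    if pkt.get("dst_port") == 80:       # traffic matching the policy
        return chain + ["web-server"]
    return ["web-server"]               # non-matching traffic goes direct

print(path_for({"dst_port": 80}))  # ['firewall', 'load-balancer', 'web-server']
print(path_for({"dst_port": 22}))  # ['web-server']
```

Note that nothing about the servers changed: no gateway tricks, no VLAN bridging. The steering lives entirely in the controller's policy.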


A policy, programmed on a central controller, can be put in place to ensure that traffic forwards through a load balancer and firewall. This also has a lot of potential in the realm of multi-tenancy and network function virtualization. In short, combined with other aspects of SDN, it can change the game in terms of how networks are architected in the near future.

SDF is only a part of SDN. By itself, it's compelling, but some solutions have been on the market for a little while now and it doesn't seem to be a "must have" to the point where customers are upending their data center infrastructures to replace everything. I think for it to be a must-have, it needs to be part of a larger SDN story.

The Twilight of the Age of Conf T

That sums up the networking world as it exists today. Conf T.

On Cisco gear (and a lot of gear that isn't Cisco), that's the command you type to go into configuration mode. It's so ingrained in our muscle memory that it's probably the quickest thing any network engineer can type.

On Nexus gear, which runs NX-OS, you don't need to type the "t" in "conf t": typing "conf" alone will get you into configuration mode. The same goes for most other CLIs that employ the "industry standard" CLI that everyone (including Cisco) appropriated.
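A toy Python model of how that "industry standard" CLI abbreviation works: any unambiguous prefix of a command is accepted, which is why "conf" alone does the job. The command list is just an example.

```python
# Toy CLI abbreviation: an unambiguous prefix resolves to its command.

commands = ["configure", "copy", "clear", "show"]

def expand(prefix):
    hits = [c for c in commands if c.startswith(prefix)]
    return hits[0] if len(hits) == 1 else None   # None means ambiguous

print(expand("conf"))  # configure
print(expand("c"))     # None (could be configure, copy, or clear)
```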

Yet most of us have it so ingrained in our muscle memory we can’t type “conf” without throwing the “t” at the end. (I had to edit that sentence just to get the “t” out of the first “conf”…. dammit!)

But is that age ending?

VMware released NSX, and other companies are releasing their versions of SDN, SDDC, and… whatever. These are networks that, more and more, will be controlled not by manually punching commands into a CLI, but by GUIs and/or APIs.

We’ve configured networks for the past — what, 20 years — by starting out with “conf t”. And we’ve certainly heard more than one prediction of its demise that turned out to be a flash in the pan. However…

This certainly feels like it could be the beginning of the end of the conf t age. How does something this ubiquitous end?

Gradually, then all of a sudden.

VXLAN: Millions or Billions?

I was putting slides together for my upcoming talk, and there's some confusion about VXLAN: in particular, how many VLANs it provides.

The VXLAN header provides a 24-bit identifier called the VNI (VXLAN Network Identifier) to separate out tenant segments, which gives about 16 million of them. And that's the number I see quoted with regard to VXLAN (and NVGRE, which also has a 24-bit identifier). However, take a look at the entire VXLAN packet (from the VXLAN standard… whoa, how did I get so sleepy?):


Tony’s amazing Technicolor Packet Nightmare

The full 802.1Q Ethernet frame can be encapsulated, providing the full 12-bit 4096 VLANs per VXLAN. 16 million multiplied by 4096 is about 68 billion (with a “B”). However, most material discussing VXLAN refers to 16 million.
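The arithmetic, spelled out in Python:

```python
# 24-bit VNI times 12-bit VLAN ID: the "16 million vs 68 billion" math.
vnis = 2 ** 24         # VXLAN Network Identifier space
vlans = 2 ** 12        # 802.1Q VLAN ID space per VXLAN segment
print(vnis)            # 16777216  -> the "16 million" number
print(vnis * vlans)    # 68719476736 -> the "68 billion" number
```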

So is it 16 million or 68 billion?


The answer is: Yes?

So according to the standard, each VXLAN segment is capable of carrying an 802.1Q-encoded Ethernet frame, so each VXLAN can have a total of 4096(ish) VLANs. The question is whether this is actually feasible. Can we run multiple VLANs over a VXLAN? Or is each VXLAN only realistically going to carry a single (presumably untagged) VLAN?

I think much of this depends on how smart the VTEP is. The VTEP is the termination point, the encap/decap point for the VXLANs. Regular frames enter a VTEP, get encapsulated, and are sent over the VXLAN overlay (a regular Layer 3 fabric) to another VTEP, the terminating endpoint, where they're decapsulated.

The trick is the MAC learning process of the VTEPs. Each VTEP is responsible for learning the local MAC addresses as well as the destination MAC addresses, just like a traditional switch’s CAM table. Otherwise, each VTEP would act kind of like a hub, and send every single unicast frame to every other VTEP associated with that VXLAN.

What I’m wondering is, do VTEPs keep separate MAC tables per VLAN?

I'm thinking it must keep a per-VLAN table, because what happens if we have the same MAC address in two different VLANs? A rare occurrence, to be sure, but I don't think it violates any standards (I could be wrong on that). If it only keeps a single MAC table for all VLANs, then we really can't run multiple VLANs per VXLAN. But I imagine it has to keep a separate table per VLAN. Or at least, it should.
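In Python terms, the idea would look something like this. This is an assumed model (not any VTEP implementation I've verified): if the table is keyed on (VNI, VLAN, MAC) together, the same MAC in two VLANs no longer collides.

```python
# Sketch: VTEP MAC learning keyed on (VNI, VLAN, MAC) instead of MAC alone.
# VNI values and VTEP names are made up.

table = {}

def learn(vni, vlan, mac, remote_vtep):
    table[(vni, vlan, mac)] = remote_vtep

learn(5000, 10, "00:11:22:33:44:55", "vtep-a")
learn(5000, 20, "00:11:22:33:44:55", "vtep-b")   # same MAC, different VLAN

print(table[(5000, 10, "00:11:22:33:44:55")])  # vtep-a
print(table[(5000, 20, "00:11:22:33:44:55")])  # vtep-b
```

With a MAC-only key, the second `learn()` would have overwritten the first, which is exactly the failure mode that would rule out multiple VLANs per VXLAN.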

I can't imagine there would be any situation where different tenants would get VLANs in the same VXLAN/VNI, so there are still 16 million multi-tenant segments; it's not exactly 68 billion usable VLANs. But each tenant might be able to have multiple VLANs.

Having tenants capable of having multiple VLAN segments may prove to be useful, though I doubt any tenant would have more than a handful of VLANs (perhaps DMZ, internal, etc.). I haven't played enough with VXLAN software yet to figure this one out, and discussions on Twitter (many thanks to @dkalintsev), while educational, haven't solidified the answer.

Software is the Hard Part

Boy, that escalated quickly

Well, that was unexpected. The Super-Duper big news Monday (July 23rd) is that VMware has purchased Nicira, and it seems (at least right now in the heat of the moment) that this is a game changer for the industry. I’m writing down my thoughts on this subject here, which may or may not turn out to be insightful. I’m just spitballing here. For some great perspectives check out Brad Casemore’s post and Greg Ferro’s post on Network Computing.

First, this could mark a turning point where networking moves from mostly a hardware game to mostly a software game. With Nicira, VMware has purchased arguably the most advanced control plane out there, by a long shot. And unlike Cisco's Insieme spin-in and Cisco OnePK, Nicira is a shipping, mature(-ish?) product with big-name customers (though how widely it's deployed is unknown).

Most are in agreement that this change from hardware to software was coming, but I think most (including myself) figured the change would take a few more years. Nicira had been an interesting curiosity until now: a few big-name customers, sure, but not much market penetration. With VMware having a presence in just about every data center in the world, Nicira can be pitched/adapted to a much, much wider audience.

Why the change from hardware to software? Mostly the commoditization of network hardware. Vendors like Broadcom and Intel (through the Fulcrum purchase) offer SoCs (switch-on-chip silicon) and other Ethernet silicon that can be (relatively) easily engineered into a switch with much less R&D than was previously possible. As has been mentioned, most of the network vendors now use these to build their switches. Cisco has been pretty much the lone holdout in this trend, continuing to invest in its own R&D, chipsets, and hardware. Even Juniper's ambitious QFabric is believed to run on top of Broadcom Trident+ chips.

This will be a challenge not just for Cisco, but for Juniper, Arista, etc., as the difference in capabilities and performance between their silicon and commodity silicon is declining. If they can't differentiate in hardware, they'll have to find new ways to differentiate.

Nicira could potentially take away one of those differentiators: software. Cisco, Juniper, Brocade, and others have been working on software differentiation. Cisco has several technologies, including FabricPath (very cool, but badly licensed); Juniper has QFabric; Brocade has VCS Fabric; etc. Cisco also has the Nexus 1000v, and while Nicira is not an immediate threat to it, it could potentially put a wrench in those gears.

Nicira is a very advanced control plane, in many ways very different from anything currently out there. Most people run the usual suspects for their control planes: spanning tree, OSPF, BGP, etc. Slightly more modern control planes are TRILL, SPB, VXLAN and NVGRE, OpenFlow, and Juniper's QFabric. But none of them tie it all together end-to-end quite like Nicira does. And now it belongs to VMware. And because it belongs to VMware, Nicira now has exposure into almost every data center on the planet.

In a Nicira SDN world, the only thing you'd need from a network vendor in a data center is a network built with OSPF and inexpensive Broadcom/Fulcrum-based switches. You wouldn't need TRILL/SPB, FabricPath, QFabric, VDS, or even spanning tree (since every pair of switches would be its own Layer 3 domain). The Nicira/SDN controller would create the overlay, using whatever overlay network technology (VXLAN/NVGRE/STT), between the virtual switches located in the hypervisors.

Especially in an SDN-dominated world, there's not much to differentiate on in hardware anymore, and software won't be much of a differentiator either, since one vendor's OSPF isn't going to be different from another's.

In terms of SDN, Cisco, Juniper, et al. are behind. For starters, neither Cisco nor Juniper has really led the SDN charge. They've both opened up (or announced that they will shortly open up) their switches and routers with APIs that allow SDN controllers to control them, but they both lack a controller of their own, and have for the most part seemed to take a defensive strategy against SDN (since it potentially disrupts them).

This could be tough for Juniper, Cisco, Arista, etc. to overcome. They're mostly geared as hardware companies, and turning them into full-fledged software companies will be a challenge. Insieme is supposed to be SDN-related, but who knows if it'll be an answer to Nicira, or something more anemic.

As with all of this, time will tell. Things are changing so fast, it's impossible to predict the future. But one thing I am fairly certain of: as Nokia and RIM have figured out, software is the hard part.