OpenFlow/SDN Won’t Scale?

I got in a conversation today on Twitter, talking about SDN/SDF (software defined forwarding), which is a new term I totally made up which I use to describe the programmatic and centralized control of forwarding tables on switches and multi-layer switches. The comment was made that OpenFlow in particular won’t scale, which reminded me of an article by Doug Gourlay of Arista talking about scalability issues with OpenFlow.

Doug’s argument is essentially that OpenFlow can’t keep up with the number of new flows in a network (check out points 2 and 3). In a given data center, there would be tens of thousands (or millions or tens of millions) of individual flows running through a network at any given moment. And by flows, I mean keeping track of stateful TCP connections or UDP pseudo-flows. The connection rate would also be pretty high if you’re talking dozens or hundreds of VMs, all taking in new connections.

My answer is that yeah, if you’re going to try to put the state of every TCP connection and UDP flow into the network operating system and into the forwarding tables of the devices, that’s not going to scale. I totally agree.

But why would you do that?

Why wouldn’t you, instead of keeping track of every flow, do destination-based forwarding table entries, which is how forwarding tables on switches are currently programmed? The network operating system/controller would learn (or be told about) the specific VMs and other devices within the data center. It could learn this through conversational learning, flooding, static entries (configured by hand), automated static entries (where an API programs it in such as connectivity through vCenter), or externally through traditional MAC flooding and routing protocols.

In that case, the rate of change in the forwarding tables would be relatively low, and not much different than how switches are currently programmed with Layer 3 routes and Layer 2 adjacencies with traditional methods. This would actually be more likely to shrink the size of the forwarding tables when compared with traditional Ethernet/IP forwarding, as the controller could intelligently prune the forwarding tables of the switches rather than flood and learn every MAC address on every switch that has a given VLAN configured (similar to what TRILL/FabricPath/VCS does).
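To put rough numbers on the difference, here is a minimal Python sketch that counts table entries both ways. The traffic mix (1,000 clients, 20 VMs, 5 connections each) and all the addresses are made up for illustration; the only point is that per-flow state grows with the number of connections, while destination-based state grows with the number of endpoints.

```python
from collections import namedtuple

# A flow is identified by the classic 5-tuple.
Flow = namedtuple("Flow", "src_ip dst_ip proto src_port dst_port")

def per_flow_entries(flows):
    """One table entry per distinct 5-tuple (the approach that won't scale)."""
    return set(flows)

def per_destination_entries(flows):
    """One table entry per distinct destination, however many flows reach it."""
    return {f.dst_ip for f in flows}

# Hypothetical traffic: 1,000 clients each open 5 connections to each of 20 VMs.
flows = [
    Flow(f"10.1.{c // 250}.{c % 250 + 1}", f"10.2.0.{vm + 1}", "tcp", 40000 + n, 443)
    for c in range(1000)
    for vm in range(20)
    for n in range(5)
]

print(len(per_flow_entries(flows)))         # entries if every flow is tracked
print(len(per_destination_entries(flows)))  # entries if only destinations are tracked
```

With these made-up numbers the per-flow table needs 100,000 entries and the destination-based table needs 20, and only the first one changes every time a client opens or closes a connection.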

We don’t track every TCP/UDP flow in a data center with our traditional networking, and I can’t think of any value-add to keeping track of every flow in a given network, even if you could. So why would OpenFlow or any other type of SDF do any different? We have roughly the same size tables, and we have the added benefit of including Layer 4, VXLAN, NVGRE, or even VLANs in the forwarding decisions.

I honestly don’t know if keeping track of every flow was the original concept with OpenFlow (I can’t imagine it would be, but there are a lot of gaps in my OpenFlow knowledge), but it seems an API that programs a forwarding table could be made to do so without keeping track of every gosh darn flow.

Software Defined Forwarding: What’s In A Name?

I’ve got about four or five articles on SDN/ACI and networking in my drafts folder, and there’s something that’s been bothering me. There’s a concept, and it doesn’t have a name (at least one that I’m aware of, though it’s possible there is). In networking, we’re constantly flooded with a barrage of new names. Sometimes we even give a new name to things that already have one. But naming things is important. It’s like a pointer in programming: the name of something is a pointer to the specific part of our brain that contains the understanding of a given concept.

Software Defined Forwarding

I decided SDF needed a name because I was having trouble describing a concept, a very important one, in reference to SDN. And here’s the concept, with four key aspects:

  1. Forwarding traffic in a way that is different than traditional MAC learning/IP forwarding.
  2. Forwarding traffic based on more than the usual Layer 2 and Layer 3 headers (such as VXLAN headers, TCP and/or UDP headers).
  3. Programming these forwarding rules via a centralized controller (which would be necessary if you’re going to throw out all the traditional forwarding rules and configure this in any reasonable amount of time).
  4. In the case of Layer 2 adjacencies, multipathing is inherent (either through overlays or something like TRILL).
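The first three points can be sketched as a small match/action table in Python. This is only an illustration and not how any particular product stores its rules: the `Rule` class, the field names (`vni`, `dst_ip`, `dst_port`, `dst_mac`), and the action strings are all hypothetical. A rule matches on whatever header fields it lists, and any field it leaves out is a wildcard.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int
    match: dict   # header field -> required value; fields not listed are wildcards
    action: str

def lookup(rules, packet):
    """Return the action of the highest-priority rule whose fields all match."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if all(packet.get(k) == v for k, v in rule.match.items()):
            return rule.action
    return "drop"

# Rules a central controller might push to a switch (hypothetical values).
rules = [
    Rule(100, {"vni": 5001, "dst_ip": "10.2.0.10", "dst_port": 443}, "to-load-balancer"),
    Rule(50, {"vni": 5001, "dst_ip": "10.2.0.10"}, "to-leaf-2"),
    Rule(10, {"dst_mac": "00:00:5e:00:01:01"}, "to-router"),
]

https = {"vni": 5001, "dst_ip": "10.2.0.10", "dst_port": 443}
ssh = {"vni": 5001, "dst_ip": "10.2.0.10", "dst_port": 22}
print(lookup(rules, https), lookup(rules, ssh))
```

The same destination gets two different forwarding decisions depending on the Layer 4 port, which is the kind of thing traditional MAC learning and IP routing don’t do on their own.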

Throwing out traditional rules of forwarding is what OpenFlow does, and OpenFlow got a lot of the SDN ball rolling. However, SDN has gone far beyond just OpenFlow, so it’s not fair to call it just OpenFlow anymore. And SDN is now far too broad of a term to be specific to forwarding rules, as SDN encompasses a wide range of topics, from forwarding to automation/orchestration, APIs, network function virtualization (NFV), stack integration (compute, storage), etc. If you take a look at the shipping SDN product listing, many of the products don’t have anything to do with actually forwarding traffic. Also, technologies like FabricPath, TRILL, VCS, and SPB all do Ethernet forwarding in a non-standard way, but this is different. So herein lies the need, I think, to define a new term.

And make no mistake, SDF (or whatever we end up calling it) is going to be an important factor in what pushes SDN from hype to implementation.

Forwarding is simply how we get a packet from here to there. In traditional networking, forwarding rules are separated by layer. Layer 2 Ethernet forwarding works in one particular way (and not a particularly intelligent way), and IP routing works in another way. Layer 4 (TCP/UDP) gets a little tricky, as switches and routers typically don’t forward traffic based on Layer 4 headers. You can do something like policy based routing, but it’s a bit cumbersome to set up. You also have load balancers handling some Layer 4-7 features, but that’s handled in a separate manner.

So traditional forwarding for Layer 2 and Layer 3 hasn’t changed much. For instance, take the example of a server plugging into a switch and powering up. Its IP address and MAC address haven’t been seen on the network, so the network is unaware of both the Layer 2 and Layer 3 addresses. The server is plugged into a switch with a port set for the right VLAN. The server wants to send a packet out to the internet, so it ARPs (WHO HAS 1.1.1.1, for example). The router (or more likely, SVI) responds with its MAC address, or the floating MAC (VMAC) of the first-hop redundancy protocol, such as VRRP or HSRP. Every switch attached to that VLAN sees the ARP request (the unicast response is only seen by the switches along its path), and each switch’s Layer 2 forwarding table learns which port to find the source MAC address on.
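The flood-and-learn behavior in that example can be sketched in a few lines of Python. This is a toy model of a single switch in a single VLAN; the class name, port numbers, and MAC addresses are invented for the example.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

class LearningSwitch:
    def __init__(self, name, ports):
        self.name = name
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source, then return the set of ports to send the frame out of."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            return self.ports - {in_port}  # unknown or broadcast: flood
        return {out_port}                  # known unicast: one port only

switch = LearningSwitch("access-1", ports=[1, 2, 3])
server, gateway = "00:00:00:00:00:0a", "00:00:00:00:00:01"

# The server (port 1) ARPs: the broadcast is flooded out every other port.
arp_out = switch.receive(1, server, BROADCAST)
# The gateway (port 3) answers: the switch already knows where the server is.
reply_out = switch.receive(3, gateway, server)
print(arp_out, reply_out, switch.mac_table)
```

Every switch in the VLAN runs this same logic independently, which is why every one of them ends up carrying an entry for every MAC address it has heard from.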

tradcoreaccess

In several of the SDF implementations that I’m familiar with, such as Cisco’s ACI, NEC’s OpenFlow controller, and to a certain degree Juniper’s QFabric, things get a little weird.

dogsandcats

Forwarding in SDF/SDN Gets A Little Weird

In SDF (or at least in many implementations of it; they differ in how they accomplish this), the local ToR/leaf switch is what answers the server’s ARP. Another server, on the same subnet and L2 segment (or increasingly a tenant), ARPs on a different leaf switch.

leafspinecray

In the diagram above, note both servers have a default gateway of 1.1.1.1. The two servers are Layer 2 adjacent, existing on the same network segment (same VLAN or VXLAN). Both ARP for their default gateways, and both receive a response from their local leaf switch with an identical MAC address. Neither server would actually see the other’s ARP, because there’s no need to flood that ARP traffic (or most other BUM traffic) beyond the port that the server is connected to. BUM traffic goes to the local leaf, and the leaf can learn and forward in a more intelligent manner. The other leaf nodes don’t need to see the flooded traffic, and a centralized controller can program the forwarding tables of the leafs and spines accordingly.
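Here is a minimal sketch of that leaf behavior in Python, under some loud assumptions: the `Leaf` class is hypothetical, the shared gateway MAC is a made-up value, and only the gateway ARP is modeled. Real implementations differ in how they handle everything else.

```python
GATEWAY_IP = "1.1.1.1"
GATEWAY_MAC = "02:00:00:00:00:01"  # hypothetical MAC shared by every leaf

class Leaf:
    def __init__(self, name):
        self.name = name

    def handle_arp(self, in_port, target_ip):
        """Answer gateway ARPs locally, back out the port they arrived on.

        Returns (out_port, mac) for the reply, or None for targets this
        sketch does not model.
        """
        if target_ip == GATEWAY_IP:
            return in_port, GATEWAY_MAC
        return None

leaf_a, leaf_b = Leaf("leaf-a"), Leaf("leaf-b")

# Two Layer 2 adjacent servers on different leaves ARP for the same gateway.
reply_a = leaf_a.handle_arp(in_port=7, target_ip=GATEWAY_IP)
reply_b = leaf_b.handle_arp(in_port=12, target_ip=GATEWAY_IP)
print(reply_a, reply_b)
```

Both servers get the same MAC address back, each reply goes out only the port the request came in on, and neither leaf had to flood anything to the other.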

In most cases, the packet’s forwarding decision, based on a combination of Layer 2, 3, 4 and possibly more (such as VXLAN), is made at the leaf. If it’s moving past a Layer 3 segment, the TTL gets decremented. Some sort of label or header is then applied that contains the packet’s destination port.

You have this somewhat with a full Layer 3 leaf/spine mesh, in that the local leaf is what answers the ARP. However, in a Layer 3 mesh, hosts connected to different leaf switches are on different network segments, and the leaf is the default gateway (without getting weird). In some applications, such as Hadoop, that’s fine. But for virtualization (unfortunately) there’s a huge need for Layer 2 adjacencies. For now, at least.

Another benefit of SDF is the ability to intelligently steer traffic through various network devices, known as service chaining. This is done without changing the default gateways of the servers, bridging VLANs and using proxy ARP, or other current methodologies. Since SDF throws out the rulebook in terms of forwarding, it becomes a much simpler matter to perform traffic steering. Cisco’s ACI does this, as does Cisco vPath and VMware’s NSX.

servicechaining

A policy, programmed on a central controller, can be put in place to ensure that traffic forwards through a load balancer and firewall. This also has a lot of potential in the realm of multi-tenancy and network function virtualization. In short, combined with other aspects of SDN, it can change the game in terms of how networks are architected in the near future.
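As a rough illustration, a service-chaining policy can be thought of as an ordered list of services attached to a source/destination pair. The sketch below is Python with invented group names and an assumed ordering (load balancer first, then firewall); it is not the policy model of ACI, vPath, or NSX.

```python
# Policy held by the central controller (hypothetical names):
# (source group, destination group) -> ordered list of services to traverse.
POLICIES = {
    ("web-clients", "app-servers"): ["load-balancer", "firewall"],
}

def path_for(src_group, dst_group, dst_host):
    """Return the hops a packet should visit, ending at the destination host."""
    chain = POLICIES.get((src_group, dst_group), [])
    return chain + [dst_host]

print(path_for("web-clients", "app-servers", "app-01"))
print(path_for("backup", "app-servers", "app-01"))
```

The servers never change their default gateway; the chain lives entirely in the policy, and traffic with no matching policy goes straight to the destination.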

SDF is only a part of SDN. By itself, it’s compelling, but as there have been some solutions on the market for a little while, it doesn’t seem to be “must have” to the point where customers are upending their data center infrastructures to replace everything. I think for it to be a must have, it needs to be a part

Death To vMotion

There are very few technologies in the data center that have had as significant an impact as VMware’s vMotion. It allowed us to de-couple operating system and server operations. We could maintain, update, and upgrade the underlying compute layer without disturbing the VMs that ran on it. We could write web applications in the same model that we were used to, when we wrote them for specific physical servers. From an application developer perspective, nothing needed to be changed. From a system administrator perspective, it helped make (virtual) server administration easier and more flexible. vMotion helped us move almost seamlessly from the physical world to the virtualization world with nary a hiccup. Combined with HA and DRS, it’s made VMware billions of dollars.

And it’s time for it to go.

From a networking perspective, vMotion has wreaked havoc on our data center designs. Starting in the mid 2000s, we all of a sudden needed to build huge Layer 2 networking domains, instead of beautiful and simple Layer 3 fabrics. East-West traffic went insane. With multi-layer switches (Ethernet switches that could route as fast as they could switch), we had just gotten to the point where we could build really fast Layer 3 fabrics, and get rid of spanning-tree. vMotion required us to undo all that, and go back to Layer 2 everywhere.

But that’s not why it needs to go.

Redundant data centers and/or geographic diversification is another area that vMotion is being applied to. Having the ability to shift a workload from one data center to another is one of the holy grails of data centers, but to accomplish this we need Layer 2 data center interconnects (DCI), with technologies like OTV, VPLS, EoVPLS, and others. There’s also a distance limitation, as the latency between two data centers needs to be 10 milliseconds or less. And since light can only travel so far in 10 ms, there is a fairly limited distance that you can effectively vMotion (200 kilometers, or a bit over 120 miles). That is, unless you have a Stargate.

DCI

You do have a Stargate in your data center, right?

And that’s just getting a VM from one data center to another, which someone described to me once as a parlor trick. By itself, it serves no purpose to move a VM from one data center to another. You have to get its storage over as well (VMDK files if you’re lucky, raw LUNs if you’re not) and deal with the traffic tromboning problem from one data center to another.

The IP address is still coupled to the server (identity and location are coupled in normal operations, something LISP is meant to address), so traffic still comes to the server via the original data center, traverses the DCI, then the server responds through its default gateway, which is still likely the original data center. All that work to get a VM to a different data center, wasted.

trombone2

All for one very simple reason: A VM needs to keep its IP address when it moves. It’s IP statefulness, and there are various solutions that attempt to address the limitations of IP statefulness. Some DCI technologies like OTV will help keep default gateways local to each data center, so when a server responds it at least doesn’t trombone back through the original data center. LISP is (another) overlay protocol meant to decouple the location from the identity of a VM, helping with mobility. But as you add all these stopgap solutions on top of each other, it becomes more and more cumbersome (and expensive) to manage.

All of this because a VM doesn’t want to give its IP address up.

But that isn’t the reason why we need to let go of vMotion.

The real reason why it needs to go is that it’s holding us back.

Do you want to really scale your application? Do you want to have fail-over from one data center to another, especially over distances greater than 200 kilometers? Do you want to be able to “follow the Sun” in terms of moving your workload? You can’t rely on vMotion. It’s not going to do it, even with all the band-aids meant to help it.

The sites that are doing this type of scaling are not relying on vMotion, they’re decoupling the application from the VM. It’s the metaphor of pets versus cattle (or as I like to refer to it, bridge crew versus redshirts). Pets is the old way, the traditional virtualization model. We care deeply what happens to a VM, so we put in all sorts of safety nets to keep that VM safe. vMotion, HA, DRS, even Fault Tolerance. With cattle (or redshirts), we don’t really care what happens to the VMs. The application is decoupled from the VM, and state is not solely stored on a single VM. The “shopping cart” problem, familiar to those who work with load balancers, isn’t an issue. So a simple load balancer is all that’s required, and can send traffic to another server without disrupting the user experience. Any VM can go away at any level (database, application, presentation/web layer) and the user experience will be undisturbed. We don’t shed a tear when a redshirt bites it, thus vMotion/HA/DRS are not needed.
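A minimal Python sketch of the redshirt model, with everything in it hypothetical: a shared session store stands in for whatever replicated data layer the application uses, and the load balancer is plain round-robin. The point is only that the shopping cart lives outside any individual VM.

```python
class SessionStore:
    """Stand-in for a replicated store that lives outside any single VM."""
    def __init__(self):
        self.carts = {}

    def add(self, user, item):
        self.carts.setdefault(user, []).append(item)

    def get(self, user):
        return list(self.carts.get(user, []))

class LoadBalancer:
    """Plain round-robin: no stickiness needed, because servers hold no state."""
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def pick(self):
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

    def remove(self, server):
        self.servers.remove(server)

def handle(lb, store, user, item=None):
    """Serve a request on whichever server the load balancer picks."""
    server = lb.pick()
    if item is not None:
        store.add(user, item)
    return server, store.get(user)

store = SessionStore()
lb = LoadBalancer(["web-1", "web-2", "web-3"])

first_server, _ = handle(lb, store, "alice", "tricorder")
lb.remove(first_server)  # the redshirt bites it
second_server, cart = handle(lb, store, "alice", "phaser")
print(first_server, second_server, cart)
```

The VM that took the first request is gone by the time the second one arrives, and the cart is still intact.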

If you write your applications and build your application stack as if vMotion didn’t exist, scaling, redundancy, and geographic diversification get a lot easier. If your platform requires Layer 2 adjacency, you’re doing it wrong (and you’ll be severely limited in how you can scale).

And don’t take my word for it. Take a look at any of the huge web sites, Netflix, Twitter, Facebook: They all shard their workloads across the globe and across their infrastructure (or Amazon’s). Most of them don’t even use virtualization. Traditional servers sitting behind a load balancer with an active/standby pair of databases on the back-end isn’t going to cut it.

When you talk about sharding, make sure people know it’s spelled with a “D”. 

If you write an application on Amazon’s AWS, you’re probably already doing this since there’s no vMotion in AWS. If an Amazon data center has a problem, as long as the application is architected correctly (again, done on the application itself), then I can still watch my episodes of Star Trek: Deep Space 9. It takes more work to do it this way, but it’s a far more effective way to scale/diversify your geography.

It’s much easier (and quicker) to write a web application for the traditional model of virtualization. And most sites’ first outing will probably be done in this way. But if you want to scale, it will be way easier (and more effective) to build and scale an application that doesn’t depend on vMotion.

VMware’s vMotion (and Live Migration, and other similar technologies by other vendors) had their place, and they helped us move from the physical to the virtual. But now it’s holding us back, and it’s time for it to go.

The Twilight of the Age of Conf T

That sums up the networking world as it exists today. Conf T.

On Cisco gear (and on a lot of gear that isn’t Cisco), that’s the command you type to go into configuration mode. It’s so ingrained in our muscle memory it’s probably the quickest thing any network engineer can type.

On Nexus gear, which runs NX-OS, you don’t need to type the “t” in “conf t”. Typing “conf” will get you into configuration mode, no “T” is required. Same for most other CLIs that employ the “industry standard” CLI that everyone (including Cisco) appropriated.
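The reason “conf” works at all is that these CLIs accept any abbreviation that is unambiguous. A toy Python sketch of that unique-prefix matching, with a made-up command list (a real CLI has far more commands and does this per keyword):

```python
COMMANDS = ["clear", "configure", "copy", "show"]  # hypothetical command set

def resolve(word, commands=COMMANDS):
    """Expand an abbreviation to the one command it is a prefix of."""
    if word in commands:
        return word
    matches = [c for c in commands if c.startswith(word)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {word!r}")
    raise ValueError(f"ambiguous command: {word!r} could be {matches}")

print(resolve("conf"))  # the only command starting with "conf"
print(resolve("sh"))    # the only command starting with "sh"
```

Typing just “c” or “co” would fail in this sketch, since more than one command starts that way.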

Yet most of us have it so ingrained in our muscle memory we can’t type “conf” without throwing the “t” at the end. (I had to edit that sentence just to get the “t” out of the first “conf”…. dammit!)

But is that age ending?

VMware released NSX, and other companies are releasing their versions of SDN, SDDC, and... whatever. These are networks that more and more will be controlled not by manually punching commands into a CLI, but rather through GUIs and/or APIs.

We’ve configured networks for the past — what, 20 years — by starting out with “conf t”. And we’ve certainly heard more than one prediction of its demise that turned out to be a flash in the pan. However…

This certainly feels like it could be the beginning of the end of the conf t age. How does something this ubiquitous end?

Gradually, then all of a sudden.

Link Aggregation Confusion

In a previous article, I discussed the somewhat pedantic question: “What’s the difference between EtherChannel and port channel?” The answer, as it turns out, is none. EtherChannel is mostly an IOS term, and port channel is mostly an NXOS term. But either is correct.

But I did get one thing wrong. I was using the term LAG incorrectly. I had assumed it was short for Link Aggregation (the umbrella term of most of this). But in fact, LAG is short for Link Aggregation Group, which is a particular instance of link aggregation, not the umbrella term. So wait, what do we call the technology that links links together?

saymyname

LAG? Link Aggregation? No wait, LACP. It’s gotta be LACP.

In case you haven’t noticed, the terminology for one of the most critical technologies in networking (especially the data center) is still quite murky.

Before you answer that, let’s throw in some more terms, like LACP, MLAG, MC-LAG, VLAG, 802.3ad, 802.1AX, link bonding, and more.

The term “link aggregation” can mean a number of things. Certainly EtherChannel and port channels are a form of link aggregation. 802.3ad and 802.1AX count as well. Wait, what’s 802.1AX?

802.3ad versus 802.1AX

What is 802.3ad? It’s the old IEEE designation for what is now known as 802.1AX. The standard that we often refer to colloquially as port channels, EtherChannels, and link aggregation was moved from the 802.3 working group to the 802.1 working group sometime in 2008. However, it is sometimes still referred to as 802.3ad. Or LAG. Or link aggregation. Or link group things. Whatever.

spaceghost

What about LACP? LACP is part of the 802.1AX standard, but it is neither the entirety of the 802.1AX standard, nor is it required in order to stand up a LAG.  LACP is also not link aggregation. It is a protocol to build LAGs automatically, versus static. You can usually build an 802.1AX LAG without using LACP. Many devices support static and dynamic LAGs. VMware ESXi 5.0 only supported static LAGs, while ESXi 5.1 introduced LACP as a method as well.

Some devices only support dynamic LAGs, while some only support static. For example, Cisco UCS fabric interconnects require LACP in order to set up a LAG (the alternative is to use pinning, which is another type of link aggregation, but not 802.1AX). The discontinued Cisco ACE 4710 doesn’t support LACP at all; only static LAGs are supported.

One way to think of LACP is that it is a control-plane protocol, while 802.1AX is a data-plane standard. 

Is Cisco’s EtherChannel/port channel proprietary?

As far as I can tell, no, they’re not. There’s no difference (functional, at least) between 802.3ad/802.1AX and what Cisco calls EtherChannel/port channel, and you can set up LAGs between Cisco and non-Cisco gear without any issue. PAgP (Port Aggregation Protocol), the precursor to LACP, was proprietary, but Cisco has mostly moved to LACP for its devices. Cisco Nexus kit won’t even support PAgP.

Even in LACP, there’s no method for negotiating the load distribution method. Each side picks which method it wants to do. In fact, you don’t have to have the same load distribution method configured on both ends of a LAG (though it’s usually a good idea).
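To make that concrete, here is a small Python sketch of hash-based load distribution. It is an assumption-laden toy: real switches use their own hash functions and field combinations, and `zlib.crc32` is used here only because it is deterministic. Each end of the LAG hashes whichever header fields it was configured with and picks a member link for the frames it transmits.

```python
import zlib

def pick_link(frame, fields, links):
    """Hash the chosen header fields and map the result onto one member link."""
    key = "|".join(str(frame[f]) for f in fields).encode()
    return links[zlib.crc32(key) % len(links)]

links = ["eth1", "eth2", "eth3", "eth4"]
frame = {
    "src_mac": "00:00:00:00:00:0a", "dst_mac": "00:00:00:00:00:0b",
    "src_ip": "10.1.1.10", "dst_ip": "10.2.0.10",
}

# Hypothetical configuration: side A hashes on MACs, side B hashes on IPs.
side_a = pick_link(frame, ["src_mac", "dst_mac"], links)
side_b = pick_link(frame, ["src_ip", "dst_ip"], links)
print(side_a, side_b)
```

Nothing is negotiated here. Each side only decides which link it transmits on, so the two directions of the same conversation can land on different member links, and the LAG still works.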

There are also types of link aggregation that aren’t part of 802.1AX or any other standard. I group these into two types: Pinning, and fake link aggregation. Or FLAG (Fake Link Aggregation).

First, let’s talk about pinning. In Ethernet, we have the rule that there can’t be more than one way to get anywhere. Ethernet can’t handle multi-pathing, which is why we have spanning-tree and other tricks to prevent there from being more than one logical way for an Ethernet frame to get from one source MAC to a given destination MAC. Pinning is a clever way to get around this.

The most common place we tend to see pinning is in VMware. Most ESXi hosts have multiple connections to a switch. But it doesn’t have to be the same switch. And look at that, we can have multiple paths. And no spanning-tree protocol. So how do we not melt down the network?

The answer is pinning. VMware refers to this as load balancing by virtual port ID. Each VM’s vNIC has a virtual port ID, and that ID is pinned to one and only one of the external physical NICs (pNICs). To utilize all your links, you need at least as many virtual ports as you do physical ports. And load distribution can be an issue. But generally, this pinning works great. Cisco UCS also uses pinning for both Ethernet and Fibre Channel, when 802.1AX-style link aggregation isn’t used.

It works great, and it’s a fantastic way to get active/active links without running into spanning-tree issues, and it doesn’t require 802.1AX.
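A minimal Python sketch of the idea follows. The modulo mapping is an assumption chosen for simplicity, not VMware’s or Cisco’s actual placement algorithm, and the `vmnic` names are just example labels.

```python
def pin(virtual_port_ids, pnics):
    """Pin each virtual port to exactly one physical NIC (modulo is an assumption)."""
    return {vp: pnics[vp % len(pnics)] for vp in virtual_port_ids}

pnics = ["vmnic0", "vmnic1"]

# Four VMs spread across both uplinks.
print(pin([0, 1, 2, 3], pnics))

# A single VM only ever uses one uplink, no matter how many exist.
print(pin([0], pnics))

# If an uplink fails, the ports are simply re-pinned to what is left.
print(pin([0, 1, 2, 3], ["vmnic1"]))
```

Because any given vNIC’s MAC address only ever shows up on one uplink at a time, the upstream switches never see the same source MAC arriving from two directions, which is why no spanning-tree or switch-side LAG configuration is needed.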

Then there’s… a type of link aggregation that scares me. This is FLAG.

killitwithfire

Some operating systems such as FreeBSD and Linux support a weird kind of link aggregation where packets are sent out various active links, but only received on one link. It requires no special configuration on a switch, but the server is oddly blasting out packets on various switch ports. Transmit is active/active, but receive is active/standby.

What’s the point? I’d prefer active/standby in a more sane configuration.  I think it would make troubleshooting much easier that way.

There’s not much need for this type of fake link aggregation anymore. Most managed switches support 802.1AX, and end hosts either support the aforementioned pinning or they support 802.1AX well (LACP or static). So there are easier ways to do it.

So as you can see, link aggregation is a pretty broad term, too broad to encompass only what would be under the umbrella of 802.1AX, as it also includes pinning and Fake Link Aggregation. LAG isn’t a good term either, since it refers to a specific instance, and isn’t suited as the catch-all term for the methodology of inverse-multiplexing. 802.1AX is probably the best term, but it’s not widely known, and it also includes the optional LACP control plane protocol. Perhaps we need a new term. But if you’ve found the terms confusing, you’re not alone.

That Moment When You Realized You “write erase” the Wrong Device…

ohshitdata

EtherChannel and Port Channel

In the networking world, you’ve no doubt heard the terms EtherChannel, port channel, LAG, MLAG, etc. These of course refer to taking multiple Ethernet connections and treating them as a single link. But one of the more confusing aspects I’ve run into is what’s the difference, if any, between the term EtherChannel and port channel? Well, I’m here to break it down for you.

break-it-down

OK, not that kind of break-it-down

First, let’s talk about what is vendor-neutral and what is a Cisco trademark. EtherChannel is a Cisco trademarked term (I’m not sure if port channel is), while the vendor neutral term is LAG (Link Aggregation). Colloquially, however, I’ve seen both Cisco terms used with non-Cisco gear. For instance: “Let’s setup an Etherchannel between the Arista switch and the Juniper switch”. It’s kind of like in the UK using the term “hoovering” when the vacuum cleaner says Dyson on the side.

So what’s the difference between EtherChannel and port channel? That’s a good question. I used to think that EtherChannel was the name of the technology, and port channel was a single instance of that technology. But in researching the terms, it’s a bit more complicated than that.

Both Etherchannel and port channel appear in early Cisco documentation, such as this CatOS configuration guide. (Remember configuring switches with the “set” command?) In that document, it seems that port channel was used as the name of the individual instance of Etherchannel, just as I had assumed.

imright

I love it when I’m right

And that seems to hold true in this fairly recent document on Catalyst IOS 15, where EtherChannel is the technology and port channel is the individual instance.

But wait… in this older CatOS configuration guide, it explicitly states:

This document uses the term “EtherChannel” to refer to GEC (Gigabit EtherChannel), FEC (Fast EtherChannel), port channel, channel, and port group.

So it’s a bit murkier than I thought. And that’s just the IOS world. In the Nexus world, EtherChannel as a term seems to be falling out of favor.

Take a look at this Nexus 5000 CLI configuration guide for NX-OS 4.0, and you see they use the term EtherChannel. By NX-OS 5.2, the term seems to have changed to just port channel. In the great book NX-OS and Cisco Nexus Switching, port-channel is used as the term almost exclusively. EtherChannel is mentioned once that I can see.

So in the IOS world, it seems that EtherChannel is the technology, and port channel is the interface. In the Nexus world, port channel is used as the term for the technology and the individual interface, though sometimes EtherChannel is referenced.

It’s likely that port channel is preferred in the Nexus world because NX-OS is an offspring of SANOS, which Cisco initially developed for the MDS line of Fibre Channel switches. Bundling Fibre Channel ports on Cisco switches isn’t called EtherChannel, since those interfaces aren’t, well, Ethernet. The Fibre Channel bundling technology is instead called a SAN port channel. The command on a Nexus switch to look at a port channel is “show port-channel”, while on IOS switches it’s “show etherchannel”.

When a dual-homed technology was developed on the Nexus platform, it was called vPC (Virtual Port Channel) instead of VEC (Virtual EtherChannel).

Style Guide

Another interesting aspect to this discussion is that EtherChannel is capitalized as a proper noun, while port channel is not. In the IOS world, it’s EtherChannel, though when it’s even mentioned in the Nexus world, it’s sometimes Etherchannel, without the capital “C”. Port channel is often written as port channel or port-channel (the latter is used almost exclusively in the NX-OS book).

So where does that leave the discussion? Well, I think in very general terms, if you’re talking about Cisco technology, Etherchannel, EtherChannel, port channel, port-channel, and LAG are all acceptable terms for the same concept. When discussing IOS, it’s probably more correct to use the term EtherChannel. When discussing NX-OS, port channel. But again, either way would work.
