Link Aggregation Confusion

In a previous article, I discussed the somewhat pedantic question: “What’s the difference between EtherChannel and port channel?” The answer, as it turns out, is: there isn’t one. EtherChannel is mostly an IOS term, and port channel is mostly an NX-OS term. But either is correct.

But I did get one thing wrong. I was using the term LAG incorrectly. I had assumed it was short for Link Aggregation (the umbrella term of most of this). But in fact, LAG is short for Link Aggregation Group, which is a particular instance of link aggregation, not the umbrella term. So wait, what do we call the technology that links links together?

LAG? Link Aggregation? No wait, LACP. It’s gotta be LACP.

In case you haven’t noticed, the terminology for one of the most critical technologies in networking (especially the data center) is still quite murky.

Before you answer that, let’s throw in some more terms, like LACP, MLAG, MC-LAG, VLAG, 802.3ad, 802.1AX, link bonding, and more.

The term “link aggregation” can mean a number of things. Certainly EtherChannel and port channels are a form of link aggregation. 802.3ad and 802.1AX count as well. Wait, what’s 802.1AX?

802.3ad versus 802.1AX

What is 802.3ad? It’s the old IEEE working group for what is now known as 802.1AX. The standard that we often refer to colloquially as port channel, EtherChannels, and link aggregation was moved from the 802.3 working group to the 802.1 working group sometime in 2008. However, it is sometimes still referred to as 802.3ad. Or LAG. Or link aggregation. Or link group things. Whatever.

What about LACP? LACP is part of the 802.1AX standard, but it is neither the entirety of the 802.1AX standard, nor is it required in order to stand up a LAG. LACP is also not link aggregation. It is a protocol to build LAGs dynamically, versus configuring them statically. You can usually build an 802.1AX LAG without using LACP. Many devices support both static and dynamic LAGs. VMware ESXi 5.0 only supported static LAGs, while ESXi 5.1 introduced LACP as a method as well.
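On a Catalyst IOS switch, for instance, the static-versus-dynamic difference comes down to the channel-group mode (interface numbers here are hypothetical):

```
! Static LAG, no negotiation protocol:
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode on
!
! Dynamic LAG, negotiated via LACP:
interface range GigabitEthernet0/3 - 4
 channel-group 2 mode active
```

Either way, the result is the same 802.1AX-style bundle; LACP just verifies both ends agree before bringing the members up.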

Some devices only support dynamic LAGs, while some only support static. For example, Cisco UCS fabric interconnects require LACP in order to set up a LAG (the alternative is to use pinning, which is another type of link aggregation, but not 802.1AX). The discontinued Cisco ACE 4710 doesn’t support LACP at all; only static LAGs are supported.

One way to think of LACP is that it is a control-plane protocol, while 802.1AX is a data-plane standard. 

Is Cisco’s EtherChannel/port channel proprietary?

As far as I can tell, no, they’re not. There’s no (functional, at least) difference between 802.3ad/802.1AX and what Cisco calls EtherChannel/port channel, and you can set up LAGs between Cisco and non-Cisco gear without any issue. PAgP (Port Aggregation Protocol), the precursor to LACP, was proprietary, but Cisco has mostly moved to LACP for its devices. Cisco Nexus kit won’t even support PAgP.

Even with LACP, there’s no method for negotiating the load distribution method. Each side picks which method it wants to use. In fact, you don’t have to have the same load distribution method configured on both ends of a LAG (though it’s usually a good idea).
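As a sketch of what load distribution looks like, each side independently hashes frame fields down to one member link (a simplified model, not any vendor’s actual silicon hash):

```python
import hashlib

def pick_member(src_mac: str, dst_mac: str, num_links: int) -> int:
    """Hash the MAC pair down to one member link. Real switches hash
    in hardware and offer several field combinations (src-mac,
    dst-mac, src-dst-ip, etc.); this is just the idea."""
    digest = hashlib.md5((src_mac + dst_mac).encode()).digest()
    return digest[0] % num_links

# Every frame of a given flow hashes to the same member link, so
# frames within a flow are never reordered across links.
link = pick_member("00:11:22:33:44:55", "66:77:88:99:aa:bb", 4)
```

Because each end hashes independently, the same flow may ride different member links in each direction, which is why mismatched load distribution methods still interoperate.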

There are also types of link aggregation that aren’t part of 802.1AX or any other standard. I group these types of link aggregation into two categories: pinning, and fake link aggregation. Or FLAG (Fake Link Aggregation).

First, let’s talk about pinning. In Ethernet, we have the rule that there can’t be more than one way to get anywhere. Ethernet can’t handle multi-pathing, which is why we have spanning-tree and other tricks to prevent there from being more than one logical way for an Ethernet frame to get from one source MAC to a given destination MAC. Pinning is a clever way to get around this.

The most common place we tend to see pinning is in VMware. Most ESXi hosts have multiple connections to a switch. But it doesn’t have to be the same switch. And look at that, we can have multiple paths. And no spanning-tree protocol. So how do we not melt down the network?

The answer is pinning. VMware refers to this as load balancing by virtual port ID. Each VM’s vNIC has a virtual port ID, and that ID is pinned to one and only one of the external physical NICs (pNICs). To utilize all your links, you need at least as many virtual ports as you do physical ports. And load distribution can be an issue. But generally, this pinning works great. Cisco UCS also uses pinning for both Ethernet and Fibre Channel, when 802.1AX-style link aggregation isn’t used.
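A minimal sketch of the idea (the round-robin assignment is a hypothetical stand-in for VMware’s real placement logic; the key property is that each vNIC maps to exactly one pNIC):

```python
def pin_ports(virtual_port_ids, pnic_count):
    """Pin each virtual port ID to exactly one pNIC. No single
    vNIC's traffic ever splits across links, so Ethernet never
    sees multiple paths for one source MAC."""
    return {vp: vp % pnic_count for vp in virtual_port_ids}

# Eight virtual ports pinned across two physical NICs:
pinning = pin_ports(range(8), pnic_count=2)
```

With enough virtual ports, all pNICs carry traffic, even though no individual VM is load-balanced across links.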

It works great, and it’s a fantastic way to get active/active links without running into spanning-tree issues, and it doesn’t require 802.1AX.

Then there’s… a type of link aggregation that scares me. This is FLAG.

Some operating systems such as FreeBSD and Linux support a weird kind of link aggregation where packets are sent out various active links, but only received on one link. It requires no special configuration on a switch, but the server is oddly blasting out packets on various switch ports. Transmit is active/active, but receive is active/standby.
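On Linux, this behavior corresponds to the bonding driver’s balance-tlb mode: transmit is spread across all slaves, receive happens on the current active slave only (interface names here are hypothetical, and the slaves must be down before enslaving):

```
ip link add bond0 type bond mode balance-tlb
ip link set eth0 master bond0
ip link set eth1 master bond0
ip link set bond0 up
```

No switch-side configuration is needed, which is exactly why the switch sees the oddball behavior described above.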

What’s the point? I’d prefer active/standby in a more sane configuration.  I think it would make troubleshooting much easier that way.

There’s not much need for this type of fake link aggregation anymore. Most managed switches support 802.1AX, and end hosts either support the aforementioned pinning or they support 802.1AX well (LACP or static). So there are easier ways to do it.

So as you can see, link aggregation is a pretty broad term, too broad to encompass only what would be under the umbrella of 802.1AX, as it also includes pinning and Fake Link Aggregation. LAG isn’t a good term either, since it refers to a specific instance, and isn’t suited as the catch-all term for the methodology of inverse-multiplexing. 802.1AX is probably the best term, but it’s not widely known, and it also includes the optional LACP control plane protocol. Perhaps we need a new term. But if you’ve found the terms confusing, you’re not alone.

EtherChannel and Port Channel

In the networking world, you’ve no doubt heard the terms EtherChannel, port channel, LAG, MLAG, etc. These of course refer to taking multiple Ethernet connections and treating them as a single link. But one of the more confusing aspects I’ve run into is: what’s the difference, if any, between the terms EtherChannel and port channel? Well, I’m here to break it down for you.

OK, not that kind of break-it-down

First, let’s talk about what is vendor-neutral and what is a Cisco trademark. EtherChannel is a Cisco-trademarked term (I’m not sure if port channel is), while the vendor-neutral term is LAG (Link Aggregation). Colloquially, however, I’ve seen both Cisco terms used with non-Cisco gear. For instance: “Let’s set up an EtherChannel between the Arista switch and the Juniper switch.” It’s kind of like using the term “hoovering” in the UK when the vacuum cleaner says Dyson on the side.

So what’s the difference between EtherChannel and port channel? That’s a good question. I used to think that EtherChannel was the name of the technology, and port channel was a single instance of that technology. But in researching the terms, it’s a bit more complicated than that.

Both Etherchannel and port channel appear in early Cisco documentation, such as this CatOS configuration guide. (Remember configuring switches with the “set” command?) In that document, it seems that port channel was used as the name of the individual instance of Etherchannel, just as I had assumed.

I love it when I’m right

And that seems to hold true in this fairly recent document on Catalyst IOS 15, where EtherChannel is the technology and port channel is the individual instance.

But wait… in this older CatOS configuration guide, it explicitly states:

This document uses the term “EtherChannel” to refer to GEC (Gigabit EtherChannel), FEC (Fast EtherChannel), port channel, channel, and port group.

So it’s a bit murkier than I thought. And that’s just the IOS world. In the Nexus world, EtherChannel as a term seems to be falling out of favor.

Take a look at this Nexus 5000 CLI configuration guide for NX-OS 4.0, and you see they use the term EtherChannel. By NX-OS 5.2, the term seems to have changed to just port channel. In the great book NX-OS and Cisco Nexus Switching, port-channel is used as the term almost exclusively. EtherChannel is mentioned once that I can see.

So in the IOS world, it seems that EtherChannel is the technology, and port channel is the interface. In the Nexus world, port channel is used as the term for the technology and the individual interface, though sometimes EtherChannel is referenced.

It’s likely that port channel is preferred in the Nexus world because NX-OS is an offspring of SAN-OS, which Cisco initially developed for the MDS line of Fibre Channel switches. Bundling Fibre Channel ports on Cisco switches isn’t called EtherChannel, since those interfaces aren’t, well, Ethernet. The Fibre Channel bundling technology is instead called a SAN port channel. The command on a Nexus switch to look at a port channel is “show port-channel”, while on IOS switches it’s “show etherchannel”.

When a dual-homed technology was developed on the Nexus platform, it was called vPC (Virtual Port Channel) instead of VEC (Virtual EtherChannel).

Style Guide

Another interesting aspect to this discussion is that EtherChannel is capitalized as a proper noun, while port channel is not. In the IOS world, it’s EtherChannel, though when it’s even mentioned in the Nexus world, it’s sometimes Etherchannel, without the capital “C”. Port channel is often written as port channel or port-channel (the latter is used almost exclusively in the NX-OS book).

So where does that leave the discussion? Well, I think in very general terms, if you’re talking about Cisco technology, Etherchannel, EtherChannel, port channel, port-channel, and LAG are all acceptable terms for the same concept. When discussing IOS, it’s probably more correct to use the term EtherChannel. When discussing NX-OS, port channel. But again, either way would work.

VXLAN: Millions or Billions?

I was putting slides together for my upcoming talk, and noticed there’s some confusion about VXLAN; in particular, how many VLANs it provides.

The VXLAN header provides a 24-bit address space called the VNI (VXLAN Network Identifier) to separate out tenant segments; that’s 16 million identifiers. And that’s the number I see quoted with regards to VXLAN (and NVGRE, which also has a 24-bit identifier). However, take a look at the entire VXLAN packet (from the VXLAN standard… whoa, how did I get so sleepy?):

Tony’s amazing Technicolor Packet Nightmare

The full 802.1Q Ethernet frame can be encapsulated, providing the full 12-bit range of 4096 VLANs per VXLAN. 16 million multiplied by 4096 is about 68 billion (with a “B”). However, most material discussing VXLAN refers to 16 million.
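The arithmetic, plus the 8-byte VXLAN header layout from the standard (RFC 7348), can be sketched as:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header: flags byte with the I bit set
    (0x08), 24 reserved bits, the 24-bit VNI, and 8 reserved bits."""
    assert 0 <= vni < 2 ** 24
    return struct.pack("!II", 0x08 << 24, vni << 8)

# The headline numbers: a 24-bit VNI, times the 12-bit 802.1Q VLAN ID
# inside each encapsulated frame.
vnis = 2 ** 24                 # 16,777,216: the "16 million"
vlans_total = vnis * 2 ** 12   # 68,719,476,736: the "68 billion"
```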

So is it 16 million or 68 billion?

The answer is: Yes?

According to the standard, each VXLAN segment is capable of carrying an 802.1Q-encoded Ethernet frame, so each VXLAN can have a total of 4096(ish) VLANs. The question is whether or not this is actually feasible. Can we run multiple VLANs over VXLAN? Or is each VXLAN only going to realistically carry a single (presumably untagged) VLAN?

I think much of this depends on how smart the VTEP is. The VTEP is the termination point, the encap/decap point for the VXLANs. Regular frames enter a VTEP, get encapsulated, and are sent over the VXLAN overlays (regular Layer 3 fabric) to another VTEP, the terminating endpoint, where they are decapsulated.

The trick is the MAC learning process of the VTEPs. Each VTEP is responsible for learning the local MAC addresses as well as the destination MAC addresses, just like a traditional switch’s CAM table. Otherwise, each VTEP would act kind of like a hub, and send every single unicast frame to every other VTEP associated with that VXLAN.

What I’m wondering is, do VTEPs keep separate MAC tables per VLAN?

I’m thinking it must create a per-VLAN table, because what happens if we have the same MAC address in two different VLANs? A rare occurrence, to be sure, but I don’t think it violates any standards (could be wrong on that). If it only keeps a single MAC table for all VLANs, then we really can’t run multiple VLANs per VXLAN. But I imagine it has to keep a separate table per VLAN. Or at least, it should.
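A sketch of that idea: key the table on the (VLAN, MAC) pair rather than the MAC alone (a hypothetical model, not any particular VTEP implementation):

```python
# Hypothetical VTEP forwarding table keyed on (VLAN, MAC), so the
# same MAC can live in two VLANs without colliding.
mac_table: dict[tuple[int, str], str] = {}

def learn(vlan: int, mac: str, vtep: str) -> None:
    """Record which remote VTEP a (VLAN, MAC) pair was learned behind."""
    mac_table[(vlan, mac)] = vtep

learn(10, "00:11:22:33:44:55", "vtep-a")
learn(20, "00:11:22:33:44:55", "vtep-b")  # same MAC, different VLAN: fine
```

With a MAC-only key, the second `learn` would overwrite the first, and multiple VLANs per VXLAN would break.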

I can’t imagine there would be any situation where different tenants would get VLANs in the same VXLAN/VNI, so there are still 16 million multi-tenant segments; it’s not exactly 68 billion usable VLANs. But each tenant might be able to have multiple VLANs.

Having tenants capable of having multiple VLAN segments may prove to be useful, though I doubt any tenant would have more than a handful of VLANs (perhaps DMZ, internal, etc.). I haven’t played enough with VXLAN software yet to figure this one out, and discussions on Twitter (many thanks to @dkalintsev for great discussions), while educational, haven’t seemed to solidify the answer.

Ethernet Congestion: Drop It or Pause It

Congestion happens. You try to put a 10 pound (soy-based vegan) ham in a 5 pound bag, it just ain’t gonna work. And in the topsy-turvy world of data center switches, what do we do to mitigate congestion? Most of the time, the answer can be found in the wisdom of Snoop Dogg/Lion.

Of course, when things are fine, the world of Ethernet is live and let live.

We’re fine. We’re all fine here now, thank you. How are you?

But when push comes to shove, frames get dropped. Either the buffer fills up and tail drop occurs, or QoS is configured and something like WRED (Weighted Random Early Detection) kicks in to proactively drop frames before tail drop can occur (mostly to keep TCP flows from synchronizing and causing spiky behavior).
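A simplified sketch of WRED’s drop decision (the thresholds and maximum probability here are illustrative, not any platform’s defaults):

```python
import random

def wred_drop(avg_queue: float, min_th: float, max_th: float,
              max_p: float = 0.1) -> bool:
    """Simplified WRED: no drops below min_th, certain drop at or
    above max_th, and a linear probability ramp up to max_p between
    the two thresholds."""
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    drop_p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < drop_p
```

The random early drops nudge a few TCP flows to back off at different times, instead of every flow backing off at once when the tail drops hit.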

The Bit Grim Reaper is way better than leaky buckets

Most congestion remediation methods involve one or more types of dropping frames. The various protocols running on top of Ethernet, such as IP and TCP/UDP, as well as higher-level protocols, were written with this lossful nature in mind. Protocols like TCP have retransmission and flow control, and higher-level protocols that employ UDP (such as voice) have other ways of dealing with the plumbing getting stopped up. But dropping it like it’s hot isn’t the only way to handle congestion in Ethernet:

Please Hammer, Don’t PAUSE ‘Em

Ethernet has the ability to employ flow control on physical interfaces, so that when congestion is about to occur, the receiving port can signal to the sending port to stop sending for a period of time. This is referred to simply as 802.3x Ethernet flow control, or as I like to call it, old-timey flow control, as it’s been in Ethernet since about 1997. When a receive buffer is close to being full, the receiving side will send a PAUSE frame to the sending side.
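A PAUSE frame is tiny: a reserved multicast destination MAC, EtherType 0x8808 (MAC Control), opcode 0x0001, and a 16-bit pause time measured in quanta of 512 bit-times. A sketch of building one (padding and FCS omitted):

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast

def pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build a minimal 802.3x PAUSE frame. quanta is the pause time
    in units of 512 bit-times; quanta=0 means 'resume now'."""
    return (PAUSE_DST + src_mac
            + struct.pack("!HHH", 0x8808, 0x0001, quanta))

frame = pause_frame(bytes(6), 0xFFFF)  # pause for the maximum time
```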

Too legit to drop

A wide variety of Ethernet devices support old-timey flow control, everything from data center switches to the USB dongle for my MacBook Air.

One of the drawbacks of old-timey flow control is that it pauses all traffic, regardless of any QoS considerations. This creates a condition referred to as HoL (Head of Line) blocking, and can cause higher priority (and latency sensitive) traffic to get delayed on account of lower priority traffic. To address this, a new type of flow control was created called 802.1Qbb PFC (Priority Flow Control).

PFC allows a receiving port to send PAUSE frames that only affect specific CoS lanes (0 through 7). Part of the 802.1Q standard is a 3-bit field that represents the Class of Service, giving us a total of 8 classes of service, though two are traditionally reserved for control plane traffic, so we have six to play with (which, by the way, is a lot simpler than the 6-bit DSCP field in IP). Utilizing PFC, some CoS values can be made lossless, while others remain lossful.
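The CoS value lives in the 802.1Q TCI field: 3 bits of priority (PCP), 1 drop-eligible bit, and the 12-bit VLAN ID. A quick sketch of pulling those bits apart:

```python
def parse_tci(tci: int) -> tuple[int, int, int]:
    """Split a 16-bit 802.1Q TCI into (PCP, DEI, VLAN ID):
    3 bits of priority, 1 drop-eligible bit, 12 bits of VLAN."""
    return (tci >> 13) & 0x7, (tci >> 12) & 0x1, tci & 0xFFF

# CoS 5 (often used for voice) on VLAN 100:
pcp, dei, vid = parse_tci((5 << 13) | 100)
```

It’s that 3-bit PCP value that a PFC PAUSE frame targets, pausing one lane while the other seven keep flowing.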

Why would you want to pause traffic instead of drop traffic when congestion occurs?

Much of the IP traffic that traverses our data centers is OK with a bit of loss. It’s expected. Any protocol will have its performance degraded if packet loss is severe, but most traffic can take a bit of loss. And it’s not like pausing traffic will magically make congestion go away.

But there is some traffic that can benefit from losslessness, and some that just flat out requires it. FCoE (Fibre Channel over Ethernet), a favorite topic of mine, requires losslessness to operate. Fibre Channel is inherently a lossless protocol (by use of B2B, or Buffer-to-Buffer, credits), since the primary payload of an FC frame is SCSI. SCSI does not handle loss very well, so FC was engineered to be lossless. As such, priority flow control is one of the (several) requirements for a switch to be able to forward FCoE frames.

iSCSI is also a protocol that can benefit from pause congestion handling rather than dropping. Instead of encapsulating SCSI into FC frames, iSCSI encapsulates SCSI into TCP segments. This means that if a TCP segment is lost, it will be retransmitted. So at first glance it would seem that iSCSI can handle loss fine.

From a performance perspective, however, TCP suffers mightily when a segment is lost, because of TCP congestion management techniques. When a segment is lost, TCP backs off on its transmission rate (specifically, the number of segments in flight without acknowledgement), and then ramps back up again. By making the iSCSI traffic lossless, packets are slowed down during congestion, but the TCP congestion algorithm isn’t triggered. As a result, many iSCSI vendors recommend turning on old-timey flow control to keep packet loss to a minimum.
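That sawtooth behavior can be sketched with the classic AIMD rule (a toy model, ignoring slow start and fast recovery):

```python
def aimd(cwnd: float, loss: bool) -> float:
    """One round trip of a toy AIMD model: additive increase of one
    segment per RTT, halve the window on loss."""
    return max(cwnd / 2, 1.0) if loss else cwnd + 1.0

cwnd = 10.0
cwnd = aimd(cwnd, loss=True)   # one lost segment halves the send rate
```

One dropped segment cuts the rate in half, and clawing it back takes one RTT per segment; pausing the link avoids triggering that cycle at all.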

However, many switches today can’t actually do full losslessness. Take the venerable Catalyst 6500. It’s a switch that would be very common in data centers, and it is a frame murdering machine.

The problem is that while the Catalyst 6500 supports old-timey flow control on physical ports (it doesn’t support PFC), there’s no mechanism that I’m aware of to prevent buffer overruns from one port to another inside the switch. Take the example of two ingress Gigabit Ethernet ports sending traffic to a single egress Gigabit Ethernet port. Both ingress ports are running at line rate. There’s no signaling (at least that I’m aware of, could be wrong) that would prevent the ingress ports from overwhelming the transmit buffer of the egress port.

Many frames enter, not all leave

This is like flying to Hawaii and not reserving a hotel room before you get on the plane. You could land and have no place to stay. Because there’s no way to ensure losslessness on a Catalyst 6500 (or many other types of switches from various vendors), the Catalyst 6500 is like Thunderdome. Many frames enter, not all leave.

Catalyst 6500 shown with a Sup2T

The new generation of DCB (Data Center Bridging) switches, however, uses a concept known as VoQ (Virtual Output Queues). With VoQs, the ingress port will not send a frame to the egress port unless there’s room. If there isn’t room, the frame will stay in the ingress buffer until there’s room. If the ingress buffer fills up, the switch can signal the sending port it’s connected to to PAUSE (either old-timey pause or PFC).
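A toy model of the VoQ idea (the queue sizes are illustrative):

```python
from collections import deque

class VoQIngress:
    """Toy model of virtual output queues: one ingress-side queue per
    egress port; a frame is forwarded only when the egress buffer has
    room, so nothing is dropped inside the switch."""

    def __init__(self, egress_ports: int, egress_capacity: int):
        self.voqs = [deque() for _ in range(egress_ports)]
        self.egress = [0] * egress_ports   # frames sitting at each egress
        self.capacity = egress_capacity

    def enqueue(self, port: int, frame: str) -> None:
        self.voqs[port].append(frame)      # waits at ingress, never dropped

    def service(self, port: int) -> None:
        # forward only while the egress buffer has room
        while self.voqs[port] and self.egress[port] < self.capacity:
            self.voqs[port].popleft()
            self.egress[port] += 1

voq = VoQIngress(egress_ports=2, egress_capacity=1)
voq.enqueue(0, "frame-1")
voq.enqueue(0, "frame-2")
voq.service(0)   # only one frame fits at egress; the other waits
```

Because one slow egress port only backs up its own virtual queue, it can’t starve traffic headed to other egress ports, which is the other win over a single shared ingress buffer.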

This is a technique that’s been in use in Fibre Channel switches from both Brocade and Cisco (as well as others) for a while now, and it is now making its way into DCB Ethernet switches from various vendors. Cisco’s Nexus line, for example, makes use of VoQs, and so do Brocade’s VCS switches. Some type of lossless ability between internal ports is required in order to be a DCB switch, since FCoE requires losslessness.

DCB switches require lossless backplanes/internal fabrics, support for PFC, ETS (Enhanced Transmission Selection, a way to reserve bandwidth on various CoS lanes), and DCBx (a way to communicate these capabilities to adjacent switches). This makes them capable of a lot of cool stuff that non-DCB switches can’t do, such as losslessness.

One thing to keep in mind, however, is when Layer 3 comes into play. My guess is that even in a DCB switch that can do Layer 3, losslessness can’t be extended beyond a Layer 2 boundary. That’s not an issue with FCoE, since it’s only Layer 2, but iSCSI can be routed.

Po-tay-to, Po-ta-to: Analogies and NPIV/NPV

In a recent post, I took a look at the Fibre Channel subjects of NPIV and NPV, both topics covered in the CCIE Data Center written exam (currently in beta, take yours now, $50!). The post generated a lot of comments. I mean, a lot. Over 50 so far (and still going).  An epic battle (although very unInternet-like in that it was very civil and respectful) brewed over how Fibre Channel compares to Ethernet/IP. The comments look like the aftermath of the battle of Wolf 359.

Captain, the analogy regarding squirrels and time travel didn’t survive

One camp, led by Erik Smith from EMC (who co-wrote the best Fibre Channel book I’ve seen so far, and it’s free), compares WWPNs to IP addresses, and FCIDs to MAC addresses. Some others, such as Ivan Pepelnjak and myself, compare WWPNs to MAC addresses, and FCIDs to IP addresses. There were many points and counter-points. Valid arguments were made supporting each position. Eventually, people agreed to disagree. So which one is right? They both are.

Wait, what? Two sides can’t be right, not on the Internet!

When comparing Fibre Channel to Ethernet/IP, it’s important to remember that they are different. In fact, significantly different. The only purpose of relating Fibre Channel to Ethernet/IP is to introduce those who are familiar with Ethernet/IP to the world of Fibre Channel. Many (most? all?) people learn by building associations from known subjects (in our case Ethernet/IP) to lesser-known subjects (in this case Fibre Channel).

Of course, any association includes its inherent inaccuracies. We purposefully sacrifice some accuracy in order to attain relatability. Specific details and inaccuracies are glossed over. To some, introducing any inaccuracy is sacrilege. To me, it’s being overly pedantic. Pedantic details are for the expert level. Using pedantic facts as an admonishment of an analogy misses the point entirely. With any analogy, there will always be inaccuracies, and there will always be many analogies to be made.

Personally, I still prefer the WWPN ~= MAC/FC_ID ~= IP approach, and will continue to use it when I teach. But the other approach I believe is completely valid as well. At that point, it’s just a matter of preference. Both roads lead to the same destination, and that is what’s really important.

Learning always happens in layers. Coat after coat is applied, increasing in accuracy and pedantic detail as you go along. Analogy is a very useful and effective tool for learning any subject.

TRILLapalooza

If there’s one thing people lament in the routing and switching world, it’s the spanning tree protocol and the way Ethernet forwarding is done (or more specifically, its limitations). I made my own lament last year (don’t cross the streams), and it’s come up recently in Ivan Pepelnjak’s blog. Even server admins who’ve never logged into a switch in their lives know what spanning-tree is: it is the destroyer of uptime, causer of Sev1 events, a pox among switches.

I am become spanning-tree, destroyer of networks

The root of the problem is the way Ethernet forwarding is done: There can’t be more than one path for an Ethernet frame to take from a source MAC to a destination MAC. This basic limitation has not changed for the past few decades.

And yet, for all the outages and all of the configuration issues and problems spanning-tree has caused, there doesn’t seem to be much enthusiasm for the more fundamental cures: TRILL (and the current proprietary implementations), SPB, QFabric, and to a lesser extent OpenFlow (data center use cases), and others. And although OpenFlow has been getting a lot of hype, it’s more because VCs are drooling over it than from its STP-slaying ways.

For a while in the early 2000s, it looked like we might get rid of it for the most part. There was a glorious time when we started to see multi-layer switches that could route Layer 3 as fast as they could switch Layer 2, giving us the option of getting rid of spanning-tree entirely. Every pair of switches, even at the access layer, would be its own Layer 3 domain. Everything was routed to everywhere, and the broadcast domains were very small, so there wasn’t a possibility for Ethernet to take multiple paths. And with Layer 3 routing, multi-pathing was easy through ECMP. Convergence on a failed link was way faster than spanning tree.

Then virtualization came and screwed it all up. Now Layer 3 wasn’t going to work for a lot of the workloads, and we needed to build huge Layer 2 networks. For some non-virtualization uses, though, the Layer 3-everywhere solution works great. To take a look at a wonderfully multi-path, high-bandwidth environment, check out Brad Hedlund’s own blog entry on creating a Hadoop super network with shit-tons of bandwidth out of 10G and 40G low latency ports.

Overlays

Which brings me to overlays. There are some that propose overlay networks, such as VXLAN, NVGRE, and Nicira, as solutions to the Ethernet multipathing problem (among other problems). An overlay technology like VXLAN not only brings us back to the glory days of no spanning-tree by routing to the access layer, but solves another issue that plagues large-scale deployments: 4000+ VLANs ain’t enough. VXLAN, for instance, has a 24-bit identifier on top of the normal 12-bit 802.1Q VLAN identifier, so that’s 2^36 separate broadcast domains, giving us the ability to support 68,719,476,736 VLANs. Hrm… that would be….

While I like (and am enthusiastic about) overlay technologies in general, I’m not convinced they are the final solution we need for Ethernet’s current forwarding limitations. Building an overlay infrastructure (at least right now) is a more complicated (and potentially more expensive) prospect than TRILL/SPB, depending on how you look at it. Availability is also an issue currently (likely to change, of course), since NVGRE has no implementations I’m aware of, and VXLAN only has one (Cisco’s Nexus 1000v). Also, VXLAN doesn’t terminate into any hardware currently, making it difficult to put in load balancers and firewalls that aren’t virtual (as mentioned in the Packet Pushers’ VXLAN podcast).

Of course, I’m afraid TRILL doesn’t have it much better in the way of availability. Only two vendors that I’m aware of ship TRILL-based products: Brocade with VCS and Cisco with FabricPath, and both FabricPath and VCS only run on a few switches out of their respective vendors’ offerings. As has often been discussed (and lamented), TRILL has a new header format, so new silicon is needed to implement TRILL (or TRILL-based) offerings in any switch. So sadly it’s not just a matter of adding new firmware; the underlying hardware needs to support it too. For instance, the Nexus 5500s from Cisco can do TRILL (and the code has recently been released), while the Nexus 5000 series cannot.

It had been assumed that the vendors that use merchant silicon for their switches (such as Arista and Dell Force10) couldn’t do TRILL, because the merchant silicon didn’t support it. Turns out, that’s not the case. I’m still not sure which chips from the merchant vendors can and can’t do TRILL, but the much-ballyhooed Broadcom Trident/Trident+ chipset (BCM56840, I believe, thanks to #packetpushers) can do TRILL. So anything built on Trident should be able to do TRILL. Which right now is a ton of switches. Broadcom is making it rain Tridents right now. The new Intel/Fulcrum chipsets can do TRILL as well, I believe.

TRILL does have the advantage of being stupid easy, though. Ethan Banks and I were paired up during NFD2 at Brocade and tasked with configuring VCS (built on pre-standard TRILL). It took us 5 minutes and just a few commands. FabricPath (Cisco’s pre-standard implementation built on TRILL) is also easy: 3 commands. If you can’t configure FabricPath, you deserve the smug look you get from Smug Cisco Guy. Here is how you turn on FabricPath on a Nexus 7K:

switch# config terminal 
switch(config)# feature-set fabricpath
switch(config)# mac address learning-mode conversational vlan 1-10 

Non-overlay solutions to STP without TRILL/SPB/QFabric/etc. include MLAG (commonly known by Cisco’s trademarked term EtherChannel) and MC-LAG (Multi-chassis Link Aggregation), also known as VLAG, vPC, or VSS depending on the vendor. They provide multi-pathing in the sense that while there are multiple active physical paths, no single flow will have more than one possible path, providing both redundancy and full link utilization. But it’s all manually configured at each link, and not nearly as flexible (or easy) as TRILL to instantiate. MLAG/MC-LAG can provide simple multi-path scenarios, while TRILL is so flexible you can actually get yourself into trouble (as Ivan has mentioned here). So while MLAG/MC-LAG work as workarounds, why not just fix what they work around? It would be much simpler.

Vendor Lock-In or FUD?

Brocade’s VCS and Cisco’s FabricPath are currently proprietary implementations of TRILL, and won’t work with each other or any other version of TRILL. The assumption is that when TRILL becomes more prevalent, they will have standards-based implementations that will interoperate (Cisco and Brocade have both said they will). But for now, it’s proprietary. Oh noes! Some vendors have decried this as vendor lock-in, but I disagree. For one, you’re not going to build a multi-vendor fabric, like staggering two different vendors every other rack. You might not have just one vendor amongst your networking gear, but your server switch blocks, core/aggregation, and other such groupings of switches are very likely to be single-vendor. Every product has a “proprietary boundary” (new term! I made it!). Even Token Ring, totally proprietary, could be bridged to traditional Ethernet networks. You can also connect your proprietary TRILL fabrics to traditional STP domains at the edge (although there are design concerns, as Ivan Pepelnjak has noted).

QFabric will never interoperate with another vendor; that’s their secret sauce (running on Broadcom Trident+, if the rumors are to be believed). Still, QFabric is STP-less, so I’m a fan. And like TRILL, it’s easy. My only complaint about QFabric right now is that it requires a huge port count (500+ 10 Gbit ports) to make sense (so does the Nexus 7000 with TRILL, but you can also do 5500s now). Interestingly enough, Juniper’s Anjan Venkatramani did a hit piece on TRILL, but the joke is on them, because it’s on TechTarget behind a register-wall, so no one will read it.

So far, the solutions for Ethernet forwarding are as follows: overlay networks (may be fantastic for large environments, though very complex), Layer 3 everywhere (doable, but challenging in certain environments), and MLAG/MC-LAG (tough to scale and manually configured, but workable). All of that is fine. I’ve nothing against any of those technologies. In fact, I’m getting rather excited about VXLAN/Nicira overlays. I still think we should fix Layer 2 forwarding with TRILL, SPB, or something like it. And even if every vendor went full bore on one standard, it would be several years before we were able to totally rid our networks of spanning-tree.

But wouldn’t it be grand?

OpenFlow’s Awkward Teen Years

During the Networking Field Day 3 (The Nerdening) event, I attended a presentation by NEC (yeah, I know, turns out they make switches and have a shipping OpenFlow product, who knew?).

This was actually the second time I’ve seen this presentation. The first was at Networking Field Day 2 (The Electric Boogaloo), and my initial impressions can be found on my post here.

So what is OpenFlow? Well, it could be a lot of things. More generically, it could be considered Software Defined Networking (SDN). (All OpenFlow is SDN, but not all SDN is OpenFlow?) It’s adding a layer of intelligence that networks have previously lacked. On a more technical level, OpenFlow is a way to program the forwarding tables of L2/L3 switches using a standard API.

Rather than each switch building its own forwarding tables through MAC learning, routing protocols, and policy-based routing, the switches (which could be from multiple vendors) are brainless zombies, accepting forwarding table updates from a maniacal controller that can perceive the entire network.

This could be used for large data centers or large global networks (WAN), but the NEC demonstration was a data center fabric representation of OpenFlow. In some respects, it’s similar to Juniper’s QFabric and Brocade’s VCS. They all have a boundary, and the connections to networks outside of that boundary are made through traditional routing and switching mechanisms, such as OSPF or MLAG (Multi-chassis Link Aggregation). Juniper’s QFabric is based on Juniper’s own proprietary machinations, Brocade’s VCS is based on TRILL (although not [yet?] interoperable with other TRILL implementations), and NEC’s OpenFlow is based on, well, OpenFlow.

While it was the same presentation, the delegates (a few of us had seen the presentation before; most had not) were much tougher on NEC than we were last time. Or at least I’m sure it seemed that way. The fault wasn’t NEC’s; it was the fact that our understanding of OpenFlow and its capabilities has increased, and that’s the awkward teen years of any technology. We were poking and prodding, thinking of new uses and new implications, limitations and caveats.

As we start to understand a technology better, we start to see its potential, as well as probe its drawbacks. Everything works great in PowerPoint, after all. So while our questions seemed tougher, don’t take that as a sign that we’re souring on OpenFlow. We’re only tough because we care.

One thing that OpenFlow will likely not be is a way to manipulate individual flows, or at least all individual flows. The “flow” in OpenFlow does not necessarily (and in all likelihood would rarely, if ever) represent an individual 6-tuple TCP connection or UDP flow (source and destination MAC, IP, and TCP/UDP port). One of the common issues that OpenFlow haters/doubters/skeptics have brought up is that an OpenFlow controller can only program about 700 flows per second into a switch. That’s certainly not enough to handle a site that may see hundreds of thousands of new TCP connections/UDP flows per second.

But that’s not what OpenFlow is meant to do, nor is that a huge issue. Think about routing protocols and Ethernet forwarding mechanisms. Neither deals with specific flows, only general ones (all traffic to a particular MAC address goes to this port, this /8 network goes out this interface, etc.). OpenFlow isn’t any different. So a limit of 700 flows per second per switch? Not a big deal.
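To make the coarseness of these “flows” concrete, here’s a toy sketch in plain Python (my own illustration, not any real controller API) of wildcard flow rules: a single installed rule covers an arbitrary number of individual connections, which is why a modest rule-install rate isn’t the bottleneck it might seem.

```python
# Toy flow table with wildcard rules, in the spirit of OpenFlow matching.
# One rule can cover millions of individual TCP/UDP flows, which is why
# ~700 rule installs/sec per switch is plenty. (Illustrative only.)

WILDCARD = None  # field value meaning "match anything"

def matches(rule, packet):
    """A rule matches if every non-wildcard field equals the packet's."""
    return all(v == WILDCARD or packet.get(k) == v
               for k, v in rule["match"].items())

def lookup(flow_table, packet):
    """Return the output port of the first matching rule, else drop."""
    for rule in flow_table:
        if matches(rule, packet):
            return rule["out_port"]
    return "drop"

flow_table = [
    # One coarse rule: all traffic to the web server VIP goes out port 2.
    {"match": {"dst_ip": "10.0.0.80", "dst_port": WILDCARD,
               "src_ip": WILDCARD}, "out_port": 2},
    # Catch-all: everything else goes out the uplink.
    {"match": {"dst_ip": WILDCARD, "dst_port": WILDCARD,
               "src_ip": WILDCARD}, "out_port": 1},
]

# Thousands of distinct client connections, one installed rule.
pkt = {"src_ip": "192.168.1.50", "dst_ip": "10.0.0.80", "dst_port": 443}
print(lookup(flow_table, pkt))  # -> 2
```

Every new client connection to 10.0.0.80 hits the same already-installed rule; no controller round trip required.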

OpenFlow is another way to build an Ethernet fabric, which means offering services and configurations beyond just a Layer 2/Layer 3 network.

Think of how we deal with firewalls now. Scaling is an issue, so they’re often choke points. You have to direct traffic in and out of them (NAT, transparent mode, routed mode), and they’re often deployed in pairs. N+1 firewalls are not that common (and often a huge pain in the ass to configure, although it’s been a while). With OpenFlow (or SDN in general) it’s possible to define an endpoint (VM or physical) and say that endpoint requires its flows to pass through a firewall. Since we can steer flows (again, not on a per-individual-flow basis, but with general catch-most rules), scaling a firewall isn’t an issue. Need more capacity? Throw another firewall/IPS/IDS on the barbie. OpenFlow could put forwarding rules on the switches and steer flows between active/active/active firewalls. These services could also be tied to the VM, and not the individual switch port (which is a characteristic of a fabric).
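One way a controller could spread traffic across an active/active/active firewall pool, sketched in Python: hash on the source address so each client consistently lands on the same firewall (keeping stateful sessions intact), while adding capacity is just growing the pool. This is hypothetical steering logic of my own, not any vendor’s implementation.

```python
# Sketch: steering traffic across N active firewalls with coarse rules.
# Hashing on source address (not per-connection state) means each
# firewall sees a consistent slice of traffic. Hypothetical logic only.
import hashlib

FIREWALLS = ["fw1", "fw2", "fw3"]  # active/active/active pool

def steer(src_ip):
    """Pick a firewall deterministically from the source address."""
    digest = hashlib.md5(src_ip.encode()).digest()
    return FIREWALLS[digest[0] % len(FIREWALLS)]

# The same client always lands on the same firewall, so existing
# stateful sessions survive; need more capacity? Grow FIREWALLS.
for ip in ("10.1.1.1", "10.2.2.2", "10.3.3.3"):
    print(ip, "->", steer(ip))
```

A real controller would install these decisions as catch-most forwarding rules on the switches rather than hashing per packet, but the scaling idea is the same.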

Myself, I’m all for any technology that abolishes spanning-tree/traditional Layer 2 forwarding. It’s an assbackwards way of doing things, and if that’s how we ran Layer 3 networks, networking administrators would have revolted by now. (I’m surprised more network administrators aren’t outraged by spanning-tree, but that’s another blog post).

NEC’s OpenFlow (and OpenFlow in general) is an interesting direction in the effort to make networks more intelligent (aka “fabrics”). Some vendors are a bit dismissive of OpenFlow, and of fabrics in general. My take is that fabrics are a good thing, especially with virtualization. But more on that later. NEC’s OpenFlow product is interesting, but like any new technology it’s got a lot to prove (and the fact that it’s NEC means they have even more to prove).

Disclaimer: I was invited graciously by Stephen Foskett and Greg Ferro to be a delegate at Networking Field Day 3. My time was unpaid. The vendors who presented indirectly paid for my meals and lodging, but otherwise the vendors did not give me any compensation for my time, and there was no expectation (specific or implied) that I write about them, or write positively about them. I’ve even called a presenter’s product shit, because it was. Also, I wrote this blog post mostly in Aruba while drunk.

A High Fibre Diet: Twisted Pair Strikes Back

I saw a tweet recently from storage and virtualization expert Stu Miniman regarding Emulex announcing copper 10GBase-T Converged Network Adapters, running 10 Gigabit Ethernet over copper (specifically Cat 6a cable).

I recalled a comment Greg Ferro made on a Packet Pushers episode (and subsequent blog post) about copper not being reliable enough for storage, with the specific issue being the bit error rate (BER), or how many errors the standard (FC, Ethernet, etc.) will allow over a physical medium. As we’ve talked about before, networking people tend to be a little more devil-may-care about their bits, whereas storage folks get all anal-retentive chef about their bits.

For 1 Gigabit Ethernet over copper (802.3ab/1000Base-T), the standard calls for a goal BER of less than 10^-10, or one wrong bit in every 10,000,000,000 bits. Incidentally, that’s one error every second at line rate for 10 Gigabit Ethernet. For Gigabit, that’s one error every 10 seconds, or 6 per minute.

Fibre Channel has a BER goal of less than 10^-12, or one error in every 1,000,000,000,000 bits. That would be about one error every 100 seconds at 10 Gigabit Ethernet line rate. That’s also 100 times less error-prone than Ethernet’s target, which if you think about it, is a lot.

To give a little scale, that’s like comparing Barney Fife from The Andy Griffith Show’s badassery to Jason Statham’s character in… well, any movie he’s ever been in.

Holy shit, is he fighting… truancy?

Barney Fife, the 10^-10 error rate of law enforcement. Wait… Wow, did I really just say that?

So given how fastidious storage folks can be about their storage networks, it’s understandable that storage administrators wouldn’t want their precious SCSI commands running over a network that’s 100 times less reliable than Fibre Channel.

However, while the Gigabit Ethernet standard has a BER target of less than 10^-10, the 802.3an standard for 10 Gigabit Ethernet over copper (10GBase-T) has a BER goal of less than 10^-12, which is in line with Fibre Channel’s goal. So is 10 Gigabit Ethernet over Cat 6A good enough for storage (specifically FCoE)? Sounds like it.

But the discussion also got me thinking: how close do we get to 10^-10 as an error rate in Gigabit Ethernet? I just checked all the physical interfaces in my data center (laundry room), and every error counter is zero (presumably most errors would show up as CRC errors). And all it takes to hit 10^10 bits is 1.25 gigabytes of data transfer, which I do when I download a movie off of iTunes. I know I’ve put dozens of gigs through my desktop since it was last rebooted, and nary an error. And my cabling isn’t exactly data center standard. One cable I use came with a cheap wireless access point I got a while ago. It makes me curious as to what the actual BER is in reality with decent cables that don’t come close to 100 meters.
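For anyone who wants to sanity-check these error-rate figures, the arithmetic is simple enough to express in a few lines of Python:

```python
# Back-of-the-envelope check of the BER numbers: mean time between
# bit errors is 1 / (line rate in bits/sec * bit error rate).

def seconds_per_error(line_rate_bps, ber):
    """Mean seconds between bit errors at full line rate."""
    return 1.0 / (line_rate_bps * ber)

gig, ten_gig = 1e9, 1e10

print(seconds_per_error(ten_gig, 1e-10))  # ~1 s   (10GbE at Ethernet's target)
print(seconds_per_error(gig, 1e-10))      # ~10 s  (GigE at Ethernet's target)
print(seconds_per_error(ten_gig, 1e-12))  # ~100 s (10GbE at FC's target)

# And the data needed to hit 10^10 bits: 1.25 GB.
print(1e10 / 8 / 1e9)  # ~1.25 gigabytes
```

These are worst-case figures at sustained line rate; a link pushing less traffic would see proportionally fewer errors.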

Of course, there’s still the power consumption issues and other drawbacks that Greg mentioned when compared to fiber (or coax). However, it’ll be good to have another option. There are some shops that won’t likely ever have fiber optics deployed.

Gigamon Side Story

The modern data center is a lot like modern air transportation. Not nearly as sexy as it used to be, the food isn’t nearly as good as it used to be, and more choke points than we used to deal with.

With 10 Gigabit Ethernet fabrics available from vendors like Cisco, Juniper, Brocade, et al., we can conceive of these great, non-blocking, lossless networks that let us zip VMs and data to and fro.

And then reality sets in. The security team needs its inspection points. That means firewalls, IPS, and IDS devices. And one thing they’re not terribly good at? Gigs and gigs of traffic. Also scaling. And not pissing me off.

Pictured: Firewall Choke Points

This battle between scalability and security has data center administrators and security groups rumbling like some sort of West Side Data Center Story.

Dun dun da dun! Scalability!

Dun doo doo ta doo! Inspection!

So what to do? Enter Gigamon, the makers of the orangiest network devices you’ll find in a data center. They were part of Networking Field Day 2, which I participated in back in October.

Essentially what Gigamon allows you to do is scale out your SPAN/Mirror ports. On most Cisco switches, only two ports at a time can be spitting mirrored traffic. For something like a Nexus 7000 with up to 256 10 Gigabit Interfaces, that’s usually not sufficient for monitoring anything but a small smattering of your traffic.

A product like Gigamon can tap fiber and copper links, or take in the output of a SPAN port, classify the traffic, and send it out an appropriate port. This would allow a data center to effectively scale traffic monitoring in a way that’s not possible with mere mirrored ports alone. It would effectively remove the choke points that we normally associate with security. You’d just need to scale up with the appropriate number of IDS/IPS devices.
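Conceptually, what a packet broker like this does is a classify-and-fan-out step, which can be sketched in a few lines of Python (the rules and tool-port names here are my own invented examples, not Gigamon’s actual configuration model):

```python
# Rough sketch of what a packet broker does: take traffic in from
# taps/SPAN ports, classify it, and fan it out to the right tool
# ports. Hypothetical rules; not any vendor's real configuration.

# (filter function, tool port) pairs, checked in order.
TOOL_MAP = [
    (lambda p: p.get("dst_port") in (80, 443), "ids-pool"),
    (lambda p: p.get("proto") == "smtp",       "dlp-appliance"),
]

def classify(packet):
    """Send matching traffic to a tool port; ignore the rest."""
    for test, tool_port in TOOL_MAP:
        if test(packet):
            return tool_port
    return "discard"

print(classify({"dst_port": 443}))  # -> ids-pool
print(classify({"proto": "smtp"}))  # -> dlp-appliance
print(classify({"dst_port": 22}))   # -> discard
```

Because the classification happens out-of-band on mirrored traffic, the production path never waits on the tools; that’s what removes the choke point.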

But with great power comes the ability to do some unsavory things. During the presentation Gigamon mentioned they’d just done a huge install with Russia (note: I wouldn’t bring that up in your next presentation), allowing the government to monitor data of its citizens. That made me less than comfortable (and it’s also why it scares the shit out of Jeremy Gaddis). But “hey, that’s how Russia rolls” you might say. We do it here in the US, as well, through the concept of “lawful interception”. Yeah, I did feel a little dirty after that discussion.

Still, it could be used for good by removing the standard security choke points. Even if you didn’t need to IPS every packet in your data center, I would still consider architecting a design with Gigamon or another vendor like them in mind. It wouldn’t be difficult to work out where to put the devices, and it could save loads of time in the long run. If a security edict came down from on high, the appropriate devices could be put in place with Gigamon providing the piping without choking your traffic.

In the meantime, I’m going to make sure everything I do is SSL’d.

Note: As a delegate/blogger, my travel and accommodations were covered by Gestalt IT, whom vendors paid for presentation spots during Networking Field Day. Vendors pay Gestalt IT to present, so while my travel (hotel, airfare, meals) was covered indirectly by the vendors, no other remuneration (save for the occasional tchotchke) was received from any of the vendors, directly or indirectly, or from Gestalt IT. Vendors were not promised, nor did they ask, that any of us write about them, or write about them positively. In fact, we sometimes say their products are shit (when, to be honest, sometimes they are, although this one wasn’t). My time was unpaid.

A Tale of Two FCoEs

A favorite topic of discussion among the data center infrastructure crowd is the state of FCoE. Depending on who you ask, FCoE is dead, stillborn, or thriving.

So, which is it? Are we dealing with FUD, or are we dealing with vendor hype? Is FCoE a success, or is it a failure? The quick answer is… yes? FCoE is both thriving and yet-to-launch. So… are we dealing with Schrödinger’s protocol?

Not quite. To understand the answer, it’s important to make the distinction between two very different ways that FCoE is implemented: Edge FCoE and Multi-hop FCoE (a subject I’ve written about before, although I’ve renamed things a bit).

Edge FCoE

Edge FCoE is thriving, and has been for the past few years. Edge FCoE is when you take a server (or sometimes a storage array) and connect it to an FCoE switch, and everything beyond that first switch is either native Fibre Channel or native Ethernet.

Edge FCoE is distinct from multi-hop for one main reason: it’s a helluva lot easier to pull off. With edge FCoE, the only switch that needs to understand FCoE is that edge FCoE switch. Edge switches plug into traditional Fibre Channel networks over traditional Fibre Channel links (typically in NPV mode).

Essentially, no other part of your network needs to do anything you haven’t done already. You do traditional Ethernet, and traditional Fibre Channel. FCoE only exists in that first switch, and is invisible to the rest of your LAN and SAN.

Here are the things you (for the most part) don’t have to worry about configuring on your network with Edge FCoE:

  • Data Center Bridging (DCB) technologies
    • Priority Flow Control (PFC) which enables lossless Ethernet
    • Enhanced Transmission Selection (ETS), which allows dedicating bandwidth to various traffic classes (not required, but recommended, per Ivan Pepelnjak)
    • DCBx: A method to communicate DCB functionality between switches over LLDP (oh, hey, you do PFC? Me too!)
  • Whether or not your aggregation and core switches support FCoE (they probably don’t, or at least the line cards don’t)

There is PFC and DCBx on the server-to-edge FCoE link, but it’s typically inherent: supported by both the CNA and the edge FCoE switch, and turned on by default or auto-detected. In some implementations, there’s nothing to configure; PFC is there, and unalterable. Even if there are some settings to tweak, it’s generally easier to do on edge ports than on an aggregation/core network.

Edge FCoE is the vast majority of how FCoE is implemented today. Everything from Cisco’s UCS to HP’s C7000 series can do it, and do it well.
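For a sense of how little there is to it, here’s roughly what bringing up edge FCoE looks like on a Cisco Nexus 5000: map an FCoE VLAN to a VSAN, trunk it to the server-facing port, and bind a virtual Fibre Channel interface to that port. This is a sketch from memory, so check your platform’s documentation for exact syntax.

```text
feature fcoe                    ! enable the FCoE feature set
vlan 100
  fcoe vsan 100                 ! map an FCoE VLAN to a VSAN
interface ethernet 1/1          ! the server-facing CNA port
  switchport mode trunk
  switchport trunk allowed vlan 1,100
interface vfc 1                 ! virtual Fibre Channel interface
  bind interface ethernet 1/1
  no shutdown
vsan database
  vsan 100 interface vfc 1
```

Nothing in the rest of the LAN or SAN knows or cares that this happened; the vfc interface joins the fabric like any other FC port.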

Multi-Hop

The very term multi-hop FCoE is controversial in nature (just check the comments section of my FCoE terminology article), but for the sake of this article, multi-hop FCoE is any topological implementation of FCoE where FCoE frames move around a converged network beyond a single switch.

Multi-hop FCoE requires a few things: a Fibre Channel-aware network, losslessness through priority flow control (PFC), DCBx (Data Center Bridging Exchange), and enhanced transmission selection (ETS). Add those up and you’ve got a recipe for a switch that I’m pretty sure ain’t in your rack right now. For instance, the old man of the data center, the Cisco Catalyst 6500, doesn’t do FCoE now, and likely never will.

Switch-wise, there are two ways to do multi-hop FCoE: a switch can either forward FCoE frames based on the Ethernet headers (MAC address source/destination), or forward frames based on the Fibre Channel headers (FCID source/destination).

Ethernet-forwarded/Pass-through Multi-hop

If you build a multi-hop network with switches that forward based on Ethernet headers (as Juniper and Brocade do), then you’ll want something other than spanning-tree to do loop prevention and enable multi-pathing. Brocade uses a method based on TRILL, and Juniper uses their proprietary QFabric (based on unicorn tears).

Ethernet-forwarded FCoE switches don’t have a full Fibre Channel stack, so they’re unaware of what goes on in the Fibre Channel world, such as zoning, with the exception of FIP (FCoE Initialization Protocol), which handles discovery of attached Fibre Channel devices (connecting virtual N_Ports to virtual F_Ports).

FC-Forwarded/Dual-stack Multi-hop

If you build a multi-hop network with switches that forward based on Fibre Channel headers, your FCoE switch needs to have both a full DCB-enabled Ethernet stack, and a full Fibre Channel stack. This is the way Cisco does it on their Nexus 5000s, Nexus 7000s, and MDS 9000 (with FCoE line cards), although the Nexus 4000 blade switch is the Ethernet-forwarded kind of switch.

The benefit of using an FC-forwarded switch is that you don’t need a network that does TRILL or anything fancier than spanning tree (spanning tree isn’t enabled on any VLAN that passes FCoE). It’s pretty much a Fibre Channel network, with the ports being Ethernet instead of Fibre Channel. In fact, in Cisco’s FCoE reference design, storage and networking traffic are still port-gapped (a subject of a future blog post): FCoE frames and regular networking frames don’t run over the same links; there are dedicated FCoE links.

It’s like running a Fibre Channel SAN that just happens to sit on top of your Ethernet network. As Victor Moreno, the LISP project manager at Cisco, says: “The only way is to overlay.”
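The distinction between the two multi-hop styles boils down to which header the switch keys its forwarding lookup on. Here’s a toy illustration in Python (frames simplified to dicts, with made-up table entries; this is neither vendor’s actual pipeline):

```python
# Toy illustration of the two multi-hop FCoE forwarding styles:
# Ethernet-forwarded switches key on the outer Ethernet header,
# FC-forwarded switches decapsulate and key on the FC header.

eth_fib = {"00:0a:0b:0c:0d:0e": "port-3"}  # MAC -> port (Ethernet-forwarded)
fc_fib = {0x010203: "port-7"}              # FCID -> port (FC-forwarded)

def ethernet_forwarded(frame):
    """Pass-through switch: looks only at the outer Ethernet header."""
    return eth_fib.get(frame["dst_mac"], "flood")

def fc_forwarded(frame):
    """Dual-stack switch: looks at the encapsulated FC header."""
    return fc_fib.get(frame["fc_dst_id"], "drop")

frame = {"dst_mac": "00:0a:0b:0c:0d:0e", "fc_dst_id": 0x010203}
print(ethernet_forwarded(frame))  # -> port-3
print(fc_forwarded(frame))        # -> port-7
```

The practical consequence is what the rest of the network must support: Ethernet-forwarded fabrics need TRILL-like multipathing to avoid loops, while FC-forwarded fabrics behave like Fibre Channel and can lean on FSPF-style logic instead.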

State of FCoE

It’s not accurate to say that FCoE is dead, or that FCoE is a success, or anything in between really, because the answer is very different once you separate multi-hop and edge-FCoE.

Currently, multi-hop has yet to launch in a significant way. In the past 2 months, I have heard rumors of a customer here or there implementing it, but I’ve yet to hear any confirmed reports or firsthand tales. I haven’t even configured it personally. I’m not sure I’m quite as wary as Greg Ferro is, but I do agree with his wariness. It’s new, it’s not widely deployed, and that makes it riskier. There are interoperability issues, which are in some ways moot given that no one is doing Ethernet fabrics in a multi-vendor way, and NPV/NPIV can help keep things “native”. But historically, Fibre Channel vendors haven’t played well together. Stephen Foskett lists interoperability among his reasonable concerns with FCoE multi-hop. (Greg, Stephen, and everyone else I know are totally fine with edge FCoE.)

Edge FCoE is of course vibrant and thriving. I’ve configured it personally, and it fits easily and seamlessly into an existing FC/Ethernet network. I have no qualms about deploying it, and anyone doing convergence should at least consider it.

Crystal Ball

In terms of networking and storage, it’s impossible to tell what the future will hold. There are a number of different directions FCoE, iSCSI, NFS, DCB, Ethernet fabrics, et al. could go. FCoE could end up replacing Fibre Channel entirely, or it could be relegated to the edge and never move from there. Another possibility, as suggested to me by Stephen Foskett, is that Ethernet will become the connection standard for Fibre Channel devices. They would still be called Fibre Channel switches, and SANs would be set up just like they always have been, but instead of having 8/16/32 Gbit FC ports, they’d have 10/40/100 Gbit Ethernet ports. To paraphrase Bob Metcalfe, “I don’t know what will come after Fibre Channel, but it will be called Ethernet.”