Fibre Channel and FCoE: Some Basics

There’s been some misconceptions and misinformation lately about FCoE. Like any technology, there are times when it makes sense and times when it doesn’t, but much of the anti-FCoE talk lately has been primarily ignorance and/or wilful misrepresentation.

In an effort to fight that ignorance, I put together a quick introduction to how FC and FCoE works. They both operate on the basic premise that you can’t drop any frames. Fibre Channel was built as a lossless protocol, and with a bit of work, Ethernet can also be lossless.

Check it out:

Learn what Russ Fellows Doesn’t Know

So how’s this for a condescending tweet?

It’s from Russ Fellows, author of the infamous FCoE “study” (which has been widely debunked for its many hilarious errors):

Interesting article (check it out). But the sad/amusing irony is that he’s wrong. How is he wrong? Here’s what Russ Fellows doesn’t know about storage:

1, 2, 4, and 8 Gbit Fibre Channel (as he points out) uses 8/10 bit encoding. That means about a 20% of the bandwidth available was lost due to encoding overhead (as Russ pointed out). That’s why 8 Gbit Fibre Channel only provides 800 MB/s of connectivity, even though 8,000 Megabits per second equates to 1,000 Megabytes per second (8000 Megabits / (8 bits per byte) = 1,000 Megabytes).

With this overhead in mind, Fibre Channel was designed to give 100 MB/s for every Gigabit of speed. It never increased the baud rate to make up for the overhead.

Ethernet, on the other hand, did increase the baud rate to make up for the overhead. Gigabit Ethernet uses the same 8/10 bit encoding, but they kicked the baud rate up to 1.25 gigabaud to make up the differences. As such, Gigabit Ethernet provides true 1 gigabit of throughput, or 125 Megabytes per second.

10 Gigabit Ethernet moved to 64/66 encoding, and kept to the approach of not letting the overhead impact throughput. 10 Gigabit Ethernet then provides 1250 Megabytes per second of throughput. The baud rate is 10.3125, giving true 10 Gigabit per second of data.

When Fibre Channel moved to the more efficient 64/66 bit encoding, rather than change the 100 MB/s per gigabit to 125 MB/s (which you get with all Ethernet speeds), they left the ratio (1 Gigabit to 100 MB/s) the same. Thus, every Gigabit = 100 MB/s, just like in previous speeds (1/2/4/8 FC). So while 16 Gbit Fibre Channel provides 1600 MB/s of throughput, the baud rate is actually only 14 gigabaud, and not true 16 Gbit. And don’t take my word for it, check out page 7 of Scott Shimomura‘s (of Brocade) presentation at the SPDE conference.

  • 1 Gbit Fibre Channel = 100 MB/s
  • 1 Gbit Ethernet = 125 MB/s
  • 2 Gbit Fibre Channel = 200 MB/s
  • 4 Gbit Fibre Channel = 400 MB/s
  • 8 Gbit Fibre Channel = 800 MB/s
  • 10 Gbit Ethernet/FCoE = 1250 MB/s
  • 16 Gbit Fibre Channel = 1600 MB/s

10 Gigabit Ethernet provides 1250 MB/s, providing true 10 Gigabit Ethernet, and not putting the slight overhead into the equation. So while 10 Gigabit Ethernet is true 10 Gigabit, 16 Gigabit Fibre Channel is actually 14 Gigabit Fibre Channel (14.025, to be exact).

And that’s what Russ Fellows doesn’t know. His entire article is based on a false premise: Thinking that the move to 64/66 makes 16 Gbit pass more than twice as much traffic as 8 Gbit. But it’s not. He says that with 8 Gbit FC, 1+1 = 1.6 (when compared to 16 Gbit FC), which is factually incorrect for the reasons I’ve just explained. Yes, 64/66 bit encoding is more efficient. But they dropped the baud rate, negating the efficiency gains

8 Gigabit Fibre Channel provides 800 Megabytes per second of data transfer. 16 Gigabit Fibre Channel (really 14 Gigabit Fibre Channel) provides 1600 Megabytes per second of data transfer. 800 + 800 = 1600.

Sorry Russ, 1+1 really does equal 2. Even in Fibre Channel.

micdrop

Top 5 Reasons The Evaluator Group Screwed Up

It’s been a while since the trainwreck of a “study” commissioned by Brocade and performed by The Evaluator Group,  but it’s still being discussed in various storage circles (and that’s not good news for Brocade). Some pretty much parroted the results, seemingly without reading the actual test. Then got all pissy when confronted about it.  I did a piece on my interpretations of the results, as did Dave Alexander of WWT and J Metz of Cisco. Our mutual conclusion can be best summed up with a single animated GIF.

 

bullshit

But since a bit of time has passed, I’ve had time to absorb Dave and J’s opinions, as well as others, I’ve come up with a list of the Top 5 Reasons by The Evaluator Group Screwed Up. This isn’t the complete list, of course, but some of the more glaring problems. Let’s start with #1:

Reason #1: I Have No Idea What I’m Doing

Their hilariously bad conclusion to the higher variance in response times and higher CPU usage was that it was the cause of the software initiators. Except, they didn’t use software initiators. The had actually configured hardware initiators, and didn’t know it. Let that sink in: They’re charged with performing an evaluation, without knowing what they’re doing.

The Cisco UCS VIC 1240 hardware CNA’s were utilized.  Referring to them as software initiators caused some confusion. The Cisco VIC is a hardware initiator and we configured them with virtual HBAs. Evaluator Group has no knowledge of the internal architecture of the VIC or its driver.  Our commentary of the possible cause for higher CPU utilization is our opinion and further analysis would be required to pinpoint the specific root cause.

Of course, it wasn’t the software initiator. They didn’t use a software initiator, but they were so clueless, they didn’t know they’d actually used a hardware initiator. Without knowing how they performed their tests (since they didn’t publish their methodology) it’s purely speculation, but it looks like the problem was caused by congestion (from them architecting the UCS solution incorrectly).

Reason #2: They’re Hilariously Bad At Math.

They claimed FCoE required 50% more cables, based on the fact that there were 50% more cables in the FCoE solution than the FC solution. Which makes sense… except that the FC system had zero Ethernet.

That’s right, in the HP/Fibre Channel solution, each blade had absolutely zero Ethernet connectivity. In the Cisco UCS solution, every blade had full Ethernet and Fibre Channel connectivity.  None. Zilch. Why did they do that? Probably because had they included any network connectivity to the HP system, the cable count would have shifted to FCoE’s favor.  Let me state this again, because it’s astonishingly stupid: They claimed FCoE (which included Ethernet and FC connectivity) required more cables without including any network connectivity for the HP/FC system. 

not_even_mad

Also, they made some power/cooling claims, despite the fact that the UCS solution didn’t require a separate FC switch (it’s capable of being a full-fledged Fibre Channel switch by itself), though the HP solution would have required a separate pair of Ethernet switches (which wasn’t included). So yeah, their math is a bit off. Had they done things, you know, correctly, the power, cooling, and cable count would have flipped in favor of FCoE.

Reason #3: UCS is Hard, You Guys!

They whinged about UCS being more difficult to setup. Anytime you’re dealing with unfamiliar technology, it’s natural that it’s going to be more difficult. However, they claimed that they had zero experience with HP as well (seriously, who at Brocade hired these guys?) How easy is UCS? Here is a video done from Amsterdam where a couple of Cisco techs added a new chassis and blade and had it booted up and running ESXi in less than 30 minutes from in the box to booted. Cisco UCS is different than other blade systems, but it’s also very easy (and very quick) to stand up. And keep in mind, the video I linked was done in Amsterdam, so they were probably baked   

Reason #4: It Contradicts Everyone Else’s Results (Especially those that know what they’re doing)

For the past couple of years, VMware and NetApp have been doing performance tests on various storage protocols. Here’s one from a few years ago, which includes (native) 4 and 8 Gbit Fibre Channel, 10 Gbit FCoE, 10 Gbit iSCSI, and 10 Gbit NFS. The conclusion? The protocol doesn’t much matter. They all came out about the same when normalized for bandwidth. The big difference is in the storage backend. At least they published their methodology (I’m looking at you, Evaluator Group). Here’s one from Demartek that shows a mixture of storage protocols saturating 10 Gbit Ethernet. Again, the limitation is only the link speed itself, not the protocol. And again, again, Demartek published their methodology.

Reason #5: How Did They Set Everything Up? Magic!

Most of the time with these commissioned reports, the details of how it’s configured are given so that the results can be reproduced and audited. How did the Evaluator Group set up their environment?

GOB.MAGIC_.GIF_

As far as I can tell, magic. There’s several things they could have easily gotten wrong with the UCS setup, and given their mistake about software/hardware initiators, quite likely. They didn’t even mention which storage vendor they used.

So there you have it. A bit of a re-hash, but hey, it was a dumb report. The upside though is that it did provide me with some entertainment.

Fibre Channel: The Heart of New SDN Solutions

From Juniper to Cisco to VMware, companies are spouting up new SDN solutions. Juniper’s Contrail, Cisco’s ACI, VMware’s NSX, and more are all vying to be the next generation of data center networking. What is surprising, however, is what’s at the heart of these new technologies.

Is it VXLAN, NVGRE, Openflow? Nope. It’s Fibre Channel.

Seriously.

If you think about it, it makes sense. Fibre Channel has been doing fabrics since before we ever called Ethernet fabrics, well, fabrics. And this isn’t the first time that Fibre Channel has shown up in unusual places. There’s a version of Fibre Channel that runs inside certain airplanes, including jet fighters like the F-22.

1_FW_F-22_Raptor_participates_in_Red_Flag

Keep the skies safe from FCoE (sponsored by the Evaluator Group)

New generation of switches have been capable of Data Center Bridging (DCB), which enables Fibre Channel over Ethernet. These chips are also capable of doing native Fibre Channel So rather than build complicated VPLS fabrics or routed networks, various data center switching companies are leveraging the inherent Fibre Channel capabilities of the merchant silicon and building Fibre Channel-based underlay networks to support an IP-based overlay.

Buffer-to-buffer (B2B) credit system and losslessness of Fibre Channel, plus the new 32/128 Gigabit interfaces with the newest Fibre Channel standard are all being leveraged for these underlays. I find it surprising that so many companies are adopting this, you’d think it’d be just Brocade. But Cisco, Arista (who notoriously shunned FCoE) and Juniper are all on board with new or announced SDN offerings that are based mostly or in part on Fibre Channel.

However, most of the switches from various vendors are primarily Ethernet today, so the 10/40 Gigabit interfaces can run FCoE until more switches are available with native FC interfaces. Of course, these switches will still be required to have a number of native Ethernet ports in order to connect to border networks that aren’t part of the overlay network, so there will be still a need for Ethernet. But it seems the market has spoken, and they want Fibre Channel.

 

Changing Data Center Workloads

Networking-wise, I’ve spent my career in the data center. I’m pursuing the CCIE Data Center. I study virtualization, storage, and DC networking. Right now, the landscape in the network is constantly changing, as it has been for the past 15 years. However, with SDN, merchant silicon, overlay networks, and more, the rate of change in a data center network seems to be accelerating.

speed

Things are changing fast in data center networking. You get the picture

Whenever you have a high rate of change, you’ll end up with a lot of questions such as:

  • Where does this leave the current equipment I’ve got now?
  • Would SDN solve any of the issues I’m having?
  • What the hell is SDN, anyway?
  • I’m buying vendor X, should I look into vendor Y?
  • What features should I be looking for in a data center networking device?

I’m not actually going to answer any of these questions in this article. I am, however, going to profile some of the common workloads that you find in data centers currently. Your data center may have one, a few, or all of these workloads. It may not have any of them. Your data center may have one of the workloads listed, but my description and/or requirements is way off. All certainly possible. These are generalizations, and with all generalizations your mileage may vary. With that disclaimer out of the way, strap in. Let’s go for a ride.

Traditional Virtualization

It’s interesting to say that something which only exploded into the data center in a big way in about 2008 as now being “traditional”, rather than “new-fangled”. But that’s the situation we have here. Traditional virtualization workload  is centered primarily around VMware vSphere. There are other traditional virtualization products of course, such as Red Hat’s RHEV, Xen, and Microsoft Hyper-V, but VMware has the largest market share for this by far.

  • Latency is not a huge concern (30 usecs not a big deal)
  • Layer 2 adjacencies are mandatory (required for vMotion)
  • Large Layer 2 domains (thousands of hosts layer 2 adjacent)
  • Converged infrastructures (storage and data running on the same wires, FCoE, iSCSI, NFS, SMB3, etc.)
  • Buffer requirements aren’t typically super high. Bursting isn’t much of an issue for most workloads of this type.
  • Fibre Channel is often the storage protocol of choice, along with NFS and some iSCSI as well

Cisco has been especially successful in this realm with the Nexus line because of vPC, FabricPath, OTV and (to a much lesser extent) LISP, as they address some of the challenges with workload mobility (though not all of them, such as the speed of light). Arista, Juniper, and many others also compete in this particular realm, but Cisco is the market leader.

With the multi-pathing Layer 2 technologies such as SPB, TRILL, Cisco FabricPath, and Brocade VCS (the latter two are based on TRILL), you can build multi-spine leaf/spine networks/CLOS networks that you can’t with spanning-tree based networks, even with MLAG.

This type of network is what I typically see in data centers today. However, there is a shift towards Layer 3 networks and cloud workloads over traditional virtualization, so it will be interested to see how long traditional virtualization lasts.

VDI

VDI (Virtual Desktops) are a workload with the exact same requirements as traditional virtualization, with one main difference: The storage requirements are much, much higher.

  • Latency is not as important (most DC-grade switches would qualify), especially since latency is measured in milliseconds for remote desktop users
  • Layer 2 adjacencies are mandatory (required for vMotion)
  • Large Layer 2 domains
  • Converged infrastructures
  • Buffer requirements aren’t typically very high
  • High-end storage backends. All about the IOPs, y’all

For storage here, IOPs are the biggest concern. VDI eats IOPs like candy.

Legacy Workloads

This is the old, old school. And by old school, I mean late 90s, early 2000s. Before virtualization changed the landscape. There’s still quite a few crusty old servers, with uptimes measured in years, running long-abandoned applications. The problem is, these types of applications are usually running something mission critical and/or significant revenue generating. Organizations just haven’t found a way out of it yet. And hey, they’re working right now. Often running on proprietary Unix systems, they couldn’t or wouldn’t be migrated to a virtualized environment (where it would be much easier to deal with).

The hardware still works, so why change something that works? Because it would be tough to find more. It’s also probably out of vendor-supported service.

  • Latency? Who cares. Is it less than 1 second? Good enough.
  • Layer 2 adjacencies, if even required, are typically very small, typically just needed for the local clustering application (which is usually just stink-out-loud awful)
  • 100 megabit and gigabit Ethernet typically. 10 Gigabit? That’s science-fiction talk!
  • Buffers? You mean like, what shines the floor?

buffers

My own personal opinion is that this is the only place where Cisco Catalyst switches belong in a data center, and even then only because they’re already there. If you’re going with Cisco, I think everything else (and everything new) in the DC should be Nexus.

Cloud Workloads (Private Cloud)

If you look at a cloud workload, it looks very similar to the previous traditional virtualization workload. They both use VMs sitting on top of hypervisors. They both have underlying infrastructure of compute, network, and storage to support these VMs. The difference is primarily is in the operational model.

It’s often described as the difference between pets and cattle. With traditional virtualization, you have pets. You care what happens to these VMs. They have HA and DRS and other technologies to care for them. They’re given clever names, like Bart and Lisa, or Happy and Sleepy. With cloud VMs, they’re not given fun names. We don’t do vMotion/Live Migration with them. When we need them, they’re spun up. When they’re not, they’re destroyed. We don’t back them up, we don’t care if the host they reside on dies so long as there are other hosts carrying the workload. The workload is automatically sharded across the available hosts using logic in the application. Instead of backups, templates are used to create new VMs when the workload increases. And when the workload decreases, some of the VMs get destroyed. State is not kept on any single VM, instead the state of the application (and underlying database) is sharded to the available systems.

This is very different than traditional virtualization. Because the workload distribution is handled with the application, we don’t need to do vMotion and thus have Layer 2 adjacencies. This makes it much more flexible for the network architects to put together network to support this type of workload. Storage with this type of workload also tends to be IP-based (NFS, iSCSI) rather than FC-based (native Fibre Channel or FCoE).

With cloud-based workloads, there’s also a huge self-service component. VMs are spun-up and managed by developers or end-users, rather than the IT staff. There’s typically some type of portal that end-users can use to spin up/down resources. Chargebacks are also a component, so that even in a private cloud setting, there’s a resource cost associated and can be tracked.

OpenStack is a popular choice for these cloud workloads, as is Amazon and Windows Azure. The former is a private cloud, with the later two being public cloud.

  • Latency requirements are mostly the same as traditional virtualization
  • Because vMotion isn’t required, it’s all Layer 3, all the time
  • Storage is mostly IP-based, running on the same network infrastructure (not as much Fibre Channel)
  • Buffer requirements are typically the same as traditional virtualization
  • VXLAN/NVGRE burned into the chips for SDN/Overlays

You can use much cheaper switches for this type of network, since the advanced Layer 2 features (OTV, FabricPath, SPB/TRILL, VCS) aren’t needed. You can build a very simple Layer 3 mesh using inexpensive and lower power 10/40/100 Gbit ports.

However, features such as VXLAN/NVGRE encap/decap is increasingly important. The new Trident2 chips from Broadcom support this now, and several vendors, including Cisco, Juniper, and Arista all have switches based on this new SoC (switch-on-chip) from Broadcom.

High Frequency Trading

This is a very specialized market, and one that has very specialized requirements.

  • Latency is of the utmost concern. To the point of making sure ports are on the same ASIC. Latency is measured in nano-seconds, microseconds are an eternity
  • 10 Gbit at the very least
  • Money is typically not a concern
  • Over-subscription is non-existent (again, money no concern)
  • Buffers are a trade off, they can increase latency but also prevent packet loss

This is a very niche market, one that Arista dominates. Cisco and a few other vendors have small inroads here, using the same merchant silicon that Arista uses, however Arista has had huge experience in this market. Every tick of the clock can mean hundreds of thousands of dollars in a single trade, so companies have no problem throwing huge amounts of money at this issue to shave every last nanosecond off of latency.

Hadoop/Big Data

  • Latency is of high concern
  • Large buffers are critical
  • Over-subscription is low
  • Layer 2 adjacency is neither required nor desired
  • Layer 3 Leaf/spine networks
  • Storage is distributed, sharded over IP

Arista has also extremely successful in this market. They glue PC RAM onto their switch boards to provide huge buffers (around 760 MB) to each port, so it can absorb quite a bit of bursty traffic, which occurs a lot in these types of setups. That’s about .6 seconds of buffering a 10 Gbit link. Huge buffers will not prevent congestion, but they do help absorb situations where you might be overwhelmed for a short period of time.

Since nodes don’t need to be Layer 2 adjacent, simple Layer 3 ECMP networks can be created using inexpensive and basic switches. You don’t need features like FabricPath, TRILL, SPB, OTV. Just fast, inexpensive, low power ports. 10 Gigabit is the bare minimum for these networks, with 40 and 100 Gbit used for connectivity to the spines. Arista (especially with their 7500E platform) does very well in this area. Cisco is moving into this area with the Nexus 9000 line, which was announced late last year.

 

Conclusions

Understanding the requirements for the various workloads may help you determine the right switches for you. It’s interesting to see how quickly the market is changing. Perhaps 2 years ago, the large-Layer 2 networks seemed like the immediate future. Then all of a sudden Layer 3 mesh networks became popular again. Then you’ve got SDN like VMware’s NSX and Cisco’s ACI on top of that. Interesting times, man. Interesting times.

Jumbo Fibre Channel Frames

In the world of Ethernet, jumbo frames (technically any Ethernet frame larger than 1,500 bytes) is often a recommendation for certain workloads, such as iSCSI, vMotion, backups, basically anything that doesn’t communicate with the Internet because of MTU issues. And in fact, MTU issues is one of the biggest hurdles with Ethernet jumbo frames, since every device on a given LAN must have the correct MTU size set if you want them to successfully communicate with each other.

What’s less known is that you can also enable jumbo frames for Fibre Channel as well, and that doing so can have dramatic benefits with certain workloads. Best of all, since SANs are typically smaller in scope, ensuring every device has jumbo frames enabled is much easier.

Fibre Channel Frames

Fibre Channel frames typically have a maximum payload size of 2112, and with with headers makes the MTU 2148 bytes. However you can increase the payload up to 9000 bytes, and with headers that 9036 bytes. You can play with the payload size if you like, but just to make it easier I either do regular MTU (2148 bytes) or I do the max for most switches (9036). My perfunctory tests don’t seem to indicate much benefit with anything in between.

Supported

Of course, not all Fibre Channel devices can handle jumbo frames. It’s part of the T10 standard though since 2005, so most modern devices (anything capable of 4 Gbit or more, for the most part). This includes switches from Cisco and Brocade, as well as HBAs from Qlogic and Emulex.

Configuration

The way you configure jumbo frames of course varies, and of course I don’t have lots of different types of equipment, so I’m just going to demonstrate on an MDS that I have access to.

Note: Do not do anything in a production environment until you’ve tested it in a dev environment. If you do otherwise, you’re an idiot.

If you go into an interface configuration, you’ll probably notice there’s no MTU option. That’s because we’re dealing with a fabric here, and the MTU is set per VSAN, not per port. Multi-VSAN ISLs (TE_Ports) will do multiple MTUs without any problem. Keep in mind, changing MTUs on an MDS/Nexus requires a reboot (I think mostly because the N_Ports need to do a new FLOGI). I’m not sure about Brocade, however.

Switch#1# config t 
Enter configuration commands, one per line.  End with CNTL/Z. 
Switch#1(config)# mtu size vsan 10 9036
VSAN 10 MTU changed. Reboot required.
Switch#1(config)# mtu size vsan 20 9036
VSAN 20 MTU changed. Reboot required.
Switch#1(config)# exit 
Switch#1# reload

Right now the maximum MTU size per the standard is 9036 bytes, though it depends on the vendor. Brocade, with it’s Gen 6 FC (32 Gbit) is proposing a maximum MTU size of around 15,000 bytes, though it’s not set in stone yet. It’ll be interesting to see how this all plays out, that’s for sure.

Testing

How well does it work? I haven’t tested it for VDI or otherwise high-IOPs environments, but my suspicion is that it’s not going to help. Given the longer serialization, it would actually probably hurt performance. For other workloads, such as backups, they can make a dramatic difference. As a test, I did a couple of vSphere Storage vMotions, and I was surprised to see how fast it went.

Hosts were B200 M2 blades, with 96 GByte of RAM running EXI 5.1.  Fibre Channel cards were Cisco M81KR, configured for 9036 byte frames (9,100 bytes on the FCoE interface since VICs are CNAs, so had to add several dozen bytes for FCoE headers/VNTAG, etc.). The FC switch was actually a Cisco 6248 Fabric Interconnect set for Fibre Channel switching mode. All tests were done on the A fabric. I initiated Storage vMotions on a 2 TB VMDK file. The VMDK wasn’t receiving an active workload, so it was mostly idle. I ran the test 20 times for each size, and averaged the results.

svmotion

It’s just a quick and dirty test on a barely-active VMDK, but you can see that with 9000 Byte FC frames, it’s almost twice as fast.

Ethernet Congestion: Drop It or Pause It

Congestion happens. You try to put a 10 pound (soy-based vegan) ham in a 5 pound bag, it just ain’t gonna work. And in the topsy-turvey world of data center switches, what do we do to mitigate congestion? Most of the time, the answer can be found in the wisdom of Snoop Dogg/Lion.

dropitlikephraell

Of course, when things are fine, the world of Ethernet is live and let live.

everythingisfine

We’re fine. We’re all fine here now, thank you. How are you?

But when push comes to shove, frames get dropped. Either the buffer fills up and tail drop occurs, or QoS is configured and something like WRED (Weight Random Early Detection) kicks in to proactively drop frames before taildrop can occur (mostly to keep TCP’s behavior from causing spiky behavior).

buffertaildrop

The Bit Grim Reaper is way better than leaky buckets

Most congestion remediation methods involve one or more types of dropping frames. The various protocols running on top of Ethernet such as IP, TCP/UDP, as well as higher level protocols, were written with this lossfull nature in mind. Protocols like TCP have retransmission and flow control, and higher level protocols that employ UDP (such as voice) have other ways of dealing with the plumbing gets stopped-up. But dropping it like it’s hot isn’t the only way to handle congestion in Ethernet:

stophammertime

Please Hammer, Don’t PAUSE ‘Em

Ethernet has the ability to employ flow control on physical interfaces, so that when congestion is about to occur, the receiving port can signal to the sending port to stop sending for a period of time. This is referred to simply as 802.3x Ethernet flow control, or as I like to call it, old-timey flow control, as it’s been in Ethernet since about 1997. When a receive buffer is close to being full, the receiving side will send a PAUSE frame to the sending side.

PAUSEHAMMERTIME

Too legit to drop

A wide variety of Ethernet devices support old-timey flow control, everything from data center switches to the USB dongle for my MacBook Air.

Screen Shot 2013-02-01 at 6.04.06 PM

One of the drawbacks of old-timey flow control is that it pauses all traffic, regardless of any QoS considerations. This creates a condition referred to as HoL (Head of Line) blocking, and can cause higher priority (and latency sensitive) traffic to get delayed on account of lower priority traffic. To address this, a new type of flow control was created called 802.1Qbb PFC (Priority Flow Control).

PFC allows a receiving port send PAUSE frames that only affect specific CoS lanes (0 through 7). Part of the 802.1Q standard is a 3-bit field that represents the Class of Service, giving us a total of 8 classes of service, though two are traditionally reserved for control plane traffic so we have six to play with (which, by the way, is a lot simpler than the 6-bit DSCP field in IP). Utilizing PFC, some CoS values can be made lossless, while others are lossfull.

Why would you want to pause traffic instead of drop traffic when congestion occurs?

Much of the IP traffic that traverses our data centers is OK with a bit of loss. It’s expected. Any protocol will have its performance degraded if packet loss is severe, but most traffic can take a bit of loss. And it’s not like pausing traffic will magically make congestion go away.

But there is some traffic that can benefit from losslessness, and and that just flat out requires it. FCoE (Fibre Channel of Ethernet), a favorite topic of mine, requires losslessness to operate. Fibre Channel is inherently a lossless protocol (by use of B2B or Buffer to Buffer credits), since the primary payload for a FC frame is SCSI. SCSI does not handle loss very well, so FC was engineered to be lossless. As such, priority flow control is one of the (several) requirements for a switch to be able to forward FCoE frames.

iSCSI is also a protocol that can benefit from pause congestion handling rather than dropping. Instead of encapsulating SCSI into FC frames, iSCSI encapsulates SCSI into TCP segments. This means that if a TCP segment is lost, it will be retransmitted. So at first glance it would seem that iSCSI can handle loss fine.

From a performance perspective, TCP suffers mightily when a segment is lost because of TCP congestion management techniques. When a segment is lost, TCP backs off on its transmission rate (specifically the number of segments in flight without acknowledgement), and then ramps back up again. By making the iSCSI traffic lossless, packets will be slowed down during congestions but the TCP congestion algorithm wouldn’t be used. As a result, many iSCSI vendors recommend turning on old-timey flow control to keep packet loss to a minimum.

However, many switches today can’t actually do full losslessness. Take the venerable Catalyst 6500. It’s a switch that would be very common in data centers, and it is a frame murdering machine.

The problem is that while the Catalyst 6500 supports old-timey flow control (it doesn’t support PFC) on physical ports, there’s no mechanism that I’m aware of to prevent buffer overruns from one port to another inside the switch. Take the example of two ingress Gigabit Ethernet ports sending traffic to a single egress Gigabit Ethernet port. Both ingress ports are running at line rate. There’s no signaling (at least that I’m aware of, could be wrong) that would prevent the egress ports from overwhelming the transmit buffer of the ingress port.

congestion

Many frames enter, not all leave

This is like flying to Hawaii and not reserving a hotel room before you get on the plane. You could land and have no place to stay. Because there’s no way to ensure losslessness on a Catalyst 6500 (or many other types of switches from various vendors), the Catalyst 6500 is like Thunderdome. Many frames enter, not all leave.

thunderdome

Catalyst 6500 shown with a Sup2T

The new generation of DCB (Data Center Bridging) switches, however, use a concept known as VoQ (Virtual Output Queues). With VoQs, the ingress port will not send a frame to the egress port unless there’s room. If there isn’t room, the frame will stay in the ingress buffer until there’s room.If the ingress buffer is full, it can have signaled the sending port it’s connected to to PAUSE (either old-timey pause or PFC).

This is a technique that’s been in used in Fibre Channel switches from both Brocade and Cisco (as well as others) for a while now, and is now making its way into DCB Ethernet switches from various vendors. Cisco’s Nexus line, for example, make use of VoQs, and so do Brocade’s VCS switches. Some type of lossless ability between internal ports is required in order to be a DCB switch, since FCoE requires losslessness.

DCB switches require lossless backplanes/internal fabrics, support for PFC, ETS (Enhanced Transmission Selection, a way to reserve bandwidth on various CoS lanes), and DCBx (a way to communicate these capabilities to adjacent switches). This makes them capable of a lot of cool stuff that non-DCB switches can’t do, such as losslessness.

One thing to keep in mind, however, is when Layer 3 comes into play. My guess is that even in a DCB switch that can do Layer 3, losslessness can’t be extended beyond a Layer 2 boundary. That’s not an issue with FCoE, since it’s only Layer 2, but iSCSI can be routed.

A Tale of Two FCoEs

A favorite topic of discussion among the data center infrastructure crowd is the state of FCoE. Depending on who you ask, FCoE is dead, stillborn, or thriving.

So, which is it? Are we dealing with FUD or are we dealing with vendor hype?  Is FCoE a success, or is it a failure? The quick answer is.. yes? FCoE is both thriving, and yet-to-launch. So… are we dealing with Schrödinger’s protocol?

Note quite. To understand the answer, it’s important to make to make the distinction with two very different ways that FCoE is implemented: Edge FCoE and Multi-hop FCoE (a subject I’ve written about before, although I’ve renamed things a bit).

Edge FCoE

Edge FCoE is thriving, and has been for the past few years. Edge FCoE is when you take a server (or sometimes a storage array), connect it to an FCoE switch. And everything beyond that first switch is either native Fibre Channel or native Ethernet.

Edge FCoE is distinct from Multi-hop for one main reason: It’s a helluva lot easier to pull off than multi-hop FCoE. With edge-FCoE, the only switch that needs to understand FCoE  is that edge FCoE switch. They plug into traditional Fibre Channel networks over traditional Fibre Channel links (typically with NPV mode).

Essentially, no other part of your network needs to do anything you haven’t done already. You do traditional Ethernet, and traditional Fibre Channel. FCoE only exists in that first switch, and is invisible to the rest of your LAN and SAN.

Here are the things you (for the most part) don’t have to worry about configuring on your network with Edge FCoE:

  • Data Center Bridging (DCB) technologies
    • Priority Flow Control (PFC) which enables lossless Ethernet
    • Enhanced Transmission Selection (ETS) allowing the ability to dedicate bandwidth to various traffic (not required but recommended -Ivan Pepelnjak)
    • DCBx: A method to communicate DCB functionality between switches over LLDP (oh, hey, you do PFC? Me too!)
  • Whether or not your aggregation and core switches support FCoE (they probably don’t, or at least the line cards don’t)

There is PFC and DCBx in the server-to-edge FCoE link, but it’s typically inherint, and supported by the CNA and the edge-FCoE switch and turned on by default or auto-detected. In some implementations, there’s nothing to configure. PFC is there, and un-alterable. Even if there are some settings to tweak, it’s generally easier to do it on edge ports than on a aggregation/core network.

Edge FCoE is the vast majority of how FCoE is implemented today. Everyone from Cisco’s UCS to HP C7000 series can do it, and do it well.

Multi-Hop

The very term multi-hop FCoE is controversial in nature (just check the comments section of my terminology FCoE article), but for the sake of this article, multi-hop FCoE is any topological implmentation of FCoE where FCoE frames move around a converged network beyond a single switch.

Multi-hop FCoE requires a couple of things: It requires a Fibre Channel-aware network, losslessness through priority flow control (PFC), DCBx (Data Center Bridging Exchange), enhanced transmission selection (ETS),  and you’ve got a recipe for a switch that I’m pretty sure ain’t in your rack right now. For instance, the old man of the data center, the Cisco Catalyst 6500, doesn’t now, and will likely never do FCoE.

Switch-wise, there are two types of ways to do multi-hop FCoE: A switch can either forward FCoE frames based on the Ethernet headers (MAC address source/destination), or you can forward frames based on the Fibre Channel headers (FCID source/destionation).

Ethernet-forwarded/Pass-through Multi-hop

If you build a multi-hop network with switches that forward based on Ethernet headers (as Juniper and Brocade do), then you’ll want something other than spanning-tree to do loop prevention and enable multi-pathing. Brocade uses a method based on TRILL, and Juniper uses their proprietary QFabric (based on unicorn tears).

Ethernet-forwaded FCoE switches don’t have a full Fibre Channel stack, so they’re unaware of what goes on in the Fibre Channel world, such as zoning and with the exception of the FIP (FCoE Initiation Protocol), which handles discovery of attached Fibre Channel devices (connecting virtual N_Ports to virtual F_Ports).

FC-Forwarded/Dual-stack Multi-hop

If you build a multi-hop network with switches that forward based on Fibre Channel headers, your FCoE switch needs to have both a full DCB-enabled Ethernet stack, and a full Fibre Channel stack. This is the way Cisco does it on their Nexus 5000s, Nexus 7000s, and MDS 9000 (with FCoE line cards), although the Nexus 4000 blade switch is the Ethernet-forwarded kind of switch.

The benefit of using a FC-Forwarded switch is that you don’t need a network that does TRILL or anything fancier than spanning-tree (spanning-tree isn’t enabled on any VLAN that passes FCoE). It’s pretty much a Fibre Channel network, with the ports being Ethernet instead of Fibre Channel. In fact, in Cisco’s FCoE reference design, storage and networking traffic are still port-gaped (a subject of a future blog post). FCoE frames and regular networking frames don’t run over the same links, there are dedicated FCoE links.

It’s like running a Fibre Channel SAN that just happens to sit on top of your Ethernet network. As Victor Moreno the LISP project manager at Cisco says: “The only way is to overlay”.

State of FCoE

It’s not accurate to say that FCoE is dead, or that FCoE is a success, or anything in between really, because the answer is very different once you separate multi-hop and edge-FCoE.

Currently, multi-hop has yet to launch in a significant way. In the past 2 months, I have heard rumors of a customer here or there implementing it, but I’ve yet to hear any confirmed reports or first hand tales. I haven’t even configured it personally. I’m not sure I’m quite as wary as Greg Ferro is, but I do agree with his wariness. It’s new, it’s not widely deployed, and that makes it riskier. There are interopability issues, which in some ways are obviated by the fact no one is doing Ethernet fabrics in a multi-vendor way, and NPV/NPIV can help keep things “native”. But historically, Fibre Channel vendors haven’t played well together. Stephen Foskett lists interopability among his reasonable concerns with FCoE multi-hop. (Greg, Stephen, and everyone else I know are totally fine with edge FCoE.)

Edge-FCoE is of course vibrant and thriving. I’ve configured it personally, and it works easily and seamlessly into an existing FC/Ethernet network. I have no qualms about deploying it, and anyone doing convergence should at least consider it.

Crystal Ball

In terms of networking and storage, it’s impossible to tell what the future will hold. There are a number of different directions FCoE, iSCSI, NFS, DCB, Ethernet Fabrics, et all could go. FCoE could end up replacing Fibre Channel entirely, or it could be relegated to the edge and never move from there. Another possibility as suggested to me by Stephen Foskett is that Ethernet will become the connection standard for Fibre Channel devices. They would still be called Fibre Channel switches, and SANs setup just like they always have been, but instead of having 8/16/32 Gbit FC ports, they’d have 10/40/100 Gbit Ethernet ports. To paraphrase Bob Metcalfe, “I don’t know what will come after Fibre Channel, but it will be called Ethernet”.

The Problem

One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.

It was a theme because, as vendor after vendor gave a presentation, they essentially said the same thing when describing the problem they were going to solve. For us the delegates/bloggers, it quickly went from the problem to “The Problem”. We’d heard it over and over again so often that during the (5th?) iteration of the same problem we all started laughing like a group of Beavis and Butt-Heads during a vendor’s presentation, and we had to apologize profusely (it wasn’t their fault, after all).

Huh huhuhuhuhuh… he said “scalability issues”

In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.

Hello my name is Simon, and I like to do draw-wrings

But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.

So what is “The Problem”? It’s multi-faceted and interrelated set of issues:

Virtualization Has Messed Things Up, Big Time

The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact that it’s had on the network is arguably orders of magnitude greater. Virtualization wants big, flat networks, just when we got to the point where we could route Layer 3 as fast as we could switch Layer 2. We’d just gotten to the point where we could get our networks small.

And it’s not just virtualization in general, much of its impact is the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to get it done. Add to the the vSwitch sitting inside the hypervisor, and the limited functionality of that switch (and who the hell manages it anyway? Server team? Network team?)

4000 VLANs Ain’t Enough

If you’re a single enterprise running your own network, chances are 4000+ VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, 4000+ VLANs quickly becomes a problem. There is a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 VLANs (16 million or so).

Spanning Tree Sucks

One of my first introductions to networking was accidentally causing a bridging loop on a 10 megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I’d accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself with plugging in and unplugging the port to watch the utilization LED flucutate (that is, until the network admin stormed in and asked what the hell was going on with his network).

And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.

OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.

Layer 2 Switching Isn’t Scaling

The current method by which MAC addresses are learned in modern switches causes two problems: Only one viable path can be allowed at a time (only way to prevent loops is to prevent multiple paths by blocking ports), and large Layer 2 networks involve so many MAC addresses that it doesn’t scale.

From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another tuple (another recurring word) for a frame/packet/segment traveling from one end of your datacenter to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.

There’s A Lot of Shit To Configure

Automation is coming, and in a big way. Whether it’s a centralized controller environment, or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcomed, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).

Data center automation is a little bit like the Gom Jabbar. They tried and failed you ask? They tried and died.

“What’s in the box?”

“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”

Ethernet Rules Everything Around Me

It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”

But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.

Oof. Storage.

Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture reference in my blog posts. Some companies are pro-iSCSI, others pro FC/NFS, and some like Greg Ferro have the purest of all hate: He hates SCSI.

“Yo iSCSI, I’m really happy for you and imma let you finish, but Fibre Channel is the best storage protocol of all time”

So that’s “The Problem”. And for the most part, the articles on Networking Field Day, and the solutions the vendors propose will be framed around The Problem.

Your Momma Is So Proprietary

Let’s talk about a very sensitive subject for both networking admins and networking vendors: The subject of proprietary technologies.

The word proprietary in most cases has a very negative connotation. Most network designers would prefer that everything be based on open standards, like OSPF and (shudder) Spanning Tree. After all, IP and Ethernet are open standards, and those along with many other open standard technologies, make the Internet and industry what it is today. But at the same time, we can be a bit hypocritical, in that we also tend to want awesome features that are often on the propriety side.

Conversely, most network vendors would love to come up with the Kernel’s Secret Recipe that makes their stuff so awesome, that no sane engineer would dare use anything else. But they also like to say they’re open, in order to allay fears that a customer might have of being “locked in”. So when vendors go after customers, you’ll hear “open” a lot. When vendors go after each other, you hear “proprietary” thrown about as an epithet. And when a vendor is accused of being proprietary, they often lash out into an epic battle of “your momma is so proprietary”.

Proprietary Bad!

So last week there was a discussion on Twitter  between former Cisco employee and new Dell Force10 employee Brad Hedlund (@bradhedlund), and former Cisco employee and new Juniper employee Chistopher Hoff (@beaker). (By the way, they are both people I admire and respect.)

I believe they were talking about the different approaches their respective companies were taking solve the evolving needs of modern data centers. Juniper’s solution is QFabric, while Dell Force 10 is going the NVGRE/VXLAN/OpenFlow route.  Brad cited QFabric as proprietary, and Christopher Hoff countered that Cisco’s FEX is also proprietary. And while true, something about that bothered me a bit.

QFabric and FEX are both proprietary, but the effect of the proprietary is very different. With QFabric, you can build a huge network fabric, without worrying about spanning-tree, and have one control plane for a whole mesh of switches. With FEX, you can plug what looks like a switch into a Nexus 5000 or Nexus 7000, and that switch looks like a line card on the 5000/7000. FEX affects the next hop. QFabric can affect your entire data center.

FEX is pretty limited, and honestly I think it’s fairly inconsequential in terms of its proprietariness. You can use FEX, or just hang another Cisco switch off a 5K/7K, like a Nexus 3000 (with its merchant silicon) or even an Arista or Juniper box. Even if you use FEX, the effect is limited to one switch hop away. How concerned would a designer be about the effect of proprietary FEX? Pretty much it would have little effect.

The effect of QFabric, however, is potentially far more wide ranging.

That’s no moon, that’s a data center fabric

From the Packet Pushers episode (episode 51) on QFabric, Abner Germanow talks about 500 10 Gigabit Ethernet port where QFabric makes sense, which is a pretty large investment. If you figure roughly $2,000 a port, that would make it a $1,000,000 decision. If you order enough FEX/Nexus switches, you can spend that much, but you can go step by step and back out if you want.

With the proprietary versus open debate, it’s quite understandable that Juniper is very sensitive to the word “proprietary”. However, it’s tough to classify QFabric as anything but, as Ivan Pepelnjak says, “completely proprietary“.

Right now there are several open standards, such as TRILL, SPB, OpenFlow, VXLAN, FCoE, NVGRE and others looking to solve many of the same data center problems that QFabric looks to solve. And from the looks of it Juniper has been rather dismissive of some of the open fabric standard technologies, such as the much discussedWhy TRILL Won’t Work For The Data Center” argument (requires registration, fuck you TechTarget). Juniper is also taking a wait-and-see approach to VXLAN.

Even so, I don’t think Juniper should care if people call it proprietary. Yes, it’s proprietary. And yes, the effect of this proprietary-ness is huge compared to Cisco’s FEX because it affects more of the data center. But that’s a good thing.

Right now, because these open standards are mostly brand-spanking new, and no one is bat-shit crazy enough to build a multi-vendor fabric based on these new standards.

OK, maybe there is someone is crazy enough to build a multi-vendor Ethernet fabric

So QFabric has the advantage there, since even open standards are likely to be vendor-locked for now. And QFabric is a bit more mature than most of the new standards, in that it’s at least impelemtend and released. (Despite the terrible, and I mean just awful PR move bashing Juniper. Seriously, Cisco, that shit reeks sophomoric desperation. I feel cheep even linking it.)

What we do have to consider, however, is that in time the interoperability and maturity situation will be different, as it is for mature open standards today. It’s very common to have multi-vendor 802.1Q, OSPF, IS-IS, BGP, and spanning-tree deployments, without thinking twice about it. There will likely be a day when whatever new standards we’re dealing with now succeed and evolve to the point where we wouldn’t think twice about building say a TRILL fabric with multiple vendors like we do now with spanning-tree.

So QFabric is proprietary, and is not going to play well with others. That doesn’t discount it as a solution, but it is a serious consideration, more so than something like proprietary FEX. Proprietary has its advantages, and disadvantages, and the effect can be substantial or inconsequential, all factors to consider. I won’t even hazard a guess at this point as to how it’s going to play out, but like a good twitter battle, I’m going to enjoy watching.