Ethernet Congestion: Drop It or Pause It

Congestion happens. You try to put a 10 pound (soy-based vegan) ham in a 5 pound bag, it just ain't gonna work. And in the topsy-turvy world of data center switches, what do we do to mitigate congestion? Most of the time, the answer can be found in the wisdom of Snoop Dogg/Lion.


Of course, when things are fine, the world of Ethernet is live and let live.


We’re fine. We’re all fine here now, thank you. How are you?

But when push comes to shove, frames get dropped. Either the buffer fills up and tail drop occurs, or QoS is configured and something like WRED (Weighted Random Early Detection) kicks in to proactively drop frames before tail drop can occur (mostly to keep TCP's synchronized backoff from causing spiky throughput).
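
Just to illustrate the idea, here's a rough Python sketch of the WRED logic. The thresholds and maximum drop probability are made up for illustration and aren't any particular platform's defaults.

    import random

    # Toy WRED: drop probability ramps up as the average queue depth grows,
    # so flows get trimmed before the buffer hard-fills and tail drop kicks in.
    MIN_THRESH = 40      # below this average depth, never drop
    MAX_THRESH = 80      # at or above this, it's effectively tail drop
    MAX_DROP_P = 0.10    # drop probability right at MAX_THRESH

    def wred_drop(avg_queue_depth):
        """Return True if this frame should be proactively dropped."""
        if avg_queue_depth < MIN_THRESH:
            return False
        if avg_queue_depth >= MAX_THRESH:
            return True
        ramp = (avg_queue_depth - MIN_THRESH) / (MAX_THRESH - MIN_THRESH)
        return random.random() < ramp * MAX_DROP_P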


The Bit Grim Reaper is way better than leaky buckets

Most congestion remediation methods involve dropping frames in one way or another. The various protocols running on top of Ethernet, such as IP and TCP/UDP, as well as higher-level protocols, were written with this lossy nature in mind. Protocols like TCP have retransmission and flow control, and higher-level protocols that employ UDP (such as voice) have other ways of dealing with the plumbing getting stopped up. But dropping it like it's hot isn't the only way to handle congestion in Ethernet:


Please Hammer, Don’t PAUSE ‘Em

Ethernet has the ability to employ flow control on physical interfaces, so that when congestion is about to occur, the receiving port can signal to the sending port to stop sending for a period of time. This is referred to simply as 802.3x Ethernet flow control, or as I like to call it, old-timey flow control, as it’s been in Ethernet since about 1997. When a receive buffer is close to being full, the receiving side will send a PAUSE frame to the sending side.
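
To make the mechanics concrete, here's a toy Python model of the receiving side. The watermarks and quanta values are arbitrary, and send_pause() is a placeholder for whatever actually puts the MAC control frame on the wire; this is a sketch of the behavior, not an implementation of 802.3x.

    # Toy 802.3x receiver: cross a high watermark, tell the peer to stop;
    # drain below a low watermark, send a zero-quanta PAUSE to resume.
    HIGH_WATER = 900       # buffered frames before we yell "stop"
    LOW_WATER = 300        # buffered frames before we say "go again"
    MAX_QUANTA = 0xFFFF    # pause time, in units of 512 bit times

    class ReceivePort:
        def __init__(self, link):
            self.buffer = []
            self.link = link           # link.send_pause() is a placeholder
            self.peer_paused = False

        def on_frame(self, frame):
            self.buffer.append(frame)
            if len(self.buffer) >= HIGH_WATER and not self.peer_paused:
                self.link.send_pause(MAX_QUANTA)   # "stop sending"
                self.peer_paused = True

        def dequeue(self):
            frame = self.buffer.pop(0) if self.buffer else None
            if self.peer_paused and len(self.buffer) <= LOW_WATER:
                self.link.send_pause(0)            # "never mind, resume"
                self.peer_paused = False
            return frame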


Too legit to drop

A wide variety of Ethernet devices support old-timey flow control, everything from data center switches to the USB dongle for my MacBook Air.


One of the drawbacks of old-timey flow control is that it pauses all traffic, regardless of any QoS considerations. This creates a condition referred to as HoL (Head of Line) blocking, and can cause higher priority (and latency sensitive) traffic to get delayed on account of lower priority traffic. To address this, a new type of flow control was created called 802.1Qbb PFC (Priority Flow Control).

PFC allows a receiving port to send PAUSE frames that only affect specific CoS lanes (0 through 7). Part of the 802.1Q standard is a 3-bit field that represents the Class of Service, giving us a total of 8 classes of service, though two are traditionally reserved for control plane traffic, so we have six to play with (which, by the way, is a lot simpler than the 6-bit DSCP field in IP). Utilizing PFC, some CoS values can be made lossless, while others remain lossy.
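
The PFC PAUSE frame carries a class-enable vector plus a pause time for each of the eight classes, which is what lets you pause one lane and leave the rest alone. Here's a schematic Python sketch (not a byte-accurate 802.1Qbb encoder), just to show the shape of it.

    # Schematic PFC pause: pick which CoS lanes to pause and for how long.
    def build_pfc_pause(pause_times):
        """pause_times: dict of CoS (0-7) -> quanta; 0 quanta means 'resume'."""
        enable_vector = 0
        quanta = [0] * 8
        for cos, time in pause_times.items():
            enable_vector |= (1 << cos)   # this class's time field is meaningful
            quanta[cos] = time
        return {"opcode": 0x0101, "enable_vector": enable_vector, "quanta": quanta}

    # Pause CoS 3 (say, the lossless FCoE lane) for the maximum time,
    # while CoS 0-2 and 4-7 keep flowing:
    pfc_frame = build_pfc_pause({3: 0xFFFF})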

Why would you want to pause traffic instead of drop traffic when congestion occurs?

Much of the IP traffic that traverses our data centers is OK with a bit of loss. It’s expected. Any protocol will have its performance degraded if packet loss is severe, but most traffic can take a bit of loss. And it’s not like pausing traffic will magically make congestion go away.

But there is some traffic that can benefit from losslessness, and some that just flat-out requires it. FCoE (Fibre Channel over Ethernet), a favorite topic of mine, requires losslessness to operate. Fibre Channel is inherently a lossless protocol (through the use of B2B, or Buffer-to-Buffer, credits), since the primary payload of an FC frame is SCSI. SCSI does not handle loss very well, so FC was engineered to be lossless. As such, priority flow control is one of the (several) requirements for a switch to be able to forward FCoE frames.
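
For comparison, buffer-to-buffer credits work the other way around: instead of telling the sender to stop when things get tight, the sender can only transmit when it holds a credit the receiver handed out up front. Here's a toy Python model of the transmit side (wire.transmit() is just a placeholder, not any real API):

    # Toy B2B credit flow control: no credit, no transmit -- so no drops.
    class B2BCreditTx:
        def __init__(self, advertised_credits):
            self.credits = advertised_credits   # BB_Credit learned at login

        def try_send(self, frame, wire):
            if self.credits == 0:
                return False            # hold the frame; don't drop it
            self.credits -= 1
            wire.transmit(frame)
            return True

        def on_r_rdy(self):
            self.credits += 1           # peer freed a receive buffer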

iSCSI is another protocol that can benefit from pausing rather than dropping. Instead of encapsulating SCSI into FC frames, iSCSI encapsulates SCSI into TCP segments. This means that if a TCP segment is lost, it will be retransmitted. So at first glance it would seem that iSCSI can handle loss just fine.

From a performance perspective, though, TCP suffers mightily when a segment is lost because of TCP's congestion management techniques. When a segment is lost, TCP backs off on its transmission rate (specifically, the number of segments in flight without acknowledgement), and then ramps back up again. By making the iSCSI traffic lossless, frames are slowed down during congestion, but the TCP congestion algorithm never kicks in. As a result, many iSCSI vendors recommend turning on old-timey flow control to keep packet loss to a minimum.
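
A back-of-the-envelope way to see the cost of a drop: classic TCP congestion avoidance roughly halves the congestion window on a loss and then grows it by only one segment per round trip, so every drop costs many RTTs of reduced throughput. The numbers below are illustrative only.

    # Toy AIMD: additive increase each RTT, multiplicative decrease on loss.
    def window_over_time(rtts, loss_at, start_cwnd=64):
        cwnd, history = start_cwnd, []
        for rtt in range(rtts):
            if rtt in loss_at:
                cwnd = max(cwnd // 2, 1)   # loss: cut the window in half
            else:
                cwnd += 1                  # no loss: creep back up slowly
            history.append(cwnd)
        return history

    print(window_over_time(20, loss_at={5, 6}))   # two drops, a long climb back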

However, many switches today can't actually deliver full losslessness. Take the venerable Catalyst 6500. It's a switch that's very common in data centers, and it is a frame-murdering machine.

The problem is that while the Catalyst 6500 supports old-timey flow control (it doesn't support PFC) on physical ports, there's no mechanism that I'm aware of to prevent buffer overruns from one port to another inside the switch. Take the example of two ingress Gigabit Ethernet ports sending traffic to a single egress Gigabit Ethernet port, with both ingress ports running at line rate. There's no signaling (at least that I'm aware of, could be wrong) that would prevent those ingress ports from overwhelming the transmit buffer of the egress port.


Many frames enter, not all leave

This is like flying to Hawaii and not reserving a hotel room before you get on the plane. You could land and have no place to stay. Because there’s no way to ensure losslessness on a Catalyst 6500 (or many other types of switches from various vendors), the Catalyst 6500 is like Thunderdome. Many frames enter, not all leave.


Catalyst 6500 shown with a Sup2T

The new generation of DCB (Data Center Bridging) switches, however, use a concept known as VoQ (Virtual Output Queues). With VoQs, the ingress port will not send a frame to the egress port unless there's room. If there isn't room, the frame stays in the ingress buffer until there is. And if the ingress buffer fills up, the switch can signal the device connected to that port to PAUSE (either old-timey pause or PFC).
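
In rough Python terms (buffer sizes and method names are made up, and real fabric schedulers are far more involved), the VoQ idea looks something like this:

    from collections import deque

    # Toy VoQ: check the egress port for room before handing the frame across
    # the fabric; otherwise park it in an ingress-side queue for that egress
    # port, and if even that fills, push back on the sender instead of dropping.
    class IngressPort:
        def __init__(self, voq_depth, pause_peer):
            self.voqs = {}                  # one virtual output queue per egress port
            self.voq_depth = voq_depth
            self.pause_peer = pause_peer    # callback: send 802.3x/PFC pause upstream

        def receive(self, frame, egress):
            voq = self.voqs.setdefault(egress, deque())
            if egress.has_room():
                egress.enqueue(frame)       # egress buffer has space, send it over
            elif len(voq) < self.voq_depth:
                voq.append(frame)           # wait here until the egress port drains
            else:
                self.pause_peer()           # VoQ full too: PAUSE the sending device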

This is a technique that's been in use in Fibre Channel switches from both Brocade and Cisco (as well as others) for a while now, and it's now making its way into DCB Ethernet switches from various vendors. Cisco's Nexus line, for example, makes use of VoQs, and so do Brocade's VCS switches. Some type of lossless capability between internal ports is required in order to be a DCB switch, since FCoE requires losslessness.

DCB switches require lossless backplanes/internal fabrics, support for PFC, ETS (Enhanced Transmission Selection, a way to reserve bandwidth on various CoS lanes), and DCBx (a way to communicate these capabilities to adjacent switches). This makes them capable of a lot of cool stuff that non-DCB switches can’t do, such as losslessness.
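
ETS is the piece that divvies up bandwidth between those CoS lanes. Here's a toy Python sketch of the idea (the percentages and group names are arbitrary): each group gets its guaranteed slice, and whatever a quiet group doesn't use gets handed to the busy ones.

    # Toy ETS-style allocation on a single link.
    def ets_allocate(link_gbps, guarantees, demand):
        """guarantees: group -> % of link; demand: group -> offered load (Gbps)."""
        alloc = {g: min(demand[g], link_gbps * pct / 100)
                 for g, pct in guarantees.items()}
        spare = link_gbps - sum(alloc.values())
        for g in sorted(demand, key=lambda grp: demand[grp] - alloc[grp], reverse=True):
            extra = min(spare, demand[g] - alloc[g])   # hand spare bandwidth to busy groups
            alloc[g] += extra
            spare -= extra
        return alloc

    # Storage gets a 50% guarantee but can borrow what LAN traffic isn't using:
    print(ets_allocate(10, {"storage": 50, "lan": 30, "other": 20},
                       {"storage": 8, "lan": 1, "other": 1}))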

One thing to keep in mind, however, is when Layer 3 comes into play. My guess is that even in a DCB switch that can do Layer 3, losslessness can’t be extended beyond a Layer 2 boundary. That’s not an issue with FCoE, since it’s only Layer 2, but iSCSI can be routed.

Goals for 2013

As the year closes, and it turns out the world didn’t end, it’s time to start planning for 2013 (especially since I don’t know when the next doomsday is supposed to be).

My 2012 in review:

  • Obtained CCNA Data Center (possibly the first outside of Cisco, literally days after it was available)
  • Obtained CCNP Data Center (probably not the first, I know I tied with one guy at least)
  • Didn’t pass the CCIE Data Center written (beta or actual)
  • Ran a marathon in Australia (continent number 4 for marathons, shooting for all 7)
  • Saw a total solar eclipse (part of the previous trip)
  • Australia is the 30th country that I’ve visited (and I’m not counting airport layovers, such as Egypt and Japan)
  • Did more aerobatic pilot training


Fruity drinks with Kurt Bales in Australia in 2012

My career goals for 2013:

  • Pass the CCIE Data Center written in January
  • Obtain CCIE Data Center in 2013
  • Obtain VCAP-DCA
  • ABL (Always Be Learning)


Flying a plane upside down in 2012

I think career-wise, getting the CCIE DC and the VCAP-DCA is plenty for a 12-month span, as both are very tall orders. And though ambitious, with the current support system I have and the resources available publicly (such as vBrownBag) and through Firefly, they're both doable for 2013. I've got some thoughts on that particular combination of certifications which I'll go into in another post.

There are a couple of technologies that look exciting for 2013 that I'd like to take a (closer) look at. OpenStack, for one, and how it relates to the data center, as I have only a vague conceptual understanding of it. VXLAN, STT in VMware, NVGRE in Windows Server 2012, and overlay technologies in general. Checking out the other hypervisor vendors, especially (and the condescending Unix administrator in me is going to throw up a bit in my mouth when I say this) Hyper-V 3.

So those are my goals for 2013. Yours?

#CCNPDC CCNP Data Center: A Slightly Longer Journey

As of Monday December 10th (12/10/2012) I’m now officially CCNP Data Center certified (though my cert didn’t show up in the system until Saturday). 


To get the CCNP Data Center certification, you need to pass four of six exams in two available combinations. 

First, you must pass DCUCI and DCUFI. Currently, you can take either the version 4 or version 5 of those tests. 

  • DCUCI (642-994 v4/642-995 v5)
  • DCUFI (642-992 v4/642-997 v5)

Currently, passing either version counts toward the CCNP DC, and they don't have to be the same version, i.e. you can pass version 4 of DCUCI and version 5 of DCUFI, etc. After Feb 23rd, though, only version 5 will count.

Two more tests need to be passed (in no specific order), and you have two options: the troubleshooting tests or the design tests. I opted for the design tests, though I did pass the UCS troubleshooting test (I figured out later it was an older version, and it didn't count toward the CCNP DC).

Design Path

  • DCUCD version 4 or 5 (642-993 or 642-991)
  • DCUFD version 4 or 5 (642-991 or 642-996)

Troubleshoot Path

  • DCUCT v5.0 (642-035)
  • DCUFT v5.0 (642-980)

I’ll likely need to take DCUCT anyway, since I’d like to start teaching the UCS troubleshooting course. I did actually pass the troubleshooting test, but it was version 4, which doesn’t count. D’oh. 

These exams have been around for a while in one form or another, so they're not brand new like the exams for CCNA Data Center. In fact, there are those who will probably automatically trigger a CCNP Data Center certification when they pass the CCNA Data Center tests, because they've already done the NP-level tests. That's especially true among the CCSI (Cisco Certified Systems Instructor) crowd.

These are expensive tests, as the price to sit exams has gone up. The CCNA Data Center tests are $250 apiece, so it's $500 just to get your CCNA Data Center (assuming you pass on the first try). The CCNP Data Center exams are $200 apiece, so that's another $800, assuming you pass on the first try (I didn't). I got DCUCI on my third try and passed the rest on the first try, so all told I spent $1,700 to go through CCNP Data Center. Some of the exams will be reimbursed by Firefly, as I need some of them to continue teaching (my old DCUCI was about to expire, for example). Still, I'm covering at least part of that $1,700 out of pocket.

I have some more to add in a bit about the experience, as well as the "why" and "how". More later. For now:


How I Would Feel Taking One More Exam…

 

#CCNADC CCNA Data Center (my short journey)

On Monday, I think it was, Cisco announced the completion of the Data Center track: the CCNA Data Center and CCNP Data Center certifications, with tests available immediately. And you know me, I live in PearsonVUE test centers, and I'm a data center nut, so I signed that shit right up.

I’m now CCNA Data Center certified.

CCNA Data Center, less than a week after it came out

Took the first test (640-911) on Wednesday 11/21/12 (first day I could schedule) and passed with an 830. I booked the next available date (today 11/24/12) for the 640-916 test and passed, squeaking by with a 798 (797 required).

How I felt when I saw that I passed by one point

I found 640-911 tougher, and thought I got more answers wrong. 640-916 seemed easier, since it’s more of the topics I teach on a regular basis (UCS, ACE, Fibre Channel). But for some reason I scored higher on the 640-911. Go figure.

I took them both blind, without studying or reading up (and no, no “study guides”). I didn’t even look at the exam topics for 640-911, and I barely glanced at them for 640-916. Generally, the questions were all data center specific, and covered topics you’d find in the various non-track (specialization) data center certs from Cisco. Also, I’ve gotten the question “Is there WAAS on the CCNA Data Center?” It’s not in the exam topics, and I don’t think I’m violating the confidentiality agreement by confirming the exam topics list by saying no, there’s no WAAS. Thankfully, because ugh WAAS.

So why take the trouble for a CCNA Data Center when I’m working on the CCIE Data Center? The reason is the CCNP Data Center. To get the CCNP Data Center, I need the CCNA Data Center. My goal is CCIE Data Center, but I’m impatient. There are very limited seats for the CCIE Data Center because right now, I think there’s only a single pod for the entire world (I think CCIE Wireless is like that too, or at least it was when it started out). Thus it’ll be a while before I get it (I’m guessing Summer 2013), even assuming I make it on the first try (which, odds are, I won’t). My highest Cisco certification is a CCSI, which is the teaching certification. I don’t have an NP-level at all, having dropped pursuit of my CCNP R&S a while ago in pursuit of other certs.

So by January I hope to have the CCNP Data Center hammered out. I’ve already got one of the tests done (DCUCI from like, ages ago), and I can’t recall if I did DCUCD or not. I need DCUFI and DCUFD, both of which I need to get anyway. Plus one of the troubleshooting (DCUFTS/DCUCTS) and I’ll be a CCNP Data Center.

Edit (11/25/12): Turns out my DCUCI pass won’t cut it. It’s an older version of the test, and they need either the V4 or the V5. So I’m back to square zero. Also, I got the required tests wrong:

You need to pass only four exams.

You have to pass DCUCI and DCUFI (V4 or V5), and you can either do the two design exams (DCUCD and DCUFD) or do the two troubleshooting exams (DCUCT and DCUFT). In all likelihood, I’ll end up doing all 6 tests because I’m a Cisco instructor and I need certs like woah, but I think I’ll go design first.

Overall, I'm very pleased that Cisco now has a full data center track. They've had several specializations, but unless you're an instructor like me or have a partner-level requirement, those certs are pretty much worthless career-wise. They have zero brand recognition. For example, if I told you I'm a Cisco Data Center Application Services Support Specialist, would you care? Probably not. You've never heard of it, so you have no idea how difficult/easy it is. That's the benefit of a CCIE, since it has probably the best brand recognition of any certification in any genre of IT. Whether you're a Linux admin, Microsoft developer, or Juniper router jockey, you likely are aware of the CCIE (and the difficulty associated with it). CCNP is not too far down that list either.

So, onward to the CCNP Data Center.

Get Your Hands Off My HDD

Ever since I first had a device boot via SSD, I've been a huge fan and proponent. I often say SSDs enjoy the Charlton Heston effect: "You'll pull my SSD out of my cold, dead hands."

They’re just absolutely fantastic for desktop operating systems. Nothing you can do will make your desktop or laptop respond faster than adding an SSD for boot/applications. Even a system a couple years old with an SSD will absolutely run circles around a brand new system that’s still rocking the HDD.

And the prices? The prices are dropping faster than American Airlines' reputation. Currently you can get great SSDs for less than $1 per gig. Right now the sweet spot is a 256 GB SSD, though the 480/512 GB models are coming down as well.

Desktop operating systems are very I/O intensive, especially with respect to IOPS, and that's where SSDs shine. Your average 5400 RPM laptop drive gives about 60 IOPS, while a decent SSD gives you about 20,000 (more for reads). So unless you're going to strap 300+ drives to your laptop (man, your battery life would suck), you're not going to get the same performance as you would from an SSD. Not even close. And it doesn't matter if you're SATA 2 or SATA 3 on your motherboard (or even SATA 1); the SSD's primary benefit of super-high IOPS won't be restricted by SATA bandwidth.
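
Quick sanity check on that "300+ drives" figure, using the same rough IOPS numbers:

    # How many 5400 RPM laptop drives does it take to match one SSD on IOPS?
    hdd_iops = 60
    ssd_iops = 20000
    print(ssd_iops / hdd_iops)   # ~333 spindles per SSD, give or take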

So right now there are two primary drawbacks: SSDs cost a bit more, and you get less storage than you would with an HDD. But boy, do you get the IOPS.

However, lately I’ve heard a few people express hesitance (and even scorn) towards SSD. “When you have an SSD go tits up, then you’ll wish you had a hard drive” is something I’ve heard recently.

Three of the biggest issues I see are:

1: Fear of running out of writes: SSDs have a limited write lifespan. Each cell can only be written to a certain number of times, and when that limit is reached, the cell becomes read-only. Modern SSD controllers do tricks like wear leveling to stretch that lifespan, but the limit is still there.

2: Data retrieval: If the SSD fails, there are no methods for retrieving data. There are lots of ways you can attempt to recover data from a failed disk of spinning rust (though nothing guaranteed), but no such options exist for SSDs that I’m aware of.

3: SSDs lie: SSDs do lie to you. They tell you that you wrote to a particular logical block, but that block doesn't map to a fixed physical cell the way it maps to a sector/track on a spinning drive. This is because SSDs do wear leveling to ensure the longest possible lifespan of the SSD; otherwise the blocks where the swap is stored would wear out far quicker than the rest of the drive. Our file systems (NTFS, Ext4, even ZFS) were all built around the abilities and limitations of spinning rust, and haven't caught up to flash memory. As a result, the SSD controller has to lie to us and pretend it's a spinning disk. (A rough sketch of that remapping follows this list.)
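
Here's that very rough Python sketch of the remapping idea, just to show the mechanics; real flash translation layers are far more sophisticated, and everything here (names included) is made up for illustration.

    # Toy FTL: map each logical block the OS writes onto whichever physical
    # cell has seen the least wear, so repeated writes to one logical block
    # (swap, journal) don't hammer a single cell.
    class ToyFTL:
        def __init__(self, num_cells):
            self.wear = [0] * num_cells   # writes absorbed by each physical cell
            self.mapping = {}             # logical block -> physical cell

        def write(self, logical_block, data):
            cell = min(range(len(self.wear)), key=lambda c: self.wear[c])
            self.mapping[logical_block] = cell   # the OS thinks the block never moved
            self.wear[cell] += 1
            # ...actually program 'data' into 'cell' here...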

Here are a few things to keep in mind.

1: Yes, SSDs have a limited lifespan. The Crucial M4 has a rated write life of 36 TB, which works out to about 20 GB a day for five years (the quick math after this list bears that out). You probably don't write that much data to your SSD every day. And the worst that happens when your drive reaches the limit is that it becomes read-only. I don't trust HDDs that are older than 4 or 5 years anyway.

2: True, if your SSD fails, there’s little chance of recovery (while there’s some chance of recovery if it’s a HDD). This highlights the need for a decent backup mechanism. Don’t let the chance that you could retrieve data from a HDD be your backup plan.

3: Yes, SSDs lie. So do HDDs.
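
The quick math on that endurance rating, for the skeptical:

    # 20 GB of writes a day, every day, for five years:
    gb_per_day = 20
    years = 5
    print(gb_per_day * 365 * years / 1000)   # ~36.5 TB, right at the 36 TB rating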

I still use HDDs for media storage, backups, and archival. But apps and OS, that’s definitely going to sit on an SSD from now on. It’s just too awesome. And if that means I have to swap them out every 5 years? I’m fine with that.

Citrix and Cisco

The rumor mill turns out to be pretty accurate. Cisco announced today in Spain that they're partnering with Citrix on a number of items, including integrating NetScaler as their next-generation load balancer alongside other network services (vWAAS, ASA, Nexus 1000v). Citrix has also announced a trade-in program called AMP to help with (and encourage) migration to NetScaler. It looks like Citrix will be taking the reins, and it's mostly a Citrix sale/deployment.

The announcement was light on details, and many questions remain. Will it be an OEM deal? Just a reseller deal, or a "hey, go talk to Citrix and buy their stuff" arrangement? Will it involve their physical devices or the virtual appliance? (I suspect both.)

So for the first time in almost 15 years, Cisco is not in the load balancer business.

Latest Rumors: Cisco to license/purchase NetScaler?

I feel like I’ve become the TMZ of Cisco load balancer gossip, and as much as I’d like to stop, I’ve got some more rumors for y’all.

Cisco! Cisco! Cisco! Is it true you’re having a love child with Citrix?

I’ve heard from a number of unofficial non-Cisco sources that Cisco is in talks to do something with NetScaler, and something will be announced soon. Some of the stock analysis sites (which first reported the impending death of ACE) have picked up the rumors, and so has Network World.

The rumors have been anything from Cisco buying NetScaler from Citrix to an OEM agreement, to a sales agreement where Cisco sales sells Citrix as part of their data center offerings. So we’ll see what happens.

 

 

As The Datacenter Turns…

This whole ACE thing has had more twists and turns than a daytime soap opera, or perhaps a vampire franchise on the CW aimed at teens and young adults. And things keep getting more interesting. Greg Ferro recently talked to a Cisco official about the ACE, a discussion I believe was started by a comment over at Brad Casemore's blog from a Cisco representative insisting that no, ACE is not dead. Meanwhile, the folks over at Cisco WAAS are very eager to let you know that they have a pulse and aren't going anywhere. This seemed necessary, as WAAS has long been associated with ACE (I think they shared a business unit at one point) and has been eyed as another potential Cisco market exit. Plus, it didn't help that the WAAS group has recently been rumored to have had massive layoffs.

Calculon, on learning that the ACE, his fiancée, is on life support and is actually his sister. Also, double amnesia.

With the ACE in abandoned-but-not-discontinued limbo, speculation is rampant about Cisco's next move. I think they're still working out what to do next, and I think the ACE discontinuation got outed quicker than they expected (again, no inside knowledge here). They could do what Juniper did: drop out entirely and partner with vendors that have a better product. The obvious partnership would be F5, assuming Cisco could swallow its pride. A10 is another, and also a purchase target since they're privately held, though I think neither is likely. There are a lot of Riverbed Stingray fans showing up in the comments sections of my articles and others', but since Cisco is still actively competing with Riverbed in the WOC space, that seems especially unlikely. They could end up buying a part of another company, such as Citrix's NetScaler business. Radware could also be purchased, but they have a near-zero footprint in the US, and not a great reputation. We'll have to wait and see; I'm sure there will be more twists and turns.

Requiem for the ACE

Ah, the Cisco ACE. As we mourn our fallen product, I'll take a moment to reflect on this development, as well as what the future holds for Cisco and load balancing/ADCs. First off, let me state that I have no inside knowledge of Cisco's plans in this regard. While I teach Cisco ACE courses for Firefly and develop Firefly's courseware for both ACE products and bootcamp material for the CCIE Data Center, I'm not an employee of Cisco, so this is pure speculation.

Also, it should be made clear that Cisco has not EOL'd (End of Life) or even EOS'd (End of Sale) the ACE product, and in a post on the CCIE Data Center group, Walid Issa, the project manager for CCIE Data Center, made a statement reiterating this. And just as I was about to publish this post, Brad Casemore put up a great post also reflecting on the ACE, with an interesting comment from Steven Schuchart of Cisco (analyst relations?) claiming that ACE is, in fact, not dead.

However, there was a statement Cisco sent to CRN confirming the rumor, and my conversations with people inside Cisco have confirmed that yes, the ACE is dead. Or at least, that's the understanding of Cisco employees in several areas. The word I'm getting is that the ACE will be bug-fixed and security-fixed, but further development will halt. The ACE may not officially be EOL/EOS, but for all intents and purposes, and until I hear otherwise, it's a dead-end product.

The news of ACE’s probable demise was kind of like a red-shirt getting killed. We all knew it was coming, and you’re not going to see a Spock-like funeral, either. 

We do know one thing: For now at least, the ACE 4710 appliance is staying inside the CCIE Data Center exam. Presumably in the written (I’ve yet to sit the non-beta written) as well as in the lab. Though it seems certain now that the next iteration (2.0) of the CCIE Data Center will be ACE-less.

Now let's take a look down memory lane, to the Ghosts of Load Balancers Past…

Ghosts of Load Balancers Past

As many are aware, Cisco has had a long yet… imperfect relationship with load balancing. This is somewhat ironic, considering that Cisco was, in fact, the very first vendor to bring a load balancer to market. In 1996, Cisco released the LocalDirector, the world's first load balancer. The product itself sprang from Cisco's purchase of Network Translation Incorporated in 1996, which also brought about the PIX firewall platform.

The LocalDirectors did relatively well in the market, at least at first. It addressed a growing need for scaling out websites (rather than the more expensive, less resilient method of scaling up). The LocalDirectors had a bit of a cult following, especially from the routing and switching crowd, which I suspect had a lot to do with its relatively simple functionality: For most of its product life, the LocalDirector was just a simple Layer 4 device, and only moved up the stack in the last few years of its product life. While other vendors went higher up the stack with Layer 7 functionality, the LocalDirector stayed Layer 4 (until near the end, when it got cookie-based persistence). In terms of functionality and performance, however,  vendors were able to surpass the LocalDirector pretty quickly.

The most important feature that the other vendors developed in the late 90s was arguably cookie persistence. (The LocalDirector didn't get this feature until about 2001, if I recall correctly.) This allowed the load balancer to treat multiple people coming from the same IP address as separate users. Without cookie-based persistence, load balancers could only do persistence based on an IP address, and were thus susceptible to the AOL megaproxy problem (you could have thousands of individual users coming from a single IP address). There was more than one client in the 1999-2000 time period where I had to yank out a LocalDirector and put in a Layer 7-capable device because of AOL.
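
Here's a toy Python contrast of the two persistence methods, with made-up server names and cookie name: with source-IP persistence, everyone behind the same megaproxy IP gets pinned to one server, while a per-user cookie keeps each user sticky without lumping them all together.

    SERVERS = ["web1", "web2", "web3"]

    def pick_by_source_ip(client_ip):
        # Every user behind the same proxy IP lands on the same real server.
        return SERVERS[hash(client_ip) % len(SERVERS)]

    def pick_by_cookie(cookies, chosen_server):
        # Returning user: honor the server named in the persistence cookie.
        if "lb_persist" in cookies:
            return cookies["lb_persist"], cookies
        # New user: the load balancer picks a server however it likes, then
        # hands back a Set-Cookie so later requests stick to that server.
        return chosen_server, dict(cookies, lb_persist=chosen_server)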

Cookie persistence is a tough habit to break

At some point Cisco came to terms with the fact that the LocalDirector was pretty far behind and must have concluded it was an evolutionary dead end, so it paid $6.7 billion (with a B) to buy ArrowPoint, a load balancing company that had a much better product than the LocalDirector. That product became the Cisco CSS, and for a short time Cisco was on par with the offerings from other vendors. Unfortunately, as with the LocalDirector, development and innovation seemed to stop after the purchase, and the CSS was forever a product frozen in the year 2000. Other vendors innovated (especially F5), and as time went on the CSS won fewer and fewer deals. By 2007, the CSS was largely a joke in load balancing circles. Many sites were happily running the CSS, of course (and some still are today), but feature-wise it was getting its ass handed to it by the competition.

The next load balancer Cisco came up with had a very short lifecycle. The Cisco CSM (Content Switching Module), a load balancing module for the Catalyst 6500 series, didn't last very long and, as far as I can remember, never had a significant install base. I don't recall ever using it, and know it only through legend (as being not very good). It was quickly replaced by the next load balancing product from Cisco.

And that brings us to the Cisco ACE. Available in two iterations, the Service Module and the ACE 4710 Appliance, it looked like Cisco might have learned from its mistakes when it released the ACE. Out of the gate it was a bit more of a modern load balancer, offering features and capabilities that the CSS lacked, such as a three-tiered VIP configuration mechanism (real servers, server farms, and VIPs, which made URL rules much easier) and the ability to insert the client's true source IP address into an HTTP header in SNAT situations. The latter was a critical function that the CSS never had.

But the ACE certainly had its downsides. The biggest issue is that the ACE could never go toe-to-toe with the other big names in load balancing in terms of features. F5 and NetScaler, as well as A10, Radware, and others, always had a far richer feature set than the ACE. It was, as Greg Ferro said, a moderately competent load balancer, in that it did what it was supposed to do, but it lacked the features the other guys had.

The number one feature that kept the ACE from eating at the big-boy table was the lack of an answer to F5's iRules. iRules give a huge amount of control over how to load balance and manipulate traffic. You can use them to create a login page on the F5 that authenticates against AD (without ever touching a web server), rewrite http:// URLs to https:// (very useful in certain SSL termination setups), and even calculate Pi every time someone hits a web page. Many of the other high-end vendors have something similar, but F5's iRules is the king of the hill.
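
iRules themselves are written in TCL, but conceptually a rule like that http-to-https rewrite boils down to something like this Python sketch (the request structure is made up; this is the idea, not F5's API):

    # Conceptual L7 rule: inspect the request at the load balancer and
    # redirect plain HTTP to HTTPS before it ever reaches a web server.
    def on_http_request(request):
        if request["scheme"] == "http":
            return {"status": 302,
                    "headers": {"Location": "https://" + request["host"] + request["uri"]}}
        return None   # fall through to normal load balancing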

In contrast, the ACE can evaluate existing HTTP headers, and can manipulate headers to a certain extent, but the ACE cannot do anything with HTTP content. There’s more than one installation where I had to replace the ACE with another load balancer because of that issue.

The ACE never had a FIPS-compliant SSL implementation either, which prevented the ACE from being in a lot of deals, especially with government and financial institutions. ACE was very late to the game with OCSP support and IPv6 (both were part of the 5.0 release in 2011), and the ACE10 and ACE20 Service Modules will never, ever be able to do IPv6. You’d have to upgrade to the ACE30 Module to do IPv6, though right now you’d be better off with another vendor.

For some reason, Cisco decided to use MQC (Modular QoS CLI) as the configuration framework for the ACE. This meant configuring a VIP required setting up class maps, policy maps, and service policies in addition to real servers and server farms. This was far more complicated than the configuration on most of the competition, despite the fact that the ACE had less functionality. If you weren't at a CCNP level or higher, the MQC could be maddening. (On the upside, if you mastered it on the ACE, QoS was a lot easier to learn, as was the case for me.)

If the CLI was too daunting, there was always the GUI on the ACE 4710 Appliance and/or the ACE Network Manager (ANM), a separate user interface that ran on Red Hat and later became its own OVA-based virtual appliance. The GUI in the beginning wasn't very good, and the ACE Service Modules (ACE10, ACE20, and now the ACE30) lacked a built-in GUI. Also, when it hits the fan, the CLI is the best way to quickly diagnose an issue, and if you weren't fluent in the MQC and the ACE's rather esoteric use of it, it was tough to troubleshoot.

There was also a brief period of time when Cisco was selling the ACE XML Gateway, a product obtained through the purchase of Reactivity in 2007, which provided some (but not nearly all) of the features the ACE lacked. It still couldn’t do something like iRules, but it did have Web Application Firewall abilities, FIPS compliance, and could do some interesting XML validation and other security. Of course, that product was short lived as well, and Cisco pulled the plug in 2010.

Despite these shortcomings, the ACE was a decent load balancer. The ACE Service Module was a popular option for the Catalyst 6500 series and could push up to 16 Gbps of traffic, making it suitable for just about any site. The ACE 4710 appliance was also a popular option at a lower price point and could push 4 Gbps (although it only had four 1 Gbit ports, never 10 Gbit). Those who were comfortable with the ACE enjoyed it, and there are thousands of happy ACE customers with deployments.

But "decent" isn't good enough in the highly competitive load balancing/ADC market. Industry juggernauts like F5 and scrappy startups like A10 smoke the ACE in terms of features, and unless a shop is going all-Cisco, the ACE almost never wins in a bake-off. I even know of more than one occasion where Cisco had to essentially invite itself to a bake-off (and it didn't win those, either). The ACE's market share has continued to drop since its release, and from what I've heard it's in the low teens in terms of percentage, while F5 has about 50%.

In short, the ACE was the knife that Cisco brought to the gunfight. And F5 had a machine gun.

I’d thought for years that Cisco might just up and decide to drop the ACE. Even with the marketing might and sales channels of Cisco, the ACE could never hope to usurp F5 with the feature set it had. Cisco didn’t seem committed to developing new features, and it fell further behind.

Then Cisco included ACE in the CCIE Data Center blueprint, so I figured they were sticking with it for the long haul. Then the CRN article came out, and surprised everybody (including many in Cisco from what I understand).

So now the big question is whether or not Cisco is bowing out of load balancing entirely, or coming out with something new. We’re certainly getting conflicting information out of Cisco.

I think both are possible. Cisco has made a commitment (one they seem to be living up to) to drop businesses and products they aren't successful in. While Cisco has shipped tens of thousands of load balancing units since the first LocalDirector was unboxed, except for the very beginning they've never led the market. Somewhere in the early 2000s, that title came to belong almost exclusively to F5.

For a company as broad as Cisco is, load balancing as a technology is especially tough to sell and support. It takes a particular skill set that doesn’t relate fully to Cisco’s traditional routing and switching strengths, as load balancing sits in two distinct worlds: Server/app development, and networking. With companies like F5, A10, Citrix, and Radware, it’s all they do, and every SE they have knows their products forwards and backwards.

I think the hardware platform the ACE is based on (Cavium Octeon network processors) is one of the reasons the ACE hasn't caught up in terms of features. To do things like iRules, you need fast, generalized processors. Most of the vendors have gone with x86 cores, and lots of them. Vendors can use pure x86 power to do both Layer 4 and Layer 7 load balancing, or, like F5 and A10, incorporate FPGAs to hardware-assist the Layer 4 load balancing and distribute flows to x86 cores for the more advanced Layer 7 processing.

The Cavium network processors don’t have the horsepower to handle the advanced Layer 7 functionality, and the ACE Modules don’t have x86 at all. The ACE 4710 Appliance has an x86 core, but it’s several generations back (it’s seriously a single Pentium 4 with one core). As Greg Ferro mentioned, they could be transitioning completely away from that dead-end hardware platform, and going all virtualized x86. That would make a lot more sense, and would allow Cisco to add features that it desperately needs.

But for now, I’m treating the ACE as dead.

CCIE Data Center Beta Written Results Are In! (351-080)

And Cisco probably couldn’t be happier that the results are finally in. It’s been more than 3 months since the beta closed, and after a few promises of “soon”, we finally got our results today. Over at the Cisco learning community message boards for CCIE DC, there was a virtual riot going on.

Guys? I think we’d better get those results posted…

Once I got word they were live on PearsonVUE, I logged in and…. I failed.

Smug Cisco Guy: Way to go, dumbass.

At least we got our results.

To find out your status, go to PearsonVUE, log into your account, and check your history. It'll show the pass or fail. Beyond pass/fail, we have to await the score report to find out what our weak areas were. My guess is I was really weak on the 7K/5K stuff. I know I got all the ACE-related questions right, and most of the storage and UCS seemed pretty evident to me. I'll have to wait and see, of course. I've scheduled a retake for October 5th, so I've got some books to hit. Cue the montage…