Top 5 Reasons The Evaluator Group Screwed Up

It’s been a while since the trainwreck of a “study” commissioned by Brocade and performed by The Evaluator Group, but it’s still being discussed in various storage circles (and that’s not good news for Brocade). Some people pretty much parroted the results, seemingly without reading the actual test, then got all pissy when confronted about it. I did a piece on my interpretation of the results, as did Dave Alexander of WWT and J Metz of Cisco. Our mutual conclusion can best be summed up with a single animated GIF.

 

[Animated GIF: “bullshit”]

But since a bit of time has passed and I’ve had time to absorb Dave and J’s opinions, as well as others’, I’ve come up with a list of the Top 5 Reasons The Evaluator Group Screwed Up. This isn’t the complete list, of course, but it covers some of the more glaring problems. Let’s start with #1:

Reason #1: I Have No Idea What I’m Doing

Their hilariously bad conclusion for the higher variance in response times and higher CPU usage was that it was caused by the software initiators. Except they didn’t use software initiators. They had actually configured hardware initiators, and didn’t know it. Let that sink in: they’re charged with performing an evaluation, without knowing what they’re doing.

Their response: “The Cisco UCS VIC 1240 hardware CNA’s were utilized. Referring to them as software initiators caused some confusion. The Cisco VIC is a hardware initiator and we configured them with virtual HBAs. Evaluator Group has no knowledge of the internal architecture of the VIC or its driver. Our commentary of the possible cause for higher CPU utilization is our opinion and further analysis would be required to pinpoint the specific root cause.”

Of course, it wasn’t the software initiator. They didn’t use a software initiator, but they were so clueless they didn’t know they’d actually used a hardware initiator. Without knowing how they performed their tests (since they didn’t publish their methodology), it’s purely speculation, but it looks like the problem was caused by congestion (from them architecting the UCS solution incorrectly).

Reason #2: They’re Hilariously Bad At Math

They claimed FCoE required 50% more cables, based on the fact that there were 50% more cables in the FCoE solution than the FC solution. Which makes sense… except that the FC system had zero Ethernet.

That’s right: in the HP/Fibre Channel solution, each blade had absolutely zero Ethernet connectivity. None. Zilch. In the Cisco UCS solution, every blade had full Ethernet and Fibre Channel connectivity. Why did they do that? Probably because had they included any network connectivity for the HP system, the cable count would have shifted in FCoE’s favor. Let me state this again, because it’s astonishingly stupid: they claimed FCoE (which included Ethernet and FC connectivity) required more cables, while not including any network connectivity for the HP/FC system.

[Image: “not even mad”]

Also, they made some power/cooling claims, despite the fact that the UCS solution didn’t require a separate FC switch (the Fabric Interconnect is capable of being a full-fledged Fibre Channel switch by itself), while the HP solution would have required a separate pair of Ethernet switches (which weren’t included). So yeah, their math is a bit off. Had they done things, you know, correctly, the power, cooling, and cable count would have flipped in favor of FCoE.
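To make the apples-to-apples point concrete, here’s a quick back-of-the-envelope sketch in Python. The per-chassis link counts are hypothetical placeholders (the report doesn’t publish its cabling details); the point is the logic: converged FCoE links carry LAN and SAN together, while the report’s FC count quietly omits the Ethernet the HP chassis would still need.

```python
# Hypothetical per-chassis link counts -- illustrative only, not taken from the report.
fcoe_converged_links = 8    # UCS chassis: converged links carrying Ethernet *and* FC
fc_links_hp = 4             # HP chassis: dedicated FC uplinks (all the report counted)
ethernet_links_hp = 8       # Ethernet uplinks the HP chassis would still need

# The report's comparison: converged links vs. FC links alone.
print("Report's math: %d FCoE cables vs %d FC cables" % (fcoe_converged_links, fc_links_hp))

# Apples-to-apples: count every cable each solution actually needs.
print("Fair math:     %d FCoE cables vs %d FC + Ethernet cables"
      % (fcoe_converged_links, fc_links_hp + ethernet_links_hp))
```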

Reason #3: UCS is Hard, You Guys!

They whinged about UCS being more difficult to set up. Anytime you’re dealing with unfamiliar technology, it’s natural that it’s going to be more difficult. However, they claimed that they had zero experience with HP as well (seriously, who at Brocade hired these guys?). How easy is UCS? Here is a video done from Amsterdam where a couple of Cisco techs added a new chassis and blade and had it up and running ESXi in less than 30 minutes, from in the box to booted. Cisco UCS is different than other blade systems, but it’s also very easy (and very quick) to stand up. And keep in mind, the video I linked was done in Amsterdam, so they were probably baked.

Reason #4: It Contradicts Everyone Else’s Results (Especially those that know what they’re doing)

For the past couple of years, VMware and NetApp have been doing performance tests on various storage protocols. Here’s one from a few years ago, which includes (native) 4 and 8 Gbit Fibre Channel, 10 Gbit FCoE, 10 Gbit iSCSI, and 10 Gbit NFS. The conclusion? The protocol doesn’t much matter. They all came out about the same when normalized for bandwidth. The big difference is in the storage backend. At least they published their methodology (I’m looking at you, Evaluator Group). Here’s one from Demartek that shows a mixture of storage protocols saturating 10 Gbit Ethernet. Again, the limitation is only the link speed itself, not the protocol. And, again, Demartek published their methodology.

Reason #5: How Did They Set Everything Up? Magic!

Most of the time with these commissioned reports, the details of how everything is configured are given so that the results can be reproduced and audited. How did the Evaluator Group set up their environment?

[Animated GIF: GOB’s “magic”]

As far as I can tell, magic. There are several things they could have easily gotten wrong with the UCS setup, and given their mistake about software/hardware initiators, quite likely did. They didn’t even mention which storage vendor they used.

So there you have it. A bit of a re-hash, but hey, it was a dumb report. The upside, though, is that it did provide me with some entertainment.

FCoE versus FC Farce (I’m Tellin’ All Y’All It’s Sabotage!)

Updates 2/6/2014:

  • @JohnKohler noticed that the UCS Manager screenshot used (see below) is from a UCS Emulator, not any system they used for testing.
  • Evaluator Group promises answers to questions that both I and Dave Alexander (@ucs_dave) have brought up.

On my way back from South America/Antarctica, I was pointed to a bake-off/performance test commissioned by Brocade and performed by a company called Evaluator Group. It compared the performance of edge FCoE (non-multi-hop FCoE) to native 16 Gbit FC. The FCoE test was done on a Cisco UCS blade system connecting to a Brocade switch, and the FC was done on an HP C7000 chassis system connecting to the same switch. At first glance, it would seem to show that FC is superior to FCoE for a number of reasons.

I’m not a Cisco fanboy, but I am a Cisco UCS fanboy, so I took great interest in the report. (I also work for a Cisco Learning Partner as an instructor and courseware developer.) But I also like Brocade, and have a huge amount of respect for many of the Brocade employees I have met over the years. They are great and smart people, and they serve their customers well.

First, a little bit about these types of reports. They’re pretty standard in the industry, and they’re commissioned by one company to showcase the superiority of a product or solution against one or more of its competitors. They can produce some interesting information, but most of the time it’s a case of: “Here’s our product in a best-case scenario versus other products in a mediocre-to-worst-case scenario.” No company would release a test showing other products superior to theirs, of course, so these reports are only released when a particular company comes out on top, or (most likely) the parameters are changed until they do. As such, they’re typically taken with a grain of salt. In certain markets, such as the load balancer market, vendors will make it rain with these reports on a regular basis.

But for this particular report, I found several substantial issues with it which I’d like to share. It’s kind of a trainwreck. Let’s start with the biggest issue, one that is rather embarrassing.

What The Frak?

On page 17, check out the Evaluator Group’s comments:

“…This indicates the primary factor for higher CPU utilization within the FCoE test was due to using a software initiator, rather than a dedicated HBA. In general, software initiators require more server CPU cycles than do hardware initiators, often negating any cost advantages.”

For one, no shit. Hardware initiators will perform better than software initiators. However, the Cisco VIC 1240 card (which according to page 21 was included in the UCS blades) is a hardware initiator card. Because it’s a CNA (converged network adapter), the OS sees a native FC interface. With ESXi you don’t even need to install extra drivers; the FC interfaces just show up. Setting up a software FCoE initiator would actually be quite a bit more difficult to get going, which might account for why it took so long to configure UCS. Configuring a hardware vHBA in UCS is quite easy (it can be done in literally less than a minute).

Using software initiators against hardware FC interfaces is beyond a nit-pick in a performance test. It would be downright sabotage.

[GIF: “Sabotage”]

I asked the @Evaluator_Group account if they really did use software initiators:

And they responded in the affirmative.

Wow. First of all, software FCoE initiators are absolutely not a standard configuration for UCS. In the three years I’ve been configuring and teaching UCS, I’ve never seen or even heard of FCoE software initiators being used, either in production or in a testing environment. The only reason you *might* want to do FCoE software initiators is when you’ve got the Intel mezzanine card (which is not a CNA, just an Ethernet card) and want to test FCoE. However, page 21 shows the UCS blades as having the VIC 1240 cards, not the Intel card.

So where were they getting that it was standard configuration?

They countered with a reference to a UCS document regarding the VIC.

The part they referenced is as follows:

The adapter offerings are:

■ Cisco Virtual Interface Cards (VICs)

Cisco developed Virtual Interface Cards (VICs) to provide flexibility to create multiple NIC and HBA devices. The VICs also support adapter Fabric Extender and Virtual Machine Fabric Extender technologies.

■ Converged Network Adapters (CNAs)

Emulex and QLogic Converged Network Adapters (CNAs) consolidate Ethernet and Storage (FC) traffic on the Unified Fabric by supporting FCoE.

■ Cisco UCS Storage Accelerator Adapters

Cisco UCS Storage Accelerator adapters are designed specifically for the Cisco UCS B-series M3 blade servers and integrate seamlessly to allow improvement in performance and relief of I/O bottlenecks.

Wait… I think they think that the VIC card is a software-only FCoE card. It appears they came to that conclusion because this particular document doesn’t specifically mention that the VIC is a CNA (other UCS documents clearly and correctly indicate that the VIC card is a CNA). Because it lists the VICs separately from the traditional CNAs from Emulex and QLogic, they seem to believe it’s not a CNA, and thus a software card.

So it may be they did use hardware initiators, and mistakenly called them software initiators. Or they actually did configure software initiators, and did a very unfair test.

No matter how you slice it, it’s troubling. On one hand, if they did configure software initiators, they either ignorantly or willfully sabotaged the FCoE results. If they just didn’t understand the basic VIC concept, it means they set up a test without understanding the most basic aspects of the Cisco UCS system. We’re talking 101-level stuff, too. I suspect it’s the latter, but since the only configuration of any of the devices they shared was a worthless screenshot of UCS Manager, I can’t be sure.

This lack of understanding could have a significant impact on the results. For instance, on page 14 the response time starts to get worse for FCoE at about the 1200 MB/s mark. That’s roughly the max for a single 10 Gbit Ethernet FCoE link (1250 MB/s). While not definitive, it could mean that the traffic was going over only one of the links from the chassis to the Fabric Interconnect, or that the traffic distribution was way off. My guess is they didn’t check the link utilization, didn’t know how to, or didn’t know how to fix it if it were off.
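The 1200 MB/s knee is easy to sanity-check with some quick arithmetic (a minimal sketch; the single-link interpretation is my speculation, as above):

```python
# Back-of-the-envelope ceiling for a single 10 Gbit Ethernet link carrying FCoE.
link_gbps = 10                     # line rate, gigabits per second
mb_per_sec = link_gbps * 1000 / 8  # 10,000 Mbit/s divided by 8 bits per byte

print(mb_per_sec)  # 1250.0 MB/s, before FCoE/Ethernet framing overhead (which only lowers it)

# If aggregate response times degrade right around ~1200 MB/s, that's consistent with
# the I/O riding a single 10 GbE chassis uplink rather than being spread across both
# links to the Fabric Interconnect.
```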

Conclusions First?

One of the odder aspects of this report is where you find some of the conclusions, such as this one:

“Evaluator Group believes that Fibre Channel connectivity is required in order to achieve the full benefits that solid-state storage is able to provide.”

That was page 1. The first page of a report is a weird place for a conclusion like that. Makes you sound a little… biased. These reports usually try to appear impartial (despite being commissioned by a particular vendor). This one starts right out with an opinion.

Lack of Configuration

The amount of configuration they provide for the setup is very sparse. For the UCS side, all they provide is one pretty worthless screenshot. Same for the HP system. For the UCS part, it would be important to know how they configured the vHBAs, and how they configured the two 10 Gbit links from each IOM to the Fabric Interconnects. Were they configured for the preferred fabric port channel, or static pinning? There’s no way to duplicate this test. That’s not very transparent.

Speaking of configuration, one of the issues they had with the FCoE side was how long it took them to stand up a UCS system. On page 11, second paragraph, they mention they needed the help of a VAR to get everything configured. They even made a comment on page 10:

“…this approach was less intuitive during installation than other enterprise systems Evalutor Group has tested.”

[Image: “You don’t say?”]

So wow, you’ve got an environment that you’re unfamiliar with (we’ll see just how unfamiliar in a minute), and it took you *gasp* longer to configure? I’m a bit of an expert on Cisco UCS. I’m kind of a big deal. I have many paper-bound UCS books and my apartment smells of rich mahogany. I teach it regularly. And I’m not nearly as familiar with the C7000 system from HP, so I’d be willing to bet that it would take me longer to stand up an HP system than it would a Cisco UCS system. Anyone want in on that action?

Older Version? Update 2-6-2014: Screenshot Is A Fake

[Image: “It’s a fake!”]

@JohnKohler noticed that in the screenshot on page 23, the serial number is “1”, which means the screenshot is not from any physical instance of UCS Manager, but from the UCS Emulator. The date (if accurate on the host machine) shows June of 2010, so it’s a very, very old screenshot (probably of 1.3 or 1.4). So we have no idea what version of UCS they used for these tests (and, more importantly, how it was configured).

Based on the screenshot on page 23, the UCS version is 2.0. How can you tell? Take a look where it says “Unconfigured Ports”. As of 2.1, Cisco changed the way ports are shown in the GUI: 2.1 and later do not have a sub-menu for unconfigured ports; only 2.0 and prior do.

[Screenshot: UCS Manager 2.1]

In version 2.1, there’s no “Unconfigured Ports” sub-menu. 

[Screenshot: Evaluator Group’s UCS Manager]

From Page 21, you can see “Unconfigured Ports”, indicating it’s UCS 2.0 or earlier.

If the test was done in November 2013, that would put it one major revision behind, as 2.1 was released in November of 2012 (2.2 was released in December 2013). We don’t know where they got the equipment, but if it was acquired anytime in the past year, it likely came with 2.1 already installed (though that’s not definite). To get to 2.0, they’d have had to downgrade. I’m not sure why they would have done that, and if they’d brought in a VAR with certified UCS people, that VAR would likely have recommended 2.1. With 2.1, they could have directly connected the storage array to the UCS Fabric Interconnects. UCS 2.1 can do Fibre Channel zoning, and can function as a standard Fibre Channel switch. The Brocade switch wouldn’t have been needed, and the links to the storage array would have been 8 Gbit instead of 16 Gbit.

They also don’t mention the solid-state array vendor, so we don’t know if there was the capability to do FCoE directly to the storage array. Though not terribly common yet, FCoE connection to a storage array is done in production environments, and it would have benefited the UCS configuration if it were possible (and would be the preferred way if competing with 16 Gbit FC). There would also have been the ability to do an LACP port channel between the Fabric Interconnects and the array, providing better load distribution and redundancy.

Power Mad

The claim that the UCS system uses more power is laughable, especially since they specifically mention this setup is not how it would be deployed in production. Nothing about this setup was production-worthy, and it wasn’t supposed to be. It’s fine for this type of test, but not for production. It’s non-HA, and using only 2 blades would be a waste of a blade system (and of power, for either HP or Cisco). If you were only using a handful of servers, buying pizza boxes would be far more economical. Blades from either HP or Cisco only make sense past a certain number, to justify the enclosures, networking, etc. If you want a good comparison, do 16 blades, or 40 blades, or 80 blades. Also include the Ethernet network connectivity: the UCS configuration has full Ethernet connectivity; the HP configuration as shown has squat.

Even by competitive-report standards, this one is utter bunk. If I were Brocade, I would pull the report. With all the technology mistakes and ill-founded conclusions, it’s embarrassing for them. It’s quite easy for Cisco to rip it to shreds.

(Just so there’s no confusion, no one paid me or asked me to write this. In fact, I’m still on PTO. I wrote this because someone is wrong on the Internet…)

UCS Manager 2.1: Case of the Missing Management IP Pools

My colleague Barry Gursky was playing with the new UCS Manager Emulator for the 2.1 release (which you can find on Cisco’s web site; you’ll need a CCO account, but no special contracts from what I can tell), and he noticed something: the Management IP Pool menu was missing.

[Screenshot]

UCS Manager 2.0 and prior had the Management IP Pool in the Admin Tab

It’s normally in the Admin tab, as shown above. But sure enough, it was gone.

You need pools of IP addresses to assign to service profiles and blades for out-of-band management (via the Cisco Integrated Management Controller, which gives you KVM, IPMI, etc.). You could assign the IPs manually, but that would be rather annoying. Pools are way easier.

We searched and finally found it moved to the LAN tab.

[Screenshot: the Management IP Pool, now under the LAN tab]

Mystery solved.

[GIF: Sherlock winking]

NPV and NPIV

The CCIE Data Center blueprint makes mention of NPV and NPIV, and Cisco UCS also makes heavy use of both: topics that many may be unfamiliar with. This post (part of my CCIE Data Center prep series) will explain what they do, and how they’re different.

(Another great post on NPV/NPIV is from Scott Lowe, and can be found here. This is a slightly different approach to the same information.)

NPIV and NPV are two of the most ill-named acronyms I’ve come across in IT, especially since they sound very similar yet do two fairly different things. NPIV is an industry-wide term, short for N_Port ID Virtualization, while NPV is a Cisco-specific term, short for N_Port Virtualization. Huh? Yeah, not only do they sound similar, but the names give very little indication as to what they do.

NPIV

First, let’s talk about NPIV. To understand NPIV, we need to look at what happens in a traditional Fibre Channel environment.

When a host plugs into a Fibre Channel switch, the host end is called an N_Port (Node Port), and the FC switch’s interface is called an F_Port (Fabric Port). The host has what’s known as a WWPN (World Wide Port Name, or pWWN), which is a 64-bit globally unique label very similar to a MAC address.

However, when a Fibre Channel host sends a Fibre Channel frame, that WWPN is nowhere in the header. Instead, the host does a Fabric Login (FLOGI) and obtains an FCID (somewhat analogous to an IP address). The FCID is a 24-bit number, and when FC frames are sent in Fibre Channel, the FCID is what goes into the source and destination fields of the header.

Note that the first byte of the FCID is the domain ID of the FC switch that serviced the host’s FLOGI.
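If you want to see that structure concretely, here’s a tiny sketch that splits a 24-bit FCID into its conventional Domain_ID / Area_ID / Port_ID bytes (the example FCID is made up):

```python
def split_fcid(fcid):
    """Split a 24-bit Fibre Channel ID into its three one-byte fields."""
    domain_id = (fcid >> 16) & 0xFF  # byte 1: domain of the switch that answered the FLOGI
    area_id = (fcid >> 8) & 0xFF     # byte 2: area (commonly tied to the switch port)
    port_id = fcid & 0xFF            # byte 3: the device on that port
    return domain_id, area_id, port_id

# Hypothetical FCID handed out by a switch whose Domain_ID is 0x08:
print([hex(b) for b in split_fcid(0x080100)])  # ['0x8', '0x1', '0x0']
```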

In regular Fibre Channel operations, only one FCID is given per physical port. That’s it. It’s a 1 to 1 relationship.

But what if you have, for example, an ESXi host with virtual Fibre Channel interfaces? For those virtual Fibre Channel interfaces to complete a fabric login (FLOGI), they’ll need their own FCIDs. Or what if you don’t want a Fibre Channel switch (such as an edge or blade FC switch) to go full Fibre Channel switch?

NPIV lets an FC switch give out multiple FCIDs on a single port. Simple as that.

The magic of NPIV: one F_Port gives out multiple FCIDs (0x070100 to the ESXi host, and 0x070200 and 0x070300 to the virtual machines).
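As a toy model of what that NPIV-enabled F_Port is doing (the WWPNs are made up, and real switches have their own FCID allocation schemes; this just mirrors the numbers in the caption): every additional login on the same physical port gets its own FCID, all under the same Domain_ID.

```python
class NpivFPort:
    """Toy model of an NPIV-enabled F_Port on a switch with Domain_ID 0x07."""

    def __init__(self, domain_id=0x07):
        self.domain_id = domain_id
        self.logins = {}  # WWPN -> FCID
        self.count = 0

    def flogi(self, wwpn):
        # Each login on this one port gets the next FCID under the same domain.
        self.count += 1
        fcid = (self.domain_id << 16) | (self.count << 8)
        self.logins[wwpn] = fcid
        return fcid

f_port = NpivFPort()
print(hex(f_port.flogi("20:00:00:25:b5:aa:00:01")))  # ESXi host -> 0x70100
print(hex(f_port.flogi("20:00:00:25:b5:aa:00:02")))  # VM 1      -> 0x70200
print(hex(f_port.flogi("20:00:00:25:b5:aa:00:03")))  # VM 2      -> 0x70300
```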

NPV: Engage Cloak!

NPV is typically used on edge devices, such as a ToR Fibre Channel switch or an FC switch installed in a blade chassis/infrastructure. What does it do? I’m gonna lay some Star Trek on you.

NPV is a cloaking device for a Fibre Channel switch.

Wait, did you just compare Fibre Channel to a Sci-Fi technology? 

How is NPV like a cloaking device? Essentially, an NPV-enabled FC switch is hidden from the Fibre Channel fabric.

When a Fibre Channel switch joins a fabric, it’s assigned a Domain_ID, and participates in a number of fabric services. With this comes a bit of baggage, however. See, Fibre Channel isn’t just like Ethernet. A more accurate analogue to Fibre Channel would be Ethernet plus TCP/IP, plus DHCP, distributed 802.1X, etc. Lots of stuff is going on.

And partly because of that, switches from various vendors tend not to get along, at least without enabling some sort of interoperability mode. Without interoperability mode, you can’t plug a Cisco MDS FC switch into, say, a Brocade FC switch. And if you do use interoperability mode and two different vendors in the same fabric, there are usually limitations imposed on both switches. Because of that, not many people build multi-vendor fabrics.

Easy enough. But what if you have a Cisco UCS deployment, or some other blade system, and your Fibre Channel switches are from Brocade? As much as your Cisco rep would love to sell you a brand new MDS-based fabric, there’s a much easier way.

Engage the cloaking device.

(Note: NPV is a Cisco-specific term; other vendors have NPV functionality but call it something else, like Brocade’s Access Gateway.) A switch in NPV mode is invisible to the Fibre Channel fabric. It doesn’t participate in fabric services, doesn’t get a domain ID, and doesn’t do fabric logins or assign FCIDs. For all intents and purposes, it’s invisible to the fabric, i.e. cloaked. This simplifies deployments, saves on domain IDs, and lets you plug switches from one vendor into a switch from another vendor. Plug a Cisco UCS Fabric Interconnect into a Brocade FC switch? No problem: NPV. Got a QLogic blade FC switch plugging into a Cisco MDS? No problem: run NPV on the QLogic blade FC switch (and NPIV on the MDS).

The hosts perform fabric logins just like they normally would, but the NPV/cloaked switch passes the FLOGIs up to the NPIV-enabled port on the upstream switch. The FCID of each device bears the Domain ID of the upstream switch (and the device appears directly attached to the upstream switch).

The NPV enabled switch just proxies the FLOGIs up to the upstream switch. No fuss, no muss. Different vendors can interoperate, we save domain IDs, and it’s typically easier to administer.
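Here’s a minimal sketch of that relationship (names and numbers are made up): the NPV switch owns no Domain_ID and assigns no FCIDs; it just relays each host FLOGI to the upstream NPIV-capable switch, so every FCID that comes back carries the upstream switch’s domain.

```python
class NpivCoreSwitch:
    """Upstream switch that owns a Domain_ID and hands out FCIDs (toy model)."""

    def __init__(self, domain_id):
        self.domain_id = domain_id
        self.assigned = 0

    def flogi(self, wwpn):
        self.assigned += 1
        return (self.domain_id << 16) | self.assigned  # FCID under *its* domain


class NpvEdgeSwitch:
    """NPV-mode edge switch: no Domain_ID, no fabric services, just a login proxy."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.fcid_table = {}  # WWPN -> FCID learned from upstream

    def host_flogi(self, wwpn):
        fcid = self.upstream.flogi(wwpn)  # relay the host's FLOGI upstream
        self.fcid_table[wwpn] = fcid      # remember it so frames can be forwarded
        return fcid


core = NpivCoreSwitch(domain_id=0x0A)  # e.g. a Brocade or MDS core switch
edge = NpvEdgeSwitch(core)             # e.g. a Fabric Interconnect in end-host (NPV) mode
print(hex(edge.host_flogi("20:00:00:25:b5:bb:00:01")))  # 0xa0001 -- the upstream switch's domain
```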

TL;DR: NPIV allows multiple FLOGIs (and multiple FCIDs issued) from a single port. NPV hides the FC switch from the fabric.