Automation Conundrum

During Tech Field Day 8, we saw lots of great presentations from amazing companies. One presentation that disappointed, however, was Symantec. It was dull (lots of marketing) and a lot of the stuff they had to offer (like dedupe and compression in their storage file system) had been around in competing products for many years. If you use the products, that’s probably pretty exciting, but it’s not exactly making me want to jump in. And I think Robin Harris had the best description of their marketing position when he called it “Cloudwashing”. I’ve taken the liberty of creating a dictionary-style definition for Cloudwashing:

Cloudwashing: Noun. 1. The act of taking your shitty products and contriving some relevance to the blah blah cloud.

One product I was particularly unimpressed by was Symantec’s Veritas Operations Manager. It’s a suite that’s supposed to automate and report on a disparate set of operating systems and platforms, providing a single pane of glass for virtualized data center operations. “With just a few clicks on Veritas Operations Manager, you can start and stop multi-tier applications decreasing downtime.” That’s the marketing, anyway.

In reality, what they seemed to have created was an elaborate system to… automatically restart a service if it failed. You’d install this pane of glass, pay the licensing fee or whatever, configure the hooks into all your application servers, web servers, database servers… and what does it do? It restarts the process if it fails. What does it do beyond that? Not much more, from what I could see in the demo. I pressed them on a few issues during the presentation (which you can see here; the Virtual Business Services part starts around the 32-minute mark), and that’s all they seemed to have. Restarting a process.
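To show how little is actually being delivered here, the core behavior fits in a few lines. This is my own minimal sketch of a restart-on-failure watchdog (nothing to do with Symantec’s actual implementation; the command and parameters are placeholders):

```python
import subprocess
import time

def watchdog(argv, poll_interval=5.0, max_restarts=3):
    """Launch a process and relaunch it whenever it exits.

    This really is the whole trick: poll, and if the process has died,
    start it again. Returns how many restarts were performed.
    """
    proc = subprocess.Popen(argv)
    restarts = 0
    while True:
        time.sleep(poll_interval)
        if proc.poll() is None:       # still running, nothing to do
            continue
        if restarts >= max_restarts:  # give up eventually
            return restarts
        restarts += 1
        print(f"process exited ({proc.returncode}); restart #{restarts}")
        proc = subprocess.Popen(argv)
```

That’s roughly the value proposition, minus the licensing fee and the pane of glass.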

So, not terribly useful. But I don’t think the problem is one of engineering; instead, I think it’s the overall philosophy of top-down automation.

See, we all have visions of Iron Man in our heads.

Iron Man Says: Cloud That Shit

Wait, what? Iron Man?

Yes, Iron Man. In the first Iron Man movie, billionaire industrialist Tony Stark built three suits: one to get him out of the cave; a second, all-silver one that had an icing problem; and a third, which he used for the rest of the movie to punch the shit out of a lot of bad guys. He built the first two by hand. The third, however, was built through complete automation. He said “build it,” and Jarvis, his computer, said: “Commencing automated assembly. Estimated completion time is five hours.”

And then he takes his super car to go to a glamorous party.

Fuck you, Tony Stark. I bet you never had to restart a service manually.

How many times have you said “Jarvis, spin up 1,000 new desktop VMs, replicate our existing environment to the standby datacenter, and resolve the last three trouble tickets” and then went off in an Audi R8 to a glamorous party? I’m guessing none.

So, are we stuck doing everything manually, by hand, like chumps? No, but I don’t believe the solution will be top-down. It will be bottom-up.

The real benefit of automation we’re seeing today comes from automating the simple, mundane tasks here and there, not from orchestrating some amazing AI into a self-healing, self-replicating SkyNet-type singularity. The time savings are enormous, and while it isn’t as glamorous as having a self-healing, SkyNet-style data center, it does give us a lot more time to do actual glamorous things.

Since I teach Cisco’s UCS blade systems, I’ll use them as an example (sorry, HP). In UCS, there is the concept of service profiles: an abstraction of the aspects of a server that are usually tied to physical hardware and found in disparate places. Boot order (BIOS), connectivity (SAN and LAN switches), BIOS and HBA firmware (typically flashed separately and manually), MAC and WWN addresses (burnt in), and more are all stored and configured via a single service profile, and that profile is then assigned to a blade. Cisco even made a demonstration video showing they could take a new chassis with a single blade from sitting in the box to up and online with ESXi in less than 30 minutes.
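To make the abstraction concrete, here’s a rough sketch of what a service profile bundles together. This is my own illustration, not Cisco’s API or the real UCS object model, and every name and value in it is made up for the example:

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    """Illustrative stand-in for a UCS service profile: settings that
    normally live in the BIOS, the switch configs, and burnt-in hardware,
    gathered into one object you assign to a blade."""
    name: str
    boot_order: list     # normally set in the BIOS
    vlans: list          # normally configured on the LAN switches
    vsans: list          # normally configured on the SAN switches
    bios_firmware: str   # normally flashed separately and manually
    hba_firmware: str
    macs: list           # normally burnt into the NICs
    wwns: list           # normally burnt into the HBAs

def associate(profile, blade_slot):
    """Applying the profile is one operation instead of half a dozen."""
    return (f"blade {blade_slot} now runs as '{profile.name}' "
            f"with MACs {profile.macs} and WWNs {profile.wwns}")

esx_host = ServiceProfile(
    name="esx-host-01",
    boot_order=["san", "local-disk"],
    vlans=[10, 20], vsans=[100],
    bios_firmware="2.0(3a)", hba_firmware="4.2(1)",
    macs=["00:25:B5:00:00:01"], wwns=["20:00:00:25:B5:00:00:01"],
)
print(associate(esx_host, 3))
```

The point isn’t the code, it’s the shape: one object carries the server’s whole identity, so moving a workload to a new blade means reassigning the profile, not reconfiguring six different places by hand.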

The Cisco UCS system isn’t particularly intelligent and doesn’t respond dynamically to increased load, but it automates a lot of tasks that we used to have to do manually. It “lubes” the process, to borrow the term Chris Sacca used in a great talk he did at Le Web 2009. I’ll take that over some overly complicated pane-of-glass solution that essentially restarts processes when they stop any day.

Perhaps at some point we’ll get to the uber-smart self-healing data center, but right now everyone who has tried has come up really, really short. Instead, there have been tremendous benefits in automating the mundane tasks, the unsexy tasks.

FCoE: I’m not Dead! Arista: You’ll Be Stone Dead in a Moment!

I was at Arista on Friday for Tech Field Day 8, and when FCoE came up (always a good way to get a lively discussion going), Andre Pech from Arista (who did a fantastic job as a presenter) pointed to an article written by Douglas Gourlay, another Arista employee, entitled “Why FCoE is Dead, But Not Buried Yet”.

FCoE: “I feel happy!”

It’s an interesting article, because much of the player-hating seems to be directed at TRILL, not FCoE, and as J Metz has said time and time again, you don’t need TRILL to do FCoE if you do FCoE the way Cisco does (with a Fibre Channel Forwarder in each FCoE switch). Arista, not having any Fibre Channel skills, can’t do it this way. If they were to do FCoE, Arista (like Juniper) would have to do it the sparse-mode/FIP-snooping way, which requires a non-STP way of handling multipathing, such as TRILL or SPB.

Jayshree Ullal, the CEO of Arista, hated on TRILL and spoke highly of VXLAN and NVGRE (Arista is on the standards body for both). I think part of that is that, like Cisco, not all of their switches will be able to support TRILL, since TRILL requires new Ethernet silicon.

Even the CEO of Arista acknowledged that FCoE works great at the edge, where you plug a server with an FCoE CNA into an FCoE switch, and the traffic is sent along to the native Ethernet and native Fibre Channel networks from there (what I call single-hop or no-hop FCoE). This doesn’t require any additional FCoE infrastructure in your environment, just the edge switch. The Cisco UCS Fabric Interconnects are a great example of this no-hop architecture.

I don’t think FCoE is quite dead, but I have to imagine it’s not going as well as vendors like Cisco had hoped. At least, it hasn’t been the success some vendors imagined. And I think there are two major contributors to FCoE’s failure to launch, both of them more Layer 8 than Layer 2.

Old Man of the Data Center

Reason number one is also the reason why we won’t see TRILL/FabricPath deployed widely: It’s this guy:

Don’t let him trap you into hearing him tell stories about being a FDDI bridge, whatever FDDI is

The Catalyst 6500 series switch. This is “The Old Man of the Data Center”. And he’s everywhere. The switch is a bit long in the tooth, and although capacity is much higher on the Nexus 7000s (and even the 5000s in some cases), the Catalyst 6500 still has a huge install base.

And it won’t ever do FCoE.

And it (probably) won’t ever do TRILL/FabricPath (spanning-tree fo-evah!)

The 6500s aren’t getting replaced in significant numbers from what I can see. Especially with the release of the Sup 2T supervisor for the 6500s, they aren’t going anywhere anytime soon. I can only speculate as to why Cisco keeps pursuing the 6500 so aggressively, but the install base speaks for itself.

Another reason customers haven’t replaced their 6500s is that the Nexus 7000 isn’t a full-on replacement: it has no service modules, limited routing capability (it only recently gained the ability to do MPLS), and a form factor that’s much larger than the 6500’s (although the 7009 just hit the streets with a very similar form factor, which raises the question: why didn’t Cisco release the 7009 first?).

Premature FCoE

So reason number two? I think Cisco jumped the gun. They’ve been pushing FCoE for a while, but they weren’t quite ready. It wasn’t until July 2011 that Cisco released NX-OS 5.2, which is what’s required to do multi-hop FCoE on the Nexus 7000s and MDS 9000s. They’ve had the ability to do multi-hop FCoE on the Nexus 5000s for a bit longer, but not much. Yet they’d been talking about multi-hop for longer than it was actually possible to implement. Cisco has had a multi-hop FCoE reference architecture posted on their website since March 2011, showing a beautifully designed multi-hop FCoE network with 5000s, 7000s, and MDS 9000s that for months wasn’t possible to implement. Even today, if you wanted to implement multi-hop FCoE with Cisco gear (or anyone else’s), you’d be a very, very early adopter.

So no, I don’t think FCoE is dead. No-hop FCoE is certainly successful (even Arista’s CEO acknowledged as much), and I don’t think even multi-hop FCoE is dead, but it certainly hasn’t caught on (yet). Will multi-hop FCoE catch on? I’m not sure. We’ll have to see.