Is The OS Relevant Anymore?

I started out my career as a condescending Unix administrator, and while I'm not a Unix administrator anymore, I'm still quite condescending. In the past, I've run data centers based on Linux, FreeBSD, and Solaris, and I've administered Windows boxes, OpenBSD, NetBSD, and even NeXTSTEP (the best desktop of the 90s).

In my role as a network administrator (and network instructor), this experience has become invaluable. Why? One reason is that most networking devices these days are built on an open source operating system.

And recently, I got into a discussion on Twitter (OK, kind of a Twitter fight, but it's all good with the other party) about the underlying operating systems for these network devices and their relevance. My position? The underlying OS is mostly irrelevant.

First of all, the term OS can mean a great many things. In the context of this post, when I talk about OS I'm referring only to the underlying OS. That's the kernel, libraries, command line, drivers, networking stack, and file system. I'm not referring to the GUI stack (GNOME, KDE, or Unity for the Unixes, Mac OS X's GUI stack, Win32 for Windows) or other types of stack such as a web application stack like LAMP (Linux, Apache, MySQL, and PHP).

Most routers and MLS (multi-layer switches, switches that can route as fast as they can switch) run an open source operating system as their control plane. The biggest exception is of course Cisco's IOS, which is proprietary as hell. But IOS has reached its limits, and Cisco's NX-OS, which runs on Cisco's next-gen Nexus switches, is based on Linux. Arista famously runs Linux (Fedora Core) and doesn't hide it from the users (which allows it to do some really cool things). Juniper's Junos is based on FreeBSD.
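As a quick illustration of how un-hidden that Linux is: on the Arista gear I've had my hands on, you can drop straight from the EOS CLI into a regular shell and poke around like it's any other Fedora box (prompts approximated from memory):

switch# bash
[admin@switch ~]$ uname -a
[admin@switch ~]$ df -h
[admin@switch ~]$ exit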

In almost every case of router and multi-layer switch, however, the operating system doesn't forward any packets. That is all handled in specialized silicon. The operating system is only responsible for the control plane, running processes like OSPF, spanning tree, BGP, and other services to decide on a set of rules for forwarding incoming packets and frames. These rules, sometimes called a FIB (Forwarding Information Base), are programmed into the hardware forwarding engines (such as the much-used Broadcom Trident chipset). These forwarding engines do the actual switching/routing. Packets don't hit the general x86 CPU; they're all handled in the hardware. The control plane (running as various coordinated processes on top of one of these open source operating systems) tells the hardware how to handle packets.

So the only thing the operating system does (other than handle the occasional punted packet) is tell the hardware how to handle traffic the general CPU will never see. This is the way it has to be, because x86 hardware can't scale nearly as well as special-purpose silicon can, especially when you factor in power and cooling. Latency is way lower as well.

In fact, hardware-wise, most vendors (Juniper, Arista, Huawei, Alcatel-Lucent, etc.) have been using the exact same chip in their latest switches. So the differentiation isn't the silicon. Is the differentiation the underlying operating system? No, it makes little difference to the end user. The OS is instead a (mostly) invisible platform upon which the services (CLI, APIs, routing protocols, SDN hooks, etc.) are built. Networking vendors are in the middle of a transition into being software developers (and motherboard gluers).

All you need to create a 10 Gigabit Switch

Let's also take a look at networking devices where the underlying OS may actually touch the data plane, a genre with which I'm very well acquainted: load balancers (and no, I'm not calling them Application Delivery Controllers).

F5's venerable BIG-IPs were initially based on BSDI (a years-dead BSD), and then switched to Linux. CoyotePoint was based on FreeBSD, and is now based on NetBSD. Cisco's ACE is based on Linux (although Cisco's shitty CSS runs proprietary VxWorks, but it's not shitty because of VxWorks). Most of the other vendors are based on Linux. However, the baseline operating system makes very little difference these days.

Most load balancers have SSL offload (to push the CPU-intensive asymmetric encryption onto a specialized processor). This is especially important as we move to 2048-bit SSL certificates. Some load balancers have Layer 2/3/4 silicon (either ASICs or FPGAs, which are flexible ASICs) to help out with forwarding traffic, and hit general CPUs (usually x86) for the Layer 7 parsing. So does the operating system touch the traffic going through a load balancer? Usually, but not always. Well, it depends.

So with Cisco on Linux and Juniper on FreeBSD, would either company benefit from switching to a different OS? Does either company enjoy a competitive advantage from having chosen their respective platform? No. In fact, switching platforms would likely be a colossal waste of time and resources. The underlying operating systems just provide some common services to run the networking services that program the line cards and silicon.

When I brought up Arista and their Fedora Core-based control plane, which they open up to customers, here's how someone (a BSD fan) described Fedora: "inconsistent and convoluted", "building/testing/development as painful", and "hasn't a stable file system after 10 years".

Reading that statement, you'd think that dealing with Fedora is a nightmare. That's not remotely true. Some of that statement is exaggeration (and you could find specific examples to support that statement for any operating system) and some of it is fantasy. No stable file system? Linux has had several solid file systems, including ext2, ext3, ext4, and XFS, for quite a while.

In a general sense, I think the operating system is less relevant than it used to be. Take OpenBSD for example. Its well-deserved reputation for security is legendary. Still, would there be any advantage today to running your web application stack on OpenBSD? Would your site be any more secure? Probably not. Not because OpenBSD is any less secure today than it was a while ago, quite the opposite. It's because the attack vectors have changed. The attacks are hitting the web stack and other pieces rather than the underlying operating system. Local exploits aren't that big of a deal because few systems let anyone but a few users log in anyway. The biggest attacks lately have come from either SQL injection or attacks on desktop operating systems (mostly Windows, but recently Apple as well).

If you're going to expose a server directly to the Internet on a DMZ or (gasp) without any firewall at all, OpenBSD is an attractive choice. But that doesn't happen much anymore. Servers are typically protected by layers of firewalls, IPS/IDS, and load balancers.

Would Android be more successful or less successful if Google switched from Linux as the underpinnings to one of the BSDs? Would it be more secure if they switched to OpenBSD? No, and it would be an entirely wasted effort. It's not likely any of the security benefits of OpenBSD would translate into the Dalvik stack that is the heart of Android.

As much as fanboys/girls don't want to admit it, it's likely the number one reason people choose an OS is familiarity. I tend to go with Linux (although I have FreeBSD and OpenBSD-based VMs running in my infrastructure) because I'm more familiar with it. For my day-to-day uses, Linux or FreeBSD would both work. Neither has a competitive advantage over the other in that regard. Linux outright wins in some cases, such as virtualization (the BSDs have been well behind in that technology, though they run fine as guests), but for most stuff it doesn't matter. I use FreeNAS, which is FreeBSD-based, but I don't care what it runs. I'd use FreeNAS if it were based on Linux, OpenBSD, or whatever. (Because it's based on FreeBSD, FreeNAS does run ZFS, which for some uses is better than any of the Linux file systems, although I don't run FreeNAS's ZFS since it's missing encryption.)

So fanboy/girlism aside, for the most part today, the choice of an operating system isn't the huge deal it may once have been. People succeed using Linux, FreeBSD, OpenBSD, NetBSD, Windows, and more as the basis for their platforms (web stack, mobile stack, network device OS, etc.).

TRILLapalooza

If there's one thing people lament in the routing and switching world, it's the spanning tree protocol and the way Ethernet forwarding is done (or more specifically, its limitations). I made my own lament last year (don't cross the streams), and it's come up recently in Ivan Pepelnjak's blog. Even server admins who've never logged into a switch in their lives know what spanning tree is: it is the destroyer of uptime, causer of Sev1 events, a pox among switches.

I am become spanning-tree, destroyer of networks

The root of the problem is the way Ethernet forwarding is done: there can't be more than one active path for an Ethernet frame to take from a source MAC to a destination MAC. This basic limitation hasn't changed for the past few decades.

And yet, for all the outages and all of the configuration issues and problems spanning tree has caused, there doesn't seem to be much enthusiasm for the more fundamental cures: TRILL (and the current proprietary implementations), SPB, QFabric, and to a lesser extent OpenFlow (data center use cases), among others. And although OpenFlow has been getting a lot of hype, that's more because VCs are drooling over it than because of its STP-slaying ways.

For a while in the early 2000s, it looked like we might get rid of it for the most part. There was a glorious time when we started to see multi-layer switches that could route Layer 3 as fast as they could switch Layer 2, giving us the option of getting rid of spanning tree entirely. Every pair of switches, even at the access layer, would be its own Layer 3 domain. Everything was routed to everywhere, and the broadcast domains were so small there was no possibility for Ethernet to take multiple paths. And with Layer 3 routing, multi-pathing was easy through ECMP. Convergence on a failed link was way faster than spanning tree.

Then virtualization came and screwed it all up. Now Layer 3 wasn't going to work for a lot of the workloads, and we needed to build huge Layer 2 networks. For some non-virtualization uses, though, the Layer 3-everywhere solution works great. To take a look at a wonderfully multi-path, high-bandwidth environment, check out Brad Hedlund's blog entry on creating a Hadoop super network with shit-tons of bandwidth out of 10G and 40G low-latency ports.

Overlays

Which brings me to overlays. There are some who propose overlay networks, such as VXLAN, NVGRE, and Nicira's, as solutions to the Ethernet multipathing problem (among other problems). An overlay technology like VXLAN not only brings us back to the glory days of no spanning tree by routing to the access layer, but solves another issue that plagues large-scale deployments: 4000+ VLANs ain't enough. VXLAN for instance has a 24-bit identifier on top of the normal 12-bit 802.1Q VLAN identifier, so that's 2^36 separate broadcast domains, giving us the ability to support 68,719,476,736 VLANs. Hrm… that would be…
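If you want the shell to do the napkin math for you (any shell with arithmetic expansion will do):

$ echo $((2**12))
4096
$ echo $((2**24))
16777216
$ echo $((2**12 * 2**24))
68719476736

That's the 12-bit VLAN space, the 24-bit VXLAN space, and the two multiplied together.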

While I like (and am enthusiastic about) overlay technologies in general, I'm not convinced they are the final solution we need for Ethernet's current forwarding limitations. Building an overlay infrastructure (at least right now) is a more complicated (and potentially more expensive) prospect than TRILL/SPB, depending on how you look at it. Availability is also an issue currently (likely to change, of course), since NVGRE has no implementations I'm aware of, and VXLAN only has one (Cisco's Nexus 1000v). Also, VXLAN doesn't terminate into any hardware currently, making it difficult to put in load balancers and firewalls that aren't virtual (as mentioned in the Packet Pushers' VXLAN podcast).

Of course, I'm afraid TRILL doesn't have it much better in the way of availability. Only two vendors that I'm aware of ship TRILL-based products, Brocade with VCS and Cisco with FabricPath, and both FabricPath and VCS only run on a few switches out of their respective vendors' offerings. As has often been discussed (and lamented), TRILL has a new header format, so new silicon is needed to implement TRILL (or TRILL-based) offerings in any switch. Sadly, it's not just a matter of adding new firmware; the underlying hardware needs to support it too. For instance, the Nexus 5500s from Cisco can do TRILL (and the code has recently been released) while the older Nexus 5000 series cannot.

It had been assumed that the vendors that use merchant silicon for their switches (such as Arista and Dell Force10) couldn't do TRILL, because the merchant silicon didn't support it. Turns out, that's not the case. I'm still not sure which chips from the merchant vendors can and can't do TRILL, but the much-ballyhooed Broadcom Trident/Trident+ chipset (BCM56840, I believe, thanks to #packetpushers) can do TRILL. So anything built on Trident should be able to do TRILL, which right now is a ton of switches. Broadcom is making it rain Tridents right now. The new Intel/Fulcrum chipsets can do TRILL as well, I believe.

TRILL does have the advantage of being stupid easy, though. Ethan Banks and I were paired up during NFD2 at Brocade, and tasked with configuring VCS (built on pre-standard TRILL). It took us 5 minutes and just a few commands. FabricPath (Cisco's pre-standard implementation built on TRILL) is also easy: 3 commands. If you can't configure FabricPath, you deserve the smug look you get from Smug Cisco Guy. Here is how you turn on FabricPath on a Nexus 7K:

switch# config terminal 
switch(config)# feature-set fabricpath
switch(config)# mac address learning-mode conversational vlan 1-10 
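And if you want to look at what the fabric actually built afterwards, NX-OS has a couple of show commands for it (syntax from memory, so it may vary by release):

switch# show fabricpath switch-id
switch# show fabricpath route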

Non-overlay solutions to STP that don't involve TRILL/SPB/QFabric/etc. include link aggregation (commonly known by Cisco's trademarked term EtherChannel) and MC-LAG (Multi-Chassis Link Aggregation), also known as MLAG, vLAG, vPC, or VSS depending on the vendor. They provide multi-pathing in the sense that while there are multiple active physical paths, no single flow will have more than one possible path, providing both redundancy and full link utilization. But it's all manually configured at each link, and not nearly as flexible (or easy) as TRILL to instantiate. MLAG/MC-LAG can provide simple multi-path scenarios, while TRILL is so flexible you can actually get yourself into trouble (as Ivan has mentioned here). So while MLAG/MC-LAG work as workarounds, why not just fix what they work around? It would be much simpler.

Vendor Lock-In or FUD?

Brocade's VCS and Cisco's FabricPath are currently proprietary implementations based on TRILL, and won't work with each other or any other version of TRILL. The assumption is that when TRILL becomes more prevalent, they will have standards-based implementations that will interoperate (Cisco and Brocade have both said they will). But for now, it's proprietary. Oh noes! Some vendors have decried this as vendor lock-in, but I disagree. For one, you're not going to build a multi-vendor fabric, like staggering two different vendors every other rack. You might not have just one vendor amongst your networking gear, but your server switch blocks, core/aggregation, and other such groupings of switches are very likely to be single vendor. Every product has a "proprietary boundary" (new term! I made it!). Even Token Ring, totally proprietary, could be bridged to traditional Ethernet networks. You can also connect your proprietary TRILL fabrics to traditional STP domains at the edge (although there are design concerns, as Ivan Pepelnjak has noted).

QFabric will never interoperate with another vendor; that's Juniper's secret sauce (running on Broadcom Trident+, if the rumors are to be believed). Still, QFabric is STP-less, so I'm a fan. And like TRILL, it's easy. My only complaint about QFabric right now is that it requires a huge port count (500+ 10 Gbit ports) to make sense (so does the Nexus 7000 with TRILL, but you can also do 5500s now). Interestingly enough, Juniper's Anjan Venkatramani did a hit piece on TRILL, but the joke is on them because it's on TechTarget behind a registration wall, so no one will read it.

So far, the solutions for Ethernet forwarding are as follows: overlay networks (may be fantastic for large environments, though very complex), Layer 3 everywhere (doable, but challenging in certain environments), and MLAG/MC-LAG (tough to scale and manually configured, but workable). All of that is fine. I've nothing against any of those technologies. In fact, I'm getting rather excited about VXLAN/Nicira overlays. I still think we should fix Layer 2 forwarding with TRILL, SPB, or something like it. And even if every vendor went full bore on one standard, it would be several years before we could totally rid our networks of spanning tree.

But wouldn’t it be grand?

The VDI Delusion Book Review

Sitting on a beach in Aruba (sorry, I had to rub that one in), I finished Madden & Company’s take on VDI: The VDI Delusion. The book is from the folks at brianmadden.com, a great resource for all things application and desktop delivery-related.

The book title suggests a bit of animosity towards VDI, but that's not actually how the authors feel about it. The delusion isn't about the actual technology of VDI, but the hype surrounding it (and the assumption many have that it's a solve-all solution).

So the book isn’t necessarily anti-VDI, just anti-hype. They like VDI (and state so several times) in certain situations, but in most situations VDI isn’t warranted nor is it beneficial. And they lay out why, as well as the alternative solutions that are similar to VDI (app streaming, OS streaming, etc.).

It’s not a deep-dive technical book, but it really doesn’t need to be. It talks frankly about the general infrastructure issues that come with VDI, as well as delivering other types of desktop services to users across a multitude of organizations.

It's good for the technical person (such as myself) who deals with VDI in an ancillary way (I've dealt with the network and storage aspects, but have never configured a VDI solution), as well as the salespeople and SEs who deal with VDI. In that regard, it has a wide audience.

I think Brian Drew over at Dell summed it up best:

For anyone dealing with VDI (who isn’t totally immersed in the realities of it and similar technologies) this is a must-read. It’s quick and easy, and really gets down to the details.

Creating Your Own SSL Certificate Authority (and Dumping Self Signed Certs)

Jan 11th, 2016: New Year! Also, there was a comment below about adding -sha256 to the signing (both self-signed and CSR signing) since browsers are starting to reject SHA1. Added (I ran through a test, it worked out for me at least).

November 18th, 2015: Oops! A few have mentioned additional errors that I missed. Fixed.

July 11th, 2015: There were a few bugs in this article that went unfixed for a while. They’ve been fixed.

SSL (or TLS if you want to be super totally correct) gives us many things (despite many of the recent shortcomings).

  • Privacy (stop looking at my password)
  • Integrity (data has not been altered in flight)
  • Trust (you are who you say you are)

All three of those are needed when you're buying stuff from, say, Amazon (damn you, Amazon Prime!). But we also use SSL for web user interfaces and other GUIs when administering devices in our control. When a website gets an SSL certificate, they typically purchase one from a major certificate authority such as DigiCert, Symantec (they bought VeriSign's certificate business), or, if you like the murder of elephants and freedom, GoDaddy. Certificates range from around $12 USD a year to several hundred, depending on the company and level of trust. The benefit that these certificate authorities provide is a chain of trust. Your browser trusts them, they trust a website, therefore your browser trusts the website (check my article on SSL trust, which contains the best SSL diagram ever conceived).

Your devices, on the other hand, the ones you configure and only your organization accesses, don't need that trust chain built upon the public infrastructure. For one, it could get really expensive buying an SSL certificate for each device you control. And secondly, you set the devices up, so you don't really need that level of trust. So web user interfaces (and other SSL-based interfaces) are almost always protected with self-signed certificates. They're easy to create, and they're free. They also provide you with the privacy that comes with encryption, although they don't do anything about trust. Which is why when you connect to a device with a self-signed certificate, you get one of those scary browser warnings. So you have the choice: buy an overpriced SSL certificate from a CA (certificate authority), or get those errors. Well, there's a third option, one where you can create a private certificate authority, and setting it up is absolutely free.

OpenSSL

OpenSSL is a free utility that comes with most installations of Mac OS X, Linux, the *BSDs, and Unixes. You can also download a binary copy to run on your Windows installation. And OpenSSL is all you need to create your own private certificate authority. The process for creating your own certificate authority is pretty straightforward:

  1. Create a private key
  2. Self-sign
  3. Install root CA on your various workstations
Once you do that, every device that you manage via HTTPS just needs to have its own certificate created with the following steps:
  1. Create CSR for device
  2. Sign CSR with root CA key
You can have your own private CA set up in less than an hour. And here's how to do it.
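Before anything else, it's worth two seconds to confirm OpenSSL is actually on your box:

openssl version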

Create the Root Certificate (Done Once)

Creating the root certificate is easy and can be done quickly. Once you do these steps, you’ll end up with a root SSL certificate that you’ll install on all of your desktops, and a private key you’ll use to sign the certificates that get installed on your various devices.

Create the Root Key

The first step is to create the private root key, which takes just one command. In the example below, I'm creating a 2048-bit key:

openssl genrsa -out rootCA.key 2048

The standard key sizes today are 1024, 2048, and to a much lesser extent, 4096. I go with 2048, which is what most people use now. 4096 is usually overkill (a 4096-bit key is about 5 times more computationally intensive than a 2048-bit key), and people are transitioning away from 1024. Important note: keep this private key very private. This is the basis of all trust for your certificates, and if someone gets a hold of it, they can generate certificates that your browser will accept. You can also create a key that is password-protected by adding -des3:

openssl genrsa -des3 -out rootCA.key 2048

You'll be prompted to give a password, and from then on you'll be challenged for the password every time you use the key. Of course, if you forget the password, you'll have to do all of this all over again.

The next step is to self-sign this certificate.

openssl req -x509 -new -nodes -key rootCA.key -sha256 -days 1024 -out rootCA.pem

This will start an interactive script which will ask you for various bits of information. Fill it out as you see fit.

You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Oregon
Locality Name (eg, city) []:Portland
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Overlords
Organizational Unit Name (eg, section) []:IT
Common Name (eg, YOUR name) []:Data Center Overlords
Email Address []:none@none.com

Once done, this will create an SSL certificate called rootCA.pem, signed by itself, valid for 1024 days, and it will act as our root certificate. The interesting thing about traditional certificate authorities is that their root certificates are also self-signed. But before you think about starting your own public certificate authority, remember the trick is getting that cert into every browser in the entire world.
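If you want to double-check what you just created, OpenSSL will happily dump the certificate back out in human-readable form:

openssl x509 -in rootCA.pem -text -noout
openssl x509 -in rootCA.pem -noout -subject -dates

The first command spits out everything; the second just the subject and the valid-from/valid-to dates.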

Install Root Certificate Into Workstations

For your laptops/desktops/workstations, you'll need to install the root certificate into your trusted certificate repositories. This can get a little tricky. Some browsers use the operating system's repository. For instance, in Windows both IE and Chrome use the default certificate management. Go to IE, Internet Options, the Content tab, then hit the Certificates button. In Chrome, go to Options, Under the Hood, and Manage Certificates. They both take you to the same place, the Windows certificate repository. You'll want to install the root CA certificate (not the key) under the Trusted Root Certification Authorities tab. However, on Windows Firefox has its own certificate repository, so if you use IE or Chrome as well as Firefox, you'll have to install the root certificate into both the Windows repository and the Firefox repository. On a Mac, Safari and Chrome use the Mac OS X certificate management system (the Keychain), so you just have to install it once there, though Firefox again keeps its own repository. With Linux, I believe it's on a browser-per-browser basis.
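If you'd rather do it from the command line (treat these as a sketch; exact behavior varies by OS version), the same root certificate can be pushed into the system stores with something like the following.

On Windows (from an administrator command prompt):

certutil -addstore -f "Root" rootCA.pem

On Debian/Ubuntu (this is the system store used by curl and friends; Firefox and Chrome on Linux keep their own NSS databases):

sudo cp rootCA.pem /usr/local/share/ca-certificates/rootCA.crt
sudo update-ca-certificates

On Mac OS X:

sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain rootCA.pem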

Create A Certificate (Done Once Per Device)

Every device on which you wish to install a trusted certificate will need to go through this process. First, just like with the root CA step, you'll need to create a private key (different from the root CA's).

openssl genrsa -out device.key 2048

Once the key is created, you’ll generate the certificate signing request.

openssl req -new -key device.key -out device.csr

You'll be asked various questions (Country, State/Province, etc.). Answer them how you see fit. The important question to answer, though, is the Common Name.

Common Name (eg, YOUR name) []: 10.0.0.1

Whatever you see in the address field of your browser when you go to your device must be what you put under Common Name, even if it's an IP address. Yes, even an IP (IPv4 or IPv6) address works under Common Name. If it doesn't match, even a properly signed certificate will not validate correctly and you'll get the "cannot verify authenticity" error.
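(Side note: if answering the same interactive questions for every single device gets old, OpenSSL will also take the whole subject on the command line with -subj, at least on the builds I've used:

openssl req -new -key device.key -out device.csr -subj "/C=US/ST=Oregon/L=Portland/O=Overlords/OU=IT/CN=10.0.0.1"

Same rule applies: the CN has to match whatever goes in the address bar.)

Once that's done, you'll sign the CSR, which requires the CA root key.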

openssl x509 -req -in device.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out device.crt -days 500 -sha256

This creates a signed certificate called device.crt which is valid for 500 days (you can adjust the number of days, of course, although it doesn't make sense to have a certificate that lasts longer than the root certificate). The next step is to take the key and the certificate and install them on your device. Most network devices that are controlled via HTTPS have some mechanism for you to install them. For example, I'm running F5's LTM VE (Virtual Edition) as a VM on my ESXi 4 host. Log into F5's web GUI (this should be the last time you're greeted by the warning), and go to System, Device Certificates, and Device Certificate. In the drop-down select Certificate and Key, and either paste the contents of the key and certificate files, or upload them from your workstation.
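(Before you bother uploading anything, you can also have OpenSSL confirm that the new certificate chains back to your root:

openssl verify -CAfile rootCA.pem device.crt

If it prints "device.crt: OK", you're in business.)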

After that, all you need to do is close your browser and hit the GUI site again. If you did it right, you’ll see no warning and a nice greenness in your address bar.
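You can also check it from the command line with s_client, pointing it at whatever address you used as the Common Name (10.0.0.1 in my example above):

openssl s_client -connect 10.0.0.1:443 -CAfile rootCA.pem

Somewhere in the output you should see "Verify return code: 0 (ok)".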

And speaking of VMware, you know that annoying message you always get when connecting to an ESXi host?

You can get rid of that by creating a key and certificate for your ESXi server and installing them as /etc/vmware/ssl/rui.crt and /etc/vmware/ssl/rui.key.
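It's the same song, different verse; here's roughly what it looks like (the hostname and the restart step are mine, so adjust for your environment, and remember the Common Name needs to match however you connect to the host):

openssl genrsa -out rui.key 2048
openssl req -new -key rui.key -out rui.csr
openssl x509 -req -in rui.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out rui.crt -days 500 -sha256
scp rui.crt root@esxi-host:/etc/vmware/ssl/rui.crt
scp rui.key root@esxi-host:/etc/vmware/ssl/rui.key

Then restart the management agents (or just reboot the host) and the warning goes away. (SSH needs to be enabled on the host for the scp, of course.)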

I, For One, Welcome Our New OpenFlow Overlords

When I first signed up for Networking Field Day 2 (The Electric Boogaloo), I really had no idea what OpenFlow was. I'd read a few articles, listened to a few podcasts, but still only had a vague idea of what it was. People I respect highly, like Greg Ferro of Packet Pushers, were into it, so it had my attention. But still, not much of a clue what it was. I attended the OpenFlow Symposium, which preceded the activities of Networking Field Day 2, and had even less of an idea of what it was.

Then I saw NEC (really? NEC?) do a demonstration. And my mind was blown.

Side note: let this be a lesson to all vendors. Everything works great in a PowerPoint presentation. It also conveys very little about what a product actually does. Live demonstrations are what get grumpy network admins (and we're all grumpy) giddy like schoolgirls at a Justin Bieber concert. You should have seen Ivan Pepelnjak.

I’m not sure if I got all my assumptions right about OpenFlow, so feel free to point out if I got something completely bone-headedly wrong. But from what I could gather, OpenFlow could potentially do a lot of things:

  • Replace traditional Layer 2 MAC learning and propagation mechanisms
  • Replace traditional Layer 3 protocols
  • Make policy-based routing (routing based on TCP/UDP port) something useful instead of the one-off, pain-in-the-ass, OK-just-this-one-time creature it is now
  • Create “traceroute on steroids”

Switching (Layer 2)

Switching is, well, rather stupid. At least the learning of MAC addresses and their locations is. To forward frames, switches need to learn which ports the various MAC addresses can be found on. Right now the only way they learn is by listening to the cacophony of hosts broadcasting and spewing frames. And when one switch learns a MAC address, it's not like it tells the others. No, in switching, every switch is on its own for learning. In a single Layer 2 domain, every switch needs to learn where to find every MAC address on its own.

Probably the three biggest consequences of this method are as follows:

  • No loop avoidance. The only way to prevent loops is to prevent redundant paths (i.e. spanning-tree protocol)
  • Every switch in a Layer 2 domain needs to know every frickin’ MAC address. The larger the Layer 2 domain, the more MAC addresses need to be learned. Suddenly, a CAM table size of 8,000 MAC addresses doesn’t seem quite enough.
  • Broadcasts like woah. What happens when a switch gets a frame that it doesn’t have a CAM entry for? BROADCAST IT OUT ALL PORTS BUT THE RECEIVING PORT. It’s the all-caps typing of the network world.
For a while in the early 2000s we could get away with all this. Multi-layer switches (switches that did Layer 3 routing as well) got fast enough to route as fast as they could switch, so we could easily keep our Layer 2 domains small and just route everything.

That is, until VMware came and screwed it all up. Now we had to have Layer 2 domains much larger than we’d planned for. 4,000 entry CAM tables quickly became cramped.

MAC learning would be more centralized with OpenFlow. ARP would still be there at the edge, so a server would still think it was communicating with a regular switched network. But OpenFlow could determine which switches need to know which MAC addresses are where, so not every switch needs to learn everything.

And no spanning tree. Loops are prevented by the OpenFlow controller. No spanning tree (although you can certainly do spanning tree at the edge to communicate with legacy segments).

Routing (Layer 3)

Routing isn't quite as stupid as switching. There are a number of good protocols out there that scale pretty well, but they do require configuration on each device. Routing is dynamic in that it can do multi-pathing (where traditional Layer 2 can't), as well as recover from dead links without taking down the network for several (dozens of) seconds. But it doesn't quite allow for centralized control, and it has limited dynamic ability. For instance, there's no mechanism to say "oh, hey, for right now why don't we just move all these packets from this source to that source" in an efficient way. Sure, you can inject some host routes to do that, but it's got to come from some sort of centralized controller.

Flow Routing (Layer 4)

So why stop at Layer 3? Why not route based on TCP/UDP header information? It can be done with policy-based routing (PBR) today, but it's not something that can be communicated from router to router (OSPF cares not how you want to direct a TCP port 80 flow versus a TCP port 443 flow). There is also WCCP, the Web Cache Communication Protocol, which today is mostly used not for web caches but for WAN optimization controllers, like Cisco's WAAS, or Cisco's sworn enemy, Riverbed (seriously, just say the word 'Riverbed' at a Cisco office).

Sure it’s watery and tastes like piss, but at least it’s not policy-based routing

A switch with modern silicon can look at Layer 3 and Layer 4 headers as easily as they can look at Layer 2 headers. It’s all just bits in the flow, man. OpenFlow takes advantage of this, and creates, for lack of a cooler term, a Layer 2/3/4 overlord.

I, for one, welcome our new OpenFlow overlords

TCAMs or shared memory, or whatever you want to call the forwarding tables in your multi-layer switches can be programmed at will by an OpenFlow overlord, instead of being populated by the lame-ass Layer 2, Layer 3, and sometimes Layer 4 mechanisms on a switch-by-switch basis.

Since we can direct traffic based on flows throughout a multi-switch network, there's a lot we can do with respect to load balancers, firewalls, IPS, caches, etc. Pretty interesting stuff.
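To make that a little more concrete, here's roughly what a flow entry looks like if you poke at Open vSwitch with ovs-ofctl (not NEC's controller, obviously, just an OpenFlow-speaking switch you can play with for free). This says "send anything destined for TCP port 80 out port 2", then dumps the flow table:

ovs-ofctl add-flow br0 "priority=100,tcp,tp_dst=80,actions=output:2"
ovs-ofctl dump-flows br0

A Layer 4 match and a forwarding decision, programmed into the switch from outside the switch. That's the whole trick.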

Flow View (or Traceroute on Steroids)

I think one of the coolest demonstrations from NEC was when they showed the flow maps. They could punch up any source and destination address (IP or MAC) and get a graphical representation of the flow (and which devices it went through) on the screen. The benefits of that are obvious. Server admins complain about slowness? Trace the flow, and check the interfaces on all the transit devices. That's something that might take quite a while in a regular route/switch network, but can be done in a few seconds with an OpenFlow controller.

An OpenFlow Controller Tracks a Flow

To some extent, there are other technologies that can take care of some of these issues. For instance, TRILL and SPB take a good whack at the Layer 2 bullshit. Juniper's QFabric does a lot of the ain't-nothin-but-a-tuple thang and switches based on Layer 2/3 information. But in terms of potential, I think OpenFlow has them all beat.

Don't get too excited right now though, as NEC is the only vendor that has a working OpenFlow controller, and other vendors are working on theirs. Stanford apparently has OpenFlow up and running in their environment, but it's all still in the early stages.

Will OpenFlow become the future? Possibly, quite possibly. But even if what we now call OpenFlow isn’t victorious, something like it will be. There’s no denying that this approach, or something similar, is a much better way to handle traffic engineering in the future than our current approach. I’ve only scratched the surface of what can be done with this type of network design. There’s also a lot that can be gained in terms of virtualization (an OpenFlow vSwitch?) as well as applications telling the network what to do. Cool stuff.

Note: As a delegate/blogger, my travel and accommodations were covered by Gestalt IT, which the vendors paid to have spots during Networking Field Day. Vendors pay Gestalt IT to present, so while my travel (hotel, airfare, meals) was covered indirectly by the vendors, no other remuneration (save for the occasional tchotchke) was received from any of the vendors, directly or indirectly, or from Gestalt IT. Vendors were not promised, nor did they ask for, any of us to write about them, or to write about them positively. In fact, we sometimes say their products are shit (when, to be honest, sometimes they are, although this one wasn't). My time was unpaid.

The Problem

One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.

It was a theme because, as vendor after vendor gave a presentation, they essentially said the same thing when describing the problem they were going to solve. For us delegates/bloggers, it quickly went from the problem to "The Problem". We'd heard it so often that during the (5th?) iteration of the same problem we all started laughing like a group of Beavis and Butt-Heads during a vendor's presentation, and we had to apologize profusely (it wasn't their fault, after all).

Huh huhuhuhuhuh… he said “scalability issues”

In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.

Hello my name is Simon, and I like to do draw-wrings

But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.

So what is "The Problem"? It's a multi-faceted and interrelated set of issues:

Virtualization Has Messed Things Up, Big Time

The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact that it's had on the network is arguably orders of magnitude greater. Virtualization wants big, flat networks, just when we'd gotten to the point where we could route Layer 3 as fast as we could switch Layer 2 and keep our networks small.

And it's not just virtualization in general; much of its impact comes from the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to get it done. Add to that the vSwitch sitting inside the hypervisor, and the limited functionality of that switch (and who the hell manages it anyway? Server team? Network team?).

4000 VLANs Ain’t Enough

If you’re a single enterprise running your own network, chances are 4000+ VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, 4000+ VLANs quickly becomes a problem. There is a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 VLANs (16 million or so).

Spanning Tree Sucks

One of my first introductions to networking was accidentally causing a bridging loop on a 10 megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I'd accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself by plugging in and unplugging the port to watch the utilization LED fluctuate (that is, until the network admin stormed in and asked what the hell was going on with his network).

And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.

OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.

Layer 2 Switching Isn’t Scaling

The current method by which MAC addresses are learned in modern switches causes two problems: only one viable path can be active at a time (the only way to prevent loops is to prevent multiple paths by blocking ports), and large Layer 2 networks involve so many MAC addresses that learning them all doesn't scale.

From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another tuple (another recurring word) for a frame/packet/segment traveling from one end of your datacenter to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.

There’s A Lot of Shit To Configure

Automation is coming, and in a big way. Whether it’s a centralized controller environment, or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcomed, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).

Data center automation is a little bit like the Gom Jabbar. They tried and failed, you ask? They tried and died.

“What’s in the box?”

“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”

Ethernet Rules Everything Around Me

It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”

But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.

Oof. Storage.

Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture references in my blog posts. Some companies are pro-iSCSI, others pro-FC/NFS, and some, like Greg Ferro, have the purest hate of all: he hates SCSI.

“Yo iSCSI, I’m really happy for you and imma let you finish, but Fibre Channel is the best storage protocol of all time”

So that’s “The Problem”. And for the most part, the articles on Networking Field Day, and the solutions the vendors propose will be framed around The Problem.

Fsck It, We’ll Do It All With SSDs!

At Tech Field Day 8, we saw presentations from two vendors with all-flash SAN offerings, taking on a storage problem that's been brewing in data centers for a while now: the skewed performance/capacity scale.

While storage capacity has been increasing exponentially, storage performance hasn't kept up nearly that fast. In fact, performance has been mostly stagnant, especially in the areas where it counts: latency and IOPS (I/O Operations Per Second).

It’s all about the IOPS, baby

In modern data centers, capacity isn’t so much of an issue with storage. Neither is the traditional throughput metric, such as megabytes per second. What really counts is IOPS and latency/seek time. Don’t get me wrong, some data center applications certainly have capacity requirements, as well as potential throughput requirements, but for the most part these are easily met by today’s technology.

IOPS and latency are super critical for virtual desktops (and desktops in general) and databases. If your computer is sluggish, it's probably not a lack of RAM or CPU; by and large it's a factor of IOPS (or the lack thereof).

There are a few tricks that storage administrators and vendors have up their sleeve to increase IOPS and drop latency.

In a RAID array, you can scale IOPS linearly by just throwing more disks at the array. If you have a drive that does 100 IOPS, add a second drive in a RAID 0 stripe and you've got double the IOPS. Add a third and you've got 300 IOPS (and of course add more for redundancy).

Another trick that storage administrators have up their sleeve is the technique known as "short stroking", where only a portion of the drive is used. On a spinning platter, the outside edge is moving the fastest, giving the best performance. If you only format that outer portion, the physical drive head doesn't have to travel as far. This can reduce seek time substantially.

Tiered storage can help with both latency and IOPS, where a combination of NVRAM, SSDs, and hard drives is used. "Hot" data is accessed from high-speed RAM cache, "warm" data is on a bank of SSDs, and "cold" data is stored on cheaper SAS or (increasingly) consumer SATA drives.

And still our demand for IOPS is insatiable, and in some cases the tricks aren't keeping up. Short stroking only goes so far, and cache misses can really hurt performance on tiered storage. While IOPS scale linearly with spindle count, getting the IOPS we need can mean racks full of spinning rust while only using a tenth of the actual capacity. That's a lot of wasted space and wasted power.

And want to hear a depressing fact?

A high-end enterprise SAS 15,000 RPM drive (which spins faster than most jet engines) gives you about 150 IOPS in performance (depending on the workload of course). A good consumer grade SSD from Newegg gives you around 85,000 IOPS. That means you would need almost 600 drives to equal the performance of one consumer grade SSD.
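The napkin math, if you want the shell to do it (integer division, but close enough):

$ echo $((85000 / 150))
566

Call it almost 600 once you round up and add a couple of spares.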

That’s enough to cause anyone to have a Bill O’Reilly moment.

600 drives? Fuck it, we’ll do it with all flash!

No one is going to put their entire database or virtual desktop infrastructure on a single flash drive of course. And that’s where vendors like Pure Storage and SolidFire come into play. (You can see Pure Storage’s presentation at Tech Field Day 8 here. SolidFire’s can be seen here.)

The overall premise of the we'll-do-it-all-in-flash play is that you can take a lot of consumer-grade flash drives, use the shitload of IOPS that they bring, and combine it with a lot of storage controller CPU power for deduplication and compression. With that combination, they can offer an all-flash array at the same price per gig as traditional arrays comprised of spinning rust (disk drives).

How many IOPS are we talking about? SolidFire's SF3010 claims 50,000 IOPS per 1 RU node. That would replace over 300 traditional drives, which I don't think you can fit in 1 RU. Pure Storage claims 300,000 IOPS in 8 RU of space. With a traditional array, you'd need over 2,000 drives, also unlikely to fit in 8 RU. Also, imagine the power savings, with only 250 watts needed for SolidFire's node, and 1,300 watts for the Pure Storage cluster. And both allow you to scale up by adding more nodes.

You wire them into your SAN the traditional way as well. The Pure Storage solution has options for 10 Gbit iSCSI and 8 Gbit Fibre Channel, while the SolidFire solution is iSCSI only. (Sadly, neither supports FCoE or FCoTR.)

For organizations that are doing virtual desktops or databases, an all-flash storage array with the power savings and monster IOPS must look more tantalizing than a starship full of green-skinned girls does to Captain Kirk.

There is a bit of a controversy in that many of the all-flash vendors will tell you capacity numbers with deduplication and compression taken into account. At the same time, if the performance is better than spinning rust even with the compression/dedupe, then who cares?

So SSD it is. And as anyone who has an SSD in their laptop or desktop will tell you, that shit is choice. Seriously, I get all Charlton Heston about my SSD.

You’ll pry this SSD from my cold, dead hands

It's not all roses and unicorn-powered SSDs. There are two issues with the all-flash solutions thus far. One is that they don't have a name like NetApp, EMC, or Fujitsu, so there is a bit of a trust issue there. The other is that many have negative preconceptions about flash, such as a high failure rate (due to a series of bad firmwares from vendors) and the limited write cycles of memory cells (true, but mitigatable). Pure Storage claims to have never had a drive fail on them (Amy called them flash drive whisperers).

Still though, check them (and any other all-SSD vendor) out. This is clearly the future in terms of high performance storage where IOPS is needed. Spinning rust will probably rule the capacity play for a while, but you have to imagine its days are numbered.