Do We Need Chassis Switches Anymore in the DC?

While Cisco Live this year was far more about the campus than the DC, Cisco did announce the Cisco Nexus 9364C, a spine-oriented switch which can run in both ACI mode and NX-OS mode. And it is a monster.

It’s (64) ports of 100 Gigabit. It’s built from a single SoC (the Cisco S6400 SoC).

It provides 6.4 Tbps in 2RU, likely running below 700 watts (probably a lot less). I mean, holy shit.

Cisco Nexus 9364C: (64) ports of 100 Gigabit Ethernet.

And Cisco isn’t the only vendor with an upcoming 64-port 100 Gigabit switch in a 2RU form factor. Broadcom’s Tomahawk II, successor to their 25/100 Gigabit data center SoC, also sports (64) 100 Gigabit interfaces. I would expect the usual suspects (Arista, Cisco Nexus 3K, Juniper, etc.) to announce switches based on it soon.

And another vendor, Innovium, while far less established, is claiming to have a chip in the works that can do (128) 100 Gigabit interfaces. On a single SoC.

For modern data center fabrics, which rely on leaf/spine Clos-style topologies, do we even need chassis anymore?

For a while we’ve been bound by the Sith rule at our core/aggregation layer: always two. A core/aggregation layer is the traditional (or, some might now say, legacy) style of building a network. Because of how spanning tree, MC-LAG, etc., work, we were limited to two devices. This Core/Aggregation/Access topology is sometimes referred to as the “Christmas Tree” topology.

Traditional “Christmas Tree” Topology

Because we could only have two at the core and/or aggregation layer, it was important that these two devices be highly redundant. Chassis would allow redundancy in critical components, such as fabric modules, line cards, supervisor modules, power supplies, and more.

Fixed switches tend not to have nearly the same redundancies, and as such weren’t often a good choice for that layer. They’re fine for access, but for your hosts’ default gateways, you’d want a chassis.

Leaf/spine Clos topologies, which rely on Layer 3 and ECMP and aren’t restricted the way Layer 2 spanning tree and MC-LAG are, are seeing a resurgence after having been banished from the DC because of vMotion.

Leaf/Spine Clos Topology

Modern data center fabrics utilize overlays like VXLAN to provide the Layer 2 adjacencies required by vMotion. And we’re no longer limited to just two devices at the spine layer: you can have 2, 3, 4… sometimes up to 16 or more, depending on the fabric. They don’t have to be an even number, nor do they need to be a power of two, now that most switches use more than a 3-bit hash for ECMP (the 3-bit hash was the origin of the old powers-of-two rule for LAG/ECMP).
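
To make the power-of-two point concrete, here’s a minimal, hypothetical sketch of hash-based ECMP path selection. The hash function and field choice are illustrative assumptions, not any vendor’s actual implementation:

```python
# Minimal sketch of hash-based ECMP path selection (illustrative only, not
# any vendor's actual hashing): a wide hash of the flow's 5-tuple taken
# modulo the number of spines works for any spine count. The old
# power-of-two restriction came from hardware that only used a few hash bits.
import hashlib

def pick_spine(src_ip, dst_ip, src_port, dst_port, proto, num_spines):
    """Return the index of the spine this flow's packets will take."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()                  # wide, well-mixed hash
    return int.from_bytes(digest[:4], "big") % num_spines

# The same flow always maps to the same spine for a given spine count,
# whether there are 3, 5, or 6 spines.
for n in (3, 5, 6):
    print(n, "spines ->", pick_spine("10.0.0.1", "10.0.1.1", 49152, 443, 6, n))
```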

Now we have an option: Do leaf/spine designs concentrate on larger, more port-dense chassis switches for the spine, or do we go with fixed 1, 2, or 4RU spines?

The benefit of a modular chassis is that you can get a lot more ports out of a single box. They also tend to have highly redundant components, such as fans, power supplies, supervisor modules, fabric modules, etc. If any single component fails, the chassis is more likely to keep on working.

They’re also upgradable. Generally you can swap out many of the components, allowing you to move from one generation of network speeds to the next without replacing the entire chassis. For example, on the Nexus 9500 you can go from 10/40 Gigabit to 25/100 Gigabit by swapping out the line cards and fabric modules.

However, these upgrades are comparatively expensive. In most cases, fixed spines would be far cheaper to swap out entirely than a modular chassis would be to upgrade.

And redundancy can be provided by adding more spines. Even two spines give some redundancy, but 3, 4, or more can provide better component redundancy than a chassis.
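
As a rough illustration (assumed, idealized numbers) of why more spines shrink the impact of any single failure:

```python
# Back-of-the-envelope illustration (assumed, idealized numbers): with N
# equal spines, losing one spine removes only 1/N of the fabric's capacity,
# so more spines means a smaller hit per failure.
for n_spines in (2, 3, 4, 6, 8):
    surviving = (n_spines - 1) / n_spines
    print(f"{n_spines} spines: one failure leaves {surviving:.0%} of fabric capacity")
```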

So chassis or fixed? I’m leaning more towards a larger number of fixed switches. It would be more cost effective in just about every scenario I can think of, and it still provides the same forwarding capacity as a more expensive chassis configuration.

So yeah, I’m liking the fixed spine route.

What do you think?


9 Responses to Do We Need Chassis Switches Anymore in the DC?

  1. James says:

    I totally agree. If people have been willing to adapt their DC PODs to the concept of FEX and install them, then this is the obvious next leap.

  2. Simon says:

    The redundancy aspect is huge – you can do maintenance on one spine switch and still have a working fabric.
    Upgrades between generations stop requiring outages too: if you’re just pushing packets at Layer 3 with ECMP, without even VRRP, swapping bits out becomes transparent.

  3. Art Fewell says:

    I have felt this way for years; you should not buy a modular chassis for this reason. But there are other problems with fixed-switch design you should consider. The bigger reason to look at a modular chassis is the cost of Clos. If you want a non-oversubscribed fabric, you have to burn half of the ports uplinking … and what would you uplink to? For just 2 leafs you burn all 64 ports if using this switch as the spine, so then you need a fancy spine; its size will dictate the size of the pod, and it is actually less flexible than a modular, mostly because uplinks in a modular chassis tend to be much cheaper. In a modular chassis, the non-blocking uplinks to the other linecards are included. For example, in a fixed chassis I pay for 64 ports and use 32 as uplinks, but if that is a linecard with 64 external ports, there is the equivalent of an additional 64 internal ports included in the cost that are not included in the fixed chassis. The simple math of this means most linecards end up being cheaper per actual bandwidth. A 64-port 10GbE linecard is typically priced close to a 64-port fixed chassis switch.

    If you actually run some topology designs with this in mind, you will see very drastic differences. You can determine the number of switches and uplinks for a given Clos by knowing the number of access ports and the oversub ratio; if you want non-oversubscribed it’s simply 2:1, so half your ports are uplinked to the spine … BUT remember, this spine is only east-west traffic. In a Clos, the spine is literally equivalent to a chassis backbone and does not uplink to anything other than the ToRs, which means north-south uplinks need to be reserved on your ToRs in addition to the half you reserved for the spine uplinks. Even if you did uplink the spine directly, it still would not change the fact that you have to burn physical ports on these fixed switches, a LOT of them that you just don’t think about in a modular, since the backbone of the switch is included. Run a couple of designs needing 200, 500, 1000 access ports and build them out; you will see what I mean.

    • tonybourke says:

      Take a typical 48×6 switch (48 ports of 10/25 and 6 ports of 40/100) and you’ve got about 2:1 oversubscription (less if you do 100 Gigabit uplinks and 10 Gigabit host-facing, which is where most deployments are right now). With 6 spines and 64 leafs you can build about 3,000 host-facing ports in a pod (a quick sketch of this math follows below). That would fit most scales.

      If that’s not big enough, you can add a third stage, hooking the spines into ultra spines, though these pods would be oversubscribed. It would depend on how the east/west traffic was managed.

      I think there are some situations where a chassis makes more sense, but given the cost effectiveness of fixed spines, they probably make sense in more situations than we’d previously collectively thought.
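
For the curious, here is a back-of-the-envelope version of the 48×6 pod math above. The port speeds and counts are the assumptions from the comment; treat this as a sketch, not a design tool:

```python
# Back-of-the-envelope pod sizing for the 48x6 leaf / 64-port spine example
# above (assumed speeds and counts; a sketch, not a design tool).
def pod_size(host_ports, host_gbps, uplinks, uplink_gbps, spine_ports):
    spines = uplinks                  # one uplink from each leaf to each spine
    leafs = spine_ports               # each spine port connects one leaf
    host_facing = leafs * host_ports
    oversub = (host_ports * host_gbps) / (uplinks * uplink_gbps)
    return spines, leafs, host_facing, oversub

for host_gbps, uplink_gbps in ((10, 40), (10, 100), (25, 100)):
    spines, leafs, hosts, ratio = pod_size(48, host_gbps, 6, uplink_gbps, 64)
    print(f"{host_gbps}G hosts / {uplink_gbps}G uplinks: "
          f"{spines} spines, {leafs} leafs, {hosts} host ports, "
          f"{ratio:.1f}:1 oversubscription")
```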

      • Art Fewell says:

        You can definitely build any size as CLOS or Classic, and you can build either CLOS or Classic with only fixed or with only modular or a combo of fixed and modular.

        For my comments I am assuming a Clos-style physical cable plant, and asking whether that makes more sense to build with a modular or fixed chassis. As a basis for my argument, I am going to assume that a fixed 48x10GbE switch is similar in price to a 48x10GbE linecard. Of course the linecard needs additional chassis costs to operate, and conversely with fixed you have a much larger quantity of managed endpoints and a LOT more cables, among other overhead, so for the purpose of this argument I will consider this extra overhead to be moot.

        When you weigh the cost of these options, remember that the 48x10GbE linecard is actually a 96-port card, with 48 external-facing ports and 48 internal-facing ports (assuming non-blocking; if oversubscribed, a similar concept still applies). These internal ports provide connectivity to all the other modules in the chassis. Assume you have a chassis with 6 48-port modules: those modules can all connect to each other inherently, and you can use all 48 ports to connect to servers (or other non-fabric uplinks). Alternatively, if you tried to build the same system with fixed 48-port switches, you would need physical cables for each switch to connect to the others, rather than the chassis backplane.

        In a modular chassis, 6 48-port blades give you 6×48=288 usable external-facing ports, and the linecards include the additional 6×48 internal fabric ports for the blades to connect to each other. In the fixed design, you need to burn half the ports as uplinks, so you need 12 x 48-port switches just to provide the 288 leaf ports and another 288 spine-facing ports. But you still don’t have a spine, so now you need another 6 of these 48-port switches to act as the spine layer, and this spine layer would be at 100% capacity and unable to scale out without some serious re-architecture. So you need 18 fixed switches in this scenario, with 288 uplink cables, and that is only to provide 288 usable ports, all of which could be done in a pair of chassis while still using a nice Clos spine/leaf architecture (there’s a sketch of this arithmetic below).

        So I am a big fan of fixed and using it where it makes sense, but just because you can design almost any size network with a fixed-chassis-clos design, does not mean you should 🙂
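
To put numbers on the 288-port comparison above, here is a minimal sketch of that arithmetic under the assumptions stated in the comment (48-port non-blocking fixed switches vs. 48-port non-blocking linecards):

```python
# Reproducing the arithmetic above (assumptions from the comment: 48-port
# non-blocking fixed switches vs. 48-port non-blocking linecards).
def fixed_clos(usable_ports, switch_ports=48):
    hosts_per_leaf = switch_ports // 2      # non-blocking: half down, half up
    leafs = usable_ports // hosts_per_leaf  # 288 / 24 = 12 leafs
    cables = leafs * hosts_per_leaf         # 288 leaf-to-spine cables
    spines = cables // switch_ports         # 288 / 48 = 6 spines, fully loaded
    return leafs + spines, cables

def modular_chassis(usable_ports, ports_per_card=48):
    linecards = usable_ports // ports_per_card  # 288 / 48 = 6 linecards
    return linecards, 0                         # fabric "cabling" is the backplane

print("fixed-switch Clos:", fixed_clos(288))       # -> (18 switches, 288 cables)
print("modular chassis  :", modular_chassis(288))  # -> (6 linecards, 0 cables)
```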

    • ablogger says:

      Not sure I understand why an access-agg topology requires fewer ports than spine-leaf. For the same uplink bandwidth, the same number of ports is needed, given that the port types are the same!

      L2 spine-leaf is apparently possible with TRILL, but I have not come across any implementation. L2 suffers from loop and multi-path issues. The band-aids like xSTP and MLAG are hard to operate, not very scalable, and wasteful. On the other hand, L3 handles loops and multi-path very well. The ability to summarize L3 addresses also helps in keeping forwarding table sizes reasonable. Those are the reasons why L3 spine-leaf is so prevalent.

      Chassis were a good cost-cutting measure when control and switching used to be expensive, since centralizing those functions reduced cost. Chassis are relevant in the general IT space where proprietary (semi) high-speed links can be leveraged within the chassis.

  4. Art Fewell says:

    I should be more clear here that I really do agree with your point: with the massive increases in density of both compute and network, people really need to think about what sizes they actually need. I remember when the original Trident chip came out, a lot more people were using 6509s, which maxed out at 64x10GbE ports – something the Trident could do much better in 1RU. It’s important to remember that in every city the number of Cisco reps is a function of total sales and margins, and if you can replace a pair of 6509s with a cheap pair of 1RU fixed switches (or the more modern equivalent), clearly your sales reps cannot sell you the 1RU switches if they are going to keep their jobs … that isn’t a Cisco dig, it is true of every vendor in that position and it’s a tough problem for them … and it is absolutely real: every customer should expect they are likely being oversold now more than at any point they can remember, due to these new innovations. I would imagine a large majority of 7K/9K chassis customers could have been perfectly fine with a pair of 3Ks, but no rep would have ever told them that.

    As for fixed switches, I became a really big fan back with the Trident; at that time fixed were always faster to market and always cheaper, but then Arista changed that, and now with the 9Ks it seems modular are consistently getting to market as fast, and cheaper per bandwidth when you consider the internal backplane-facing ports. I was very enamored with fixed switches, and it became my job to write marketing materials in support of that architecture and do the TCO modeling. I wrote a program that would calculate a Clos-cabled BOM for a fixed and for a modular design, for an arbitrary number of ports at a given oversub ratio, and after all that work I felt I had been a bit too giddy about the merits of fixed: they are awesome and will remove the need for many to have giant modular chassis, but they still have their weaknesses.

  5. Todd Craw says:

    I started writing an article about this around my time at Insieme, and I could not get any of my vendor friends to co-author it with me due to the companies they worked for. Here is what I had 2-3 years ago (a few versions exist):

    Does The Network Chassis Have A Future?

    The networking industry is undergoing profound transformation today. SDN, open computing and disaggregation are reactions to what has become a very conservative, vendor dominated business. Progressive network engineers need to prepare themselves for the changes coming and stay relevant in skills and knowledge. The rise of big data, cloud computing and web services has made the data center the source of most of the innovations in modern networking and so it would be prudent to examine the trends coming from the top data center operators. The processes and technologies developed in a Google, Facebook, Amazon, Microsoft or Ebay eventually become mainstream as they move on to newer innovations. The best examples are the many automation tools, Mesos, Hadoop and CoreOS. This article is not intended to be pro or anti any vendor but purely about technology and architecture evolutions.
    Our premise is that network engineers should examine the validity of using a chassis solution in networking and ask some tough questions: does a chassis still provide value, or is it just a vendor lock-in strategy? What functions do I need from a chassis? What functions are deprecated by modern architectures and technologies? We have helped develop, sell, market and deploy many chassis solutions over the last 15 years and the results of asking these questions may be considered provocative or controversial today.

    If we look at the history of the network chassis, it evolved in the mid-1990s and served the function of scaling bandwidth and capacity while providing a central point of control using CLI, SNMP, etc. This was the nascent period of networking and density was very low, so a chassis provided a lot of value and they were welcomed. Initially they had a central forwarding plane, which evolved into a distributed control plane, and with this evolution came increasing complexity.
    The Problem… And How Some Big Thinkers Solved It.

    With the rise of big data and the cloud, companies like Amazon and Google saw the network becoming increasingly problematic. Networks did not come down in cost at scale like other technologies, so building large networks was very expensive and involved huge numbers of costly interconnects. L2 technologies did not scale and were brittle. Some very smart people at these companies examined the problem, and they all seem to have reached the same conclusions about networking for their environments: L3 was inherently more scalable (see http://en.wikipedia.org/wiki/Internet) and less brittle with finer failure domains, it was easier to make the applications smarter and the network simpler, and automation was crucial.

    L2 networking is proven not to scale due to limited namespaces (VLANs), the brittleness of MAC learning and MAC table size at scale, spanning tree and its coarse failure domains, etc. Vendors applied technology band-aids over the years, including QinQ, PBB, TRILL, SPB, vPC, MLAG, etc., to address many issues with networking in the data center, but these are still brittle and do not scale in a cost-effective way. The large providers are all using an L3 (IP) fabric of some type, usually based on the principles of Charles Clos, and usually on commodity products as a result.
    What Does A Modern Chassis Look Like?

    A modern switch or router chassis is a complicated distributed system. Many hardware vendors have traditionally operated under the assumption that the chassis has provable correctness in its processing. Computer science has already proven this wrong, and it has also been seen in practical applications, such as distributed forwarding methods or flow routing/switching losing consistency without a way to self-repair and achieve eventual consistency. It is common for a modern chassis to implement a consistency check for the various tables, RIBs (Routing Information Base – system DRAM) and FIBs (Forwarding Information Base – usually CAM) in the system.

    Consider running a consistency checker on a modern router or switch chassis that supports a million-entry route table: add the number of Input/Output (I/O) modules (let’s say 16) and the number of forwarding engines (let’s say 2). The number of entries that need to be checked from RIB to FIB to hardware makes this very CPU intensive, and it will usually be batch scheduled. If you start a consistency checker that can examine 8k entries per second, it will be a long time before the check finishes for a million routes. It will probably take well over 15 minutes.
    All vendors have had issues because of this but if we use the largest vendor for a well-known example: Cisco Express Forwarding (CEF) was a great technology for scaling forwarding performance by using a distributed FIB and was one of the first times a distributed system approach was taken in networking but it was also notorious for having consistency problems due to software bugs and Cisco eventually had to implement a consistency checker for the 7500 and GSR 12000 series routers: http://www.cisco.com/en/US/tech/tk827/tk831/technologies_tech_note09186a00800946f7.shtml. There are also many notes and tools on troubleshooting CEF and a small library of books on the technology.
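
For reference, here is the rough arithmetic behind the timing claim above, using the figures given (one million routes, 8k entries per second, 16 I/O modules plus 2 forwarding engines) and assuming the copies are checked serially:

```python
# Rough arithmetic behind the "over 15 minutes" claim, using the figures
# above (assumption: copies are checked one after another).
routes = 1_000_000
rate = 8_000        # entries examined per second by the consistency checker
copies = 16 + 2     # I/O modules plus forwarding engines, each with a FIB copy

per_copy_s = routes / rate        # 125 seconds to walk one copy
total_s = per_copy_s * copies     # ~2,250 seconds if checked serially

print(f"one copy: {per_copy_s/60:.1f} min, all {copies} copies: {total_s/60:.1f} min")
```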

    Making The Network Simpler

    If you look at the big data/cloud businesses such as Google, Amazon, Facebook, eBay, etc., it is clear that they understood the problems of a switch chassis as a complex distributed system. They run MASSIVE distributed systems that need to ensure provable, eventual consistency. These businesses cannot afford distributed-systems errors, or they will ship the wrong thing to the wrong person, expose personal data to the wrong people, or offer a service that does not function properly.

    The fact is that networking vendors in general still have not understood this problem. This is a HUGE issue for the large data center operators and one of the reasons why many are building their own switches and routers or using an open source OS on commodity hardware. It is also why they focus on deploying many small boxes over fewer bigger ones. Routing protocols do a REALLY good job of ensuring eventual consistency (at least if you’ve implemented them with an understanding of why some of the rules are in the protocols). As table sizes have gotten bigger and/or we’ve had to add more distributed components to operate the box, we’ve introduced more and more places for errors to creep in. This has been going on for a long time, and network vendors have many intense internal discussions, and discussions with customers, around these issues.

    Clos In A Box – Modern Chassis Design

    A modern chassis is usually based on merchant silicon and looks like a Clos fabric inside a chassis of sheet metal. The only difference is that instead of using well-understood, standard protocols like OSPF, IS-IS or BGP to communicate, the components use proprietary mechanisms.
    We submit that it is much simpler to use standard routing protocols between these components and either use fixed switches or a very simple chassis that provides common power, cooling and simplifies interconnection of the Leaf (Front Ports) and Spine (Switch Fabric) components. It is not common knowledge but it can be discerned from different sources that this is exactly the approach that the large data center providers have chosen in many cases. Troubleshooting and repairing an open, well-defined protocol is much easier than troubleshooting and repairing a proprietary method. Routing protocols have been proven over years of standardization, software development and testing and fixed switches with routing protocols are more predictable in their behavior.

    The Scale Problem Goes Away

    The scalability of fixed switches also makes this much more viable: today you can get a single merchant silicon ASIC, and a reference-platform fixed switch built around it, that provides over a Tbps of performance, with new solutions coming out that will take this close to 3 Tbps. That is so much performance in one ASIC/fixed switch that it makes the chassis unnecessary for most customers. You can use fixed switches today to create a leaf-and-spine Clos network that will easily support thousands of servers in a single pod or cluster, and you can scale this with multi-tier fabrics to tens and hundreds of thousands of servers today.
    “As technology has advanced and fixed form factor solutions have proven lower cost, lower latency with a higher degree of reliability, we have become believers that the spine and leaf fabric design which Dell Networking has based on open standards is not only here to stay but, in fact, has a substantial advantage over competing solutions such as proprietary lock-in fabrics.”
    –Shane Stakem, Senior Director, Network Operations, Joyent Inc.

    (Source: Is-Dell-Driving-the-Open-Future-of-Networking-By-Moor-Insights-and-Strategy.pdf)
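
For a rough sense of the scale described above, the standard folded-Clos arithmetic (idealized, non-blocking, assuming a single switch radix) goes like this:

```python
# Standard folded-Clos scale math (idealized, non-blocking, one switch radix R):
# a two-tier leaf/spine supports R^2/2 host ports; a three-tier fat-tree
# supports R^3/4. Real designs trade some of this away for oversubscription.
for radix in (32, 64, 128):
    two_tier = radix ** 2 // 2
    three_tier = radix ** 3 // 4
    print(f"{radix}-port switches: {two_tier:,} hosts in two tiers, "
          f"{three_tier:,} hosts in three tiers")
```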

    Automation – The Missing Piece

    Mike: all about modern automation tools and the superiority vs. a CLI or SNMP managing multiple forwarding planes, maybe compare with blade servers? Individual server instead of one mega server with central control plane. This section was not finished…

    Conclusion

    Modern network architectures are moving to a simple leaf-and-spine Layer 3 (IP) fabric utilizing fixed switches or a simple chassis with common power, cooling and interconnection. Standard routing protocols are the preferred communication method between these forwarding planes. Modern automation tools make it much easier to deploy all of these forwarding planes and control their configurations, and using automation lowers operating expenses and makes human error far less likely. The traditional network chassis is looking more like a liability in a modern network. It will be interesting to see how the network chassis evolves in the future.
