Changing Data Center Workloads

Networking-wise, I’ve spent my career in the data center. I’m pursuing the CCIE Data Center, and I study virtualization, storage, and DC networking. The networking landscape is constantly changing, as it has been for the past 15 years. However, with SDN, merchant silicon, overlay networks, and more, the rate of change in the data center network seems to be accelerating.

[Image: Things are changing fast in data center networking. You get the picture.]

Whenever you have a high rate of change, you’ll end up with a lot of questions such as:

  • Where does this leave the equipment I’ve got now?
  • Would SDN solve any of the issues I’m having?
  • What the hell is SDN, anyway?
  • I’m buying vendor X; should I look into vendor Y?
  • What features should I be looking for in a data center networking device?

I’m not actually going to answer any of these questions in this article. I am, however, going to profile some of the common workloads found in data centers today. Your data center may have one, a few, or all of these workloads, or it may not have any of them. Your data center may have one of the workloads listed, but my description and/or requirements may be way off. All certainly possible. These are generalizations, and as with all generalizations, your mileage may vary. With that disclaimer out of the way, strap in. Let’s go for a ride.

Traditional Virtualization

It’s interesting to refer to something that only exploded into the data center in a big way around 2008 as “traditional” rather than “new-fangled”, but that’s the situation we have here. The traditional virtualization workload is centered primarily around VMware vSphere. There are other traditional virtualization products of course, such as Red Hat’s RHEV, Xen, and Microsoft Hyper-V, but VMware has by far the largest market share.

  • Latency is not a huge concern (30 microseconds is not a big deal)
  • Layer 2 adjacencies are mandatory (required for vMotion)
  • Large Layer 2 domains (thousands of hosts layer 2 adjacent)
  • Converged infrastructures (storage and data running on the same wires, FCoE, iSCSI, NFS, SMB3, etc.)
  • Buffer requirements aren’t typically super high. Bursting isn’t much of an issue for most workloads of this type.
  • Fibre Channel is often the storage protocol of choice, along with NFS and some iSCSI

Cisco has been especially successful in this realm with the Nexus line because of vPC, FabricPath, OTV and (to a much lesser extent) LISP, as they address some of the challenges with workload mobility (though not all of them, such as the speed of light). Arista, Juniper, and many others also compete in this particular realm, but Cisco is the market leader.

With multi-pathing Layer 2 technologies such as SPB, TRILL, Cisco FabricPath, and Brocade VCS (the latter two are based on TRILL), you can build multi-spine leaf/spine (Clos) networks that you can’t build with spanning-tree-based networks, even with MLAG.
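
To illustrate why that matters, here’s a back-of-the-envelope comparison (my own hypothetical numbers, not from any particular product) of how much uplink bandwidth a single leaf can actually use under spanning tree, MLAG, and a multi-pathing fabric:

    # Hypothetical leaf switch with 4 x 40 Gbit uplinks, one to each of 4 spines.
    # Spanning tree forwards on only one uplink; MLAG can use two (one per peer
    # switch in the pair); TRILL/SPB/FabricPath/VCS can ECMP across all of them.
    uplinks = 4
    uplink_gbps = 40

    stp_usable = 1 * uplink_gbps              # single forwarding uplink
    mlag_usable = 2 * uplink_gbps             # port-channel split across an MLAG pair
    multipath_usable = uplinks * uplink_gbps  # all uplinks forwarding

    print(f"STP:       {stp_usable} Gbit/s usable uplink")
    print(f"MLAG:      {mlag_usable} Gbit/s usable uplink")
    print(f"Multipath: {multipath_usable} Gbit/s usable uplink")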

This type of network is what I typically see in data centers today. However, there is a shift towards Layer 3 networks and cloud workloads over traditional virtualization, so it will be interesting to see how long traditional virtualization lasts.

VDI

VDI (virtual desktop infrastructure) is a workload with essentially the same requirements as traditional virtualization, with one main difference: the storage requirements are much, much higher.

  • Latency is not as important (most DC-grade switches would qualify), especially since latency is measured in milliseconds for remote desktop users
  • Layer 2 adjacencies are mandatory (required for vMotion)
  • Large Layer 2 domains
  • Converged infrastructures
  • Buffer requirements aren’t typically very high
  • High-end storage backends. All about the IOPS, y’all

For storage here, IOPS are the biggest concern. VDI eats IOPS like candy.
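
To give a rough sense of scale (the per-desktop numbers below are placeholder assumptions of mine, not vendor sizing figures), here’s a quick back-of-the-envelope sketch:

    # Hypothetical VDI sizing sketch. The per-desktop IOPS figures are
    # illustrative assumptions only; real values vary wildly by workload.
    desktops = 1000
    steady_state_iops_per_desktop = 25   # assumed steady-state average
    boot_storm_multiplier = 5            # assumed spike when everyone logs in at 9 am

    steady_state = desktops * steady_state_iops_per_desktop
    boot_storm = steady_state * boot_storm_multiplier

    print(f"Steady state: {steady_state:,} IOPS")   # 25,000 IOPS
    print(f"Boot storm:   {boot_storm:,} IOPS")     # 125,000 IOPS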

Legacy Workloads

This is the old, old school. And by old school, I mean the late 90s and early 2000s, before virtualization changed the landscape. There are still quite a few crusty old servers, with uptimes measured in years, running long-abandoned applications. The problem is, these applications are usually mission-critical and/or generating significant revenue, and organizations just haven’t found a way out of them yet. And hey, they’re working right now. Often running on proprietary Unix systems, they couldn’t or wouldn’t be migrated to a virtualized environment (where they would be much easier to deal with).

The hardware still works, so why change something that works? Because it would be tough to find more, and it’s probably out of vendor support.

  • Latency? Who cares. Is it less than 1 second? Good enough.
  • Layer 2 adjacencies, if even required, are typically very small, usually just needed for the local clustering application (which is usually just stink-out-loud awful)
  • 100 megabit and gigabit Ethernet typically. 10 Gigabit? That’s science-fiction talk!
  • Buffers? You mean like, what shines the floor?

[Image: buffers]

My own personal opinion is that this is the only place where Cisco Catalyst switches belong in a data center, and even then only because they’re already there. If you’re going with Cisco, I think everything else (and everything new) in the DC should be Nexus.

Cloud Workloads (Private Cloud)

If you look at a cloud workload, it looks very similar to the traditional virtualization workload above. Both use VMs sitting on top of hypervisors, and both have an underlying infrastructure of compute, network, and storage to support those VMs. The difference is primarily in the operational model.

It’s often described as the difference between pets and cattle. With traditional virtualization, you have pets. You care what happens to these VMs. They have HA and DRS and other technologies to care for them. They’re given clever names, like Bart and Lisa, or Happy and Sleepy. With cloud VMs, they’re not given fun names. We don’t do vMotion/Live Migration with them. When we need them, they’re spun up. When they’re not, they’re destroyed. We don’t back them up, we don’t care if the host they reside on dies so long as there are other hosts carrying the workload. The workload is automatically sharded across the available hosts using logic in the application. Instead of backups, templates are used to create new VMs when the workload increases. And when the workload decreases, some of the VMs get destroyed. State is not kept on any single VM, instead the state of the application (and underlying database) is sharded to the available systems.

This is very different from traditional virtualization. Because workload distribution is handled by the application, we don’t need vMotion and thus don’t need Layer 2 adjacencies. That gives network architects much more flexibility in putting together a network to support this type of workload. Storage for this type of workload also tends to be IP-based (NFS, iSCSI) rather than FC-based (native Fibre Channel or FCoE).

With cloud-based workloads, there’s also a huge self-service component. VMs are spun up and managed by developers or end-users, rather than by the IT staff. There’s typically some type of portal that end-users can use to spin resources up and down. Chargebacks are also a component, so that even in a private cloud setting there’s a resource cost associated with usage that can be tracked.

OpenStack is a popular choice for these cloud workloads, as are Amazon and Windows Azure. The former is a private cloud, with the latter two being public clouds.

  • Latency requirements are mostly the same as traditional virtualization
  • Because vMotion isn’t required, it’s all Layer 3, all the time
  • Storage is mostly IP-based, running on the same network infrastructure (not as much Fibre Channel)
  • Buffer requirements are typically the same as traditional virtualization
  • VXLAN/NVGRE burned into the chips for SDN/Overlays

You can use much cheaper switches for this type of network, since the advanced Layer 2 features (OTV, FabricPath, SPB/TRILL, VCS) aren’t needed. You can build a very simple Layer 3 mesh using inexpensive, lower-power 10/40/100 Gbit ports.
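
As a sketch of what that simple Layer 3 design boils down to (the port counts here are hypothetical, not any specific switch), the main knob is the oversubscription ratio per leaf:

    # Hypothetical leaf: 48 x 10 GbE host-facing ports, 6 x 40 GbE routed
    # uplinks spread across the spines, with ECMP paths between leaves.
    host_gbps = 48 * 10    # bandwidth facing the servers
    uplink_gbps = 6 * 40   # bandwidth facing the spines

    print(f"Oversubscription: {host_gbps / uplink_gbps:.1f}:1")   # 2.0:1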

However, features such as VXLAN/NVGRE encap/decap are increasingly important. The new Trident2 chip from Broadcom supports this now, and several vendors, including Cisco, Juniper, and Arista, have switches based on this new SoC (switch-on-a-chip) from Broadcom.
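
For reference, the encapsulation these chips handle in hardware is simple. Here’s a minimal sketch of the 8-byte VXLAN header from RFC 7348 and the per-frame overhead it adds (purely illustrative; it says nothing about how any particular ASIC implements it):

    import struct

    # Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes
    VXLAN_OVERHEAD_BYTES = 14 + 20 + 8 + 8

    def vxlan_header(vni: int) -> bytes:
        """Build the 8-byte VXLAN header (RFC 7348).

        Byte 0:    flags (0x08 = VNI present)
        Bytes 1-3: reserved
        Bytes 4-6: 24-bit VNI
        Byte 7:    reserved
        """
        return struct.pack("!B3xI", 0x08, (vni & 0xFFFFFF) << 8)

    print(vxlan_header(5001).hex())   # 0800000000138900
    print(f"{VXLAN_OVERHEAD_BYTES} bytes of encapsulation overhead per frame")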

High Frequency Trading

This is a very specialized market, and one that has very specialized requirements.

  • Latency is of the utmost concern, to the point of making sure ports are on the same ASIC. Latency is measured in nanoseconds; microseconds are an eternity
  • 10 Gbit at the very least
  • Money is typically not a concern
  • Over-subscription is non-existent (again, money is no concern)
  • Buffers are a trade-off: they can increase latency but also prevent packet loss

This is a very niche market, one that Arista dominates. Cisco and a few other vendors have made small inroads here, using the same merchant silicon that Arista uses; however, Arista has much more experience in this market. Every tick of the clock can mean hundreds of thousands of dollars in a single trade, so companies have no problem throwing huge amounts of money at this issue to shave every last nanosecond off of latency.

Hadoop/Big Data

  • Latency is of high concern
  • Large buffers are critical
  • Over-subscription is low
  • Layer 2 adjacency is neither required nor desired
  • Layer 3 Leaf/spine networks
  • Storage is distributed, sharded over IP

Arista has also been extremely successful in this market. They glue PC RAM onto their switch boards to provide huge buffers (around 760 MB) per port, so the switch can absorb quite a bit of the bursty traffic that occurs a lot in these types of setups. That’s about 0.6 seconds of buffering on a 10 Gbit link. Huge buffers will not prevent congestion, but they do help absorb situations where you might be overwhelmed for a short period of time.
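
Using the figures above, the arithmetic is easy to sanity-check:

    # How long a ~760 MB per-port buffer can absorb a fully congested 10 Gbit link.
    buffer_bytes = 760 * 10**6   # ~760 MB, per the figure above
    link_bps = 10 * 10**9        # 10 Gbit/s

    print(f"{buffer_bytes * 8 / link_bps:.2f} s of buffering")   # ~0.61 s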

Since nodes don’t need to be Layer 2 adjacent, simple Layer 3 ECMP networks can be built using inexpensive, basic switches. You don’t need features like FabricPath, TRILL, SPB, or OTV, just fast, inexpensive, low-power ports. 10 Gigabit is the bare minimum for these networks, with 40 and 100 Gbit used for connectivity to the spines. Arista (especially with their 7500E platform) does very well here. Cisco is moving into this area with the Nexus 9000 line, which was announced late last year.

 

Conclusions

Understanding the requirements of the various workloads may help you determine the right switches for you. It’s interesting to see how quickly the market is changing. Perhaps two years ago, large Layer 2 networks seemed like the immediate future. Then all of a sudden Layer 3 mesh networks became popular again. And you’ve got SDN like VMware’s NSX and Cisco’s ACI on top of that. Interesting times, man. Interesting times.

5 Responses to Changing Data Center Workloads

  1. stimit says:

    So what is required of the merchant silicon vendors in their next iteration of chips? Just improved port speeds, buffers?

    • tonybourke says:

      Better buffers, for one. The Trident2 doesn’t have that great buffer space. More features like VXLAN termination. VXLAN routing isn’t in the Trident2 natively; you can either use another chip (as Cisco does) or hairpin the Trident2 (which limits the throughput to 10 or 40 Gbit). It takes about 2 years to get features baked into merchant silicon, I believe.

      • Some comments:

        – The need for buffering is driven by disparate link speeds between devices (40G storage talking to a 10G host) or incast (multiple flows egressing a single link). This can manifest itself in all of the scenarios you mentioned above and is not limited to one or two scenarios as implied.
        – If you want larger buffers there are options (e.g. Arad from Broadcom)
        – VXLAN is native to Trident2 as well as several other ‘merchant’ ASICs
        – Merchant silicon is on pace with Moore’s Law (transistor density doubles every 24 months); custom ASIC development is dramatically longer. It’s not a matter of if your future products will use merchant silicon, it’s a matter of when.

  2. Tony, why in your ‘traditional virtualization’ bucket do you state the following? (Adding my comments inline after each.)

    Latency is not a huge concern (30 usecs not a big deal) –

    Agree

    Layer 2 adjacencies are mandatory (required for vMotion) –

    Disagree. According to VMware, most current vMotion is not stateful, meaning the IP is not preserved and open sockets are disconnected during the migration. If this is the case, why do you state it is mandatory?

    Large Layer 2 domains (thousands of hosts layer 2 adjacent).

    The maximum number of ESX hosts supported in vSphere 5.x is 1,000, and the average deployment often tops out at around 250 hosts. Why would I need ‘thousands of hosts Layer-2 adjacent’?

    Converged infrastructures (storage and data running on the same wires, FCoE, iSCSI, NFS, SMB3, etc.).

    Required? No. A better option, especially for storage that can actually traverse a routed boundary – sure. iSCSI, NFS, and SMB get good play here. As does any global file system, frankly.

    Buffer requirements aren’t typically super high. Bursting isn’t much of an issue for most workloads of this type.

    Disagree – have a speed mismatch? What about the storage farm you just mentioned above? Backups? Or the vMotion you mentioned above? That can drive 8-30 Gb/s depending on core allocation in an Ivy Bridge-based system. Buffers are for more than just a single bursty app; they are useful in any scenario where you have incast (storage and backup being most common), concurrency (a VDI boot storm), or a speed mismatch (10 Gb host, 40 Gb uplink)

    Fibre Channel is often the storage protocol of choice, along with NFS and some iSCSI as well.

    Yes, either block-based on FC or IP-based on NFS/etc. is quite common.

    • tonybourke says:

      Hi Doug,

      | Disagree. According to VMware most of the current vMotion is not stateful, meaning the IP is
      | not preserved and open sockets are disconnected during the migration. If this is the case
      | why do you state it is mandatory?

      That’s definitely not true. The whole point of vMotion (or Live Migration, or whatever the platform’s term for it is) is to migrate a VM from one virtualization host to another without disturbing operations on the VM. This requires that the VM have access to the same Layer 2 domains on the hypervisor host it left and the hypervisor host it arrived on. There are cases I’ve heard of in vSphere where the (separate) VMkernel interfaces that transmit/receive vMotions are on separate Layer 3 domains, but that doesn’t negate the requirement that the virtual switch/dVS on each host have access to the same Layer 2 domains.

      The IP of the VM is preserved, and open sockets are not disconnected when using VMware vSphere, at least in the context of traditional virtualization. There are operational models where the application redundancy is built into the app stack rather than the network/hypervisor, and in those cases vMotion isn’t used. But that’s not traditional virtualization.

      | Large Layer 2 domains (thousands of hosts layer 2 adjacent).
      | The maximum number of ESX hosts supported in vSphere 5.x is 1000. The average
      | deployment often tops around 250 hosts. Why would I need ‘thousands of hosts Layer-2
      | adjacent’?

      When I’m talking hosts here, I’m talking VMs as well. Whether it’s a physical host or a VM, it’s all the same to a TCAM/CAM 🙂

      You can easily have hundreds to thousands of VMs and physical hosts on a network, and there are other features that may eat up TCAM space (ACLs, IPv6 taking up 4x the space, etc.) I think you’ll agree that TCAM space is important.

      In a routed Layer 3 network, or with TRILL/SPB-based network (Brocade VCS, Cisco FabricPath), the number of hosts each network switch needs to keep track of is usually a lot lower.

      | Disagree – have a speed mismatch? What about the storage farm you just mentioned
      | above? Backups? Or using the vMotion you mentioned above? That can drive 8-30Gb/s
      | depending on core allocation in an Ivy Bridge based system. Buffers are for more than just a
      | single bursty app, they are useful in any scenario you have incast (storage and backup being
      | most common) or concurrency (VDI boot storm), or speed mismatch (10Gb Host, 40Gb
      | uplink)

      A buffer will absorb only so much of a burst, and after that other Ethernet congestion mechanisms still need to take place (drop it or pause it). The Arista 7500E line cards have impressively large off-chip buffers, but I believe the buffer depth (if I’m not mistaken) is ~125 milliseconds, so about a tenth of a second before it’s overwhelmed. After that, then what? Drop it or pause it.

      If my late-night math is correct, vMotioning an 8 GB VM will take about 7 seconds over a 10 Gbit link, assuming it has access to the entire 10 Gbit link. And if two hosts, each operating at 10 Gbit/s, are trying to vMotion to the same host, also at 10 Gbit/s, those large buffers would help for only about half of that tenth of a second.
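
      For the curious, that late-night math is easy to reproduce; this rough sketch ignores protocol overhead and memory pages re-dirtied during the copy:

          # Time to move an 8 GB VM's memory over a dedicated 10 Gbit/s link,
          # ignoring TCP/vMotion overhead and re-dirtied pages.
          vm_memory_bytes = 8 * 2**30   # 8 GiB of VM memory
          link_bps = 10 * 10**9         # 10 Gbit/s

          print(f"~{vm_memory_bytes * 8 / link_bps:.1f} s to transfer")   # ~6.9 s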

      Buffers won’t magically create more bandwidth. If you’re trying to send 11 Gbit/s of traffic to a 10 Gbit interface, something still has to give. Buffers can abate some of that, but not all of it. In the case of vMotion, buffers are almost irrelevant.

      If there’s congestion, then it needs to be accepted or addressed, either with pausing (with QoS to try to prevent HoL blocking), with QoS (who lives and who dies!), or (preferably) by adding more bandwidth.

      -Tony
