June 12, 2013 Leave a comment
April 18, 2013 4 Comments
In case you haven’t heard, Brocade is rebranding their 16 Gbit Fibre Channel offerings as Generation 5 Fibre Channel. Upcoming 32 Gbit Fibre Channel will also be called “Gen 6 Fibre Channel”. Seriously.
Brocade is trying to de-emphasize speed as the primary differentiator to a specific Fibre Channel technology, which is weird, since that’s by far the primary differentiator between the generations. This strategy has two major flaws as I see it:
Flaw #1: They’re trying to make it look like you can solve a problem that you really can’t with 16 Gbit FC. Whether you emphasize speed or other technological aspects of 16 Gbit Fibre Channel, 16 Gbit/Gen 5 isn’t going to solve any of the major problems that currently exist in the data center or storage for that matter, at least for the vast majority of Fibre Channel installations. Virtualization workloads, databases, and especially VDI are thrashing our storage systems. However, generally speaking (always exceptions) we’re not saturating the physical links. Not on the storage array links, not on the ISLs, and definitely not on the server FC links. The primary issue we face in the data center is a limitation in IOPS.
The latency differences between Fibre Channel speeds are insignificant compared to the latencies introduced by overwhelmed storage arrays
Or, no wait, guys. Guys. Guys. Check out the… choke point.
16 Gbit can give us more throughput, but so can aggregating more 8 Gbit links, especially since single flows/transactions/file operations aren’t likely to eat up more than 8 Gbit (or even a fraction of that). There’s a lower serialization delay and lower latency associated with 16 Gbit, but that’s minuscule compared to the latencies introduced by storage systems. The vast majority of workloads aren’t likely to see significant benefit moving to 16 Gbit. So for right now, those in the storage world are concentrating on the arrays, and not the fabric. And that’s where they should be concentrating.
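To put rough numbers on "minuscule", here's a back-of-the-envelope sketch. The figures are my assumptions, not measurements: a full-size 2148-byte FC frame, nominal throughput of 800 MB/s for 8 Gbit and 1600 MB/s for 16 Gbit, and a hypothetical busy array taking 5 ms to answer.

```python
# Assumed figures, not from any vendor datasheet: a full-size FC frame of
# 2148 bytes, nominal per-direction throughput of 800 MB/s (8 Gbit FC) and
# 1600 MB/s (16 Gbit FC), and a busy array answering in 5 ms.
FRAME_BYTES = 2148
ARRAY_LATENCY_S = 5e-3  # hypothetical overloaded-array response time


def serialization_delay_s(frame_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to clock one frame onto the wire."""
    return frame_bytes / throughput_bytes_per_s


delay_8g = serialization_delay_s(FRAME_BYTES, 800e6)
delay_16g = serialization_delay_s(FRAME_BYTES, 1600e6)
saved = delay_8g - delay_16g

print(f"8 Gbit:  {delay_8g * 1e6:.2f} us per frame")
print(f"16 Gbit: {delay_16g * 1e6:.2f} us per frame")
print(f"Saved:   {saved * 1e6:.2f} us, or {saved / ARRAY_LATENCY_S:.4%} of the assumed array response")
```

Under those assumptions the saving per frame is on the order of a microsecond, against an array response measured in milliseconds.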
From one of Brocade’s posts, they mention this of Gen5 Fibre Channel:
“It’s about the innovative technology and unique capabilities that solve customer challenges.”
Fibre Channel is great. And Brocade has a great Fibre Channel offering. For the most part, better than Cisco. But there isn’t any innovation in this generation of Fibre Channel other than the speed increase. I’m kind of surprised Brocade didn’t call it something like “CloudFC”. This reeks of cloud washing, without the use of the word cloud. I mean, it’s Fibre Channel. It’s reliable, it’s simple to implement, best practices are easily understood, and it’s not terribly sexy, and calling it Gen5 isn’t going to change any of that.
Flaw #2: It creates market confusion.
Cisco doesn’t have any 16 Gbit Fibre Channel offerings (they’re pushing for FCoE, which is another issue). And when they do get 16 Gbit, they’re probably not going to call it Gen 5. Nor are most of the other Fibre Channel vendors, such as Emulex, QLogic, NetApp, EMC, etc. HP and Dell have somewhat gone with it, but they kind of have to since they sell re-branded Brocade kit (it’s worth noting that even HP’s material is peppered with the words “16 Gbit”). So having another term is going to cause a lot of unnecessary conversations.
Here’s how I suspect a lot of Brocade conversations with new and existing customers will go:
“We recommend our Gen 5 products”
“What’s Gen 5?”
“It’s 16 Gbit Fibre Channel”
“OK, why didn’t you just say that?”
This is what’s happened in the load balancing world. A little over six years ago, Gartner and marketing departments tried to rename load balancers to “Application Delivery Controllers”, or ADCs for short. No one outside of marketing knows what the hell an ADC is. But anyone who’s worked in a data center knows what a load balancer is. They’re the same thing, and I’ve had to have a lot of unnecessary conversations since. Because of this, I’m particularly sensitive to changing the name of something that everyone already knows for no good frickin’ reason.
Where does that leave Fibre Channel? For the challenges that most organizations are facing in the data center, an upgrade to 16 Gbit FC would be a waste of money. Sure, if given the choice between 8 and 16 Gbit FC, I’d pick 16. But there’s no compelling reason for the vast majority of existing workloads to convert to 16 Gbit FC. It just doesn’t solve any of the problems that we’re having. If you’re building a new fabric, then yes, absolutely look at 16 Gbit. It’s better to have more than to have less of course, but the benefits of 16 Gbit probably won’t be felt for a few years in terms of throughput. It’s just not a pain point right now, but it will be in the future.
In fact, looking at most of the offerings from the various storage vendors (EMC, NetApp, etc.), they’re mostly content to continue to offer 8 Gbit as their maximum speed. The same goes for server vendors (though there are 16 Gbit HBAs now available). I teach Cisco UCS, and most Cisco UCS installations plug into Brocade fabrics. Cisco UCS Fibre Channel ports only operate at a maximum of 8 Gbit, and I’ve never heard a complaint regarding the lack of 16 Gbit, especially since you can use multiple 8 Gbit uplinks to scale connectivity.
February 2, 2013 3 Comments
Congestion happens. You try to put a 10 pound (soy-based vegan) ham in a 5 pound bag, it just ain’t gonna work. And in the topsy-turvy world of data center switches, what do we do to mitigate congestion? Most of the time, the answer can be found in the wisdom of Snoop Dogg/Lion.
Of course, when things are fine, the world of Ethernet is live and let live.
We’re fine. We’re all fine here now, thank you. How are you?
But when push comes to shove, frames get dropped. Either the buffer fills up and tail drop occurs, or QoS is configured and something like WRED (Weighted Random Early Detection) kicks in to proactively drop frames before tail drop can occur (mostly to keep TCP’s behavior from causing spiky traffic).
The Bit Grim Reaper is way better than leaky buckets
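For the curious, here's roughly how the WRED idea works, as a minimal sketch rather than any vendor's implementation. The thresholds, probabilities, and class names are made up for illustration.

```python
import random


def wred_drop_probability(avg_queue_depth: float, min_threshold: float,
                          max_threshold: float, max_drop_probability: float) -> float:
    """Drop probability for one traffic class, RED-style.

    Below min_threshold nothing is dropped; between the thresholds the
    probability ramps linearly up to max_drop_probability; at or above
    max_threshold every frame is dropped (which is just tail drop).
    """
    if avg_queue_depth < min_threshold:
        return 0.0
    if avg_queue_depth >= max_threshold:
        return 1.0
    ramp = (avg_queue_depth - min_threshold) / (max_threshold - min_threshold)
    return ramp * max_drop_probability


def should_drop(avg_queue_depth, profile, rng=random):
    """profile is a (min_threshold, max_threshold, max_drop_probability) tuple."""
    return rng.random() < wred_drop_probability(avg_queue_depth, *profile)


# Hypothetical per-class profiles: the "weighted" part is that lower
# priority traffic starts getting dropped earlier and more aggressively.
PROFILES = {
    "bulk": (20, 40, 0.10),
    "voice": (35, 40, 0.02),
}

for depth in (10, 30, 38, 40):
    probs = {name: round(wred_drop_probability(depth, *p), 3) for name, p in PROFILES.items()}
    print(depth, probs)
```

The point of dropping a few frames early and at random is that individual TCP flows back off at different times, instead of all of them slamming on the brakes together when the buffer finally overflows.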
Most congestion remediation methods involve one or more types of dropping frames. The various protocols running on top of Ethernet such as IP, TCP/UDP, as well as higher level protocols, were written with this lossy nature in mind. Protocols like TCP have retransmission and flow control, and higher level protocols that employ UDP (such as voice) have other ways of dealing with it when the plumbing gets stopped up. But dropping it like it’s hot isn’t the only way to handle congestion in Ethernet:
Please Hammer, Don’t PAUSE ‘Em
Ethernet has the ability to employ flow control on physical interfaces, so that when congestion is about to occur, the receiving port can signal to the sending port to stop sending for a period of time. This is referred to simply as 802.3x Ethernet flow control, or as I like to call it, old-timey flow control, as it’s been in Ethernet since about 1997. When a receive buffer is close to being full, the receiving side will send a PAUSE frame to the sending side.
Too legit to drop
A wide variety of Ethernet devices support old-timey flow control, everything from data center switches to the USB dongle for my MacBook Air.
One of the drawbacks of old-timey flow control is that it pauses all traffic, regardless of any QoS considerations. This creates a condition referred to as HoL (Head of Line) blocking, and can cause higher priority (and latency sensitive) traffic to get delayed on account of lower priority traffic. To address this, a new type of flow control was created called 802.1Qbb PFC (Priority Flow Control).
PFC allows a receiving port to send PAUSE frames that only affect specific CoS lanes (0 through 7). Part of the 802.1Q standard is a 3-bit field that represents the Class of Service, giving us a total of 8 classes of service, though two are traditionally reserved for control plane traffic so we have six to play with (which, by the way, is a lot simpler than the 6-bit DSCP field in IP). Utilizing PFC, some CoS values can be made lossless, while others are lossy.
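Here's a toy model of the difference between the two pause styles. It's a sketch under my own assumptions (one frame counter per CoS value, a single made-up threshold, CoS 3 picked arbitrarily as the lossless class), not how any real switch ASIC accounts for buffers.

```python
from dataclasses import dataclass, field

NUM_PRIORITIES = 8  # the 3-bit 802.1Q CoS field gives values 0-7


@dataclass
class ReceivePort:
    """Toy receive side with one buffer counter per CoS value."""
    pause_threshold: int  # frames queued before we ask for a pause
    lossless_priorities: frozenset = frozenset()
    queued: list = field(default_factory=lambda: [0] * NUM_PRIORITIES)

    def enqueue(self, priority: int) -> None:
        self.queued[priority] += 1

    def pfc_paused_priorities(self) -> set:
        """PFC: pause only the lossless classes that are near full."""
        return {p for p in self.lossless_priorities
                if self.queued[p] >= self.pause_threshold}

    def legacy_pause(self) -> bool:
        """802.3x: one signal for the whole link, whatever the class."""
        return sum(self.queued) >= self.pause_threshold


port = ReceivePort(pause_threshold=4, lossless_priorities=frozenset({3}))
for _ in range(4):
    port.enqueue(3)   # storage-like traffic on a hypothetical CoS 3
port.enqueue(5)       # latency-sensitive traffic on CoS 5

print(port.pfc_paused_priorities())  # only the congested lossless class
print(port.legacy_pause())           # the whole link would be paused
```

With old-timey flow control the CoS 5 traffic gets stopped along with everything else; with PFC only the congested lossless lane is paused.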
Why would you want to pause traffic instead of drop traffic when congestion occurs?
Much of the IP traffic that traverses our data centers is OK with a bit of loss. It’s expected. Any protocol will have its performance degraded if packet loss is severe, but most traffic can take a bit of loss. And it’s not like pausing traffic will magically make congestion go away.
But there is some traffic that can benefit from losslessness, and some that just flat out requires it. FCoE (Fibre Channel over Ethernet), a favorite topic of mine, requires losslessness to operate. Fibre Channel is inherently a lossless protocol (by use of B2B or Buffer to Buffer credits), since the primary payload for a FC frame is SCSI. SCSI does not handle loss very well, so FC was engineered to be lossless. As such, priority flow control is one of the (several) requirements for a switch to be able to forward FCoE frames.
iSCSI is also a protocol that can benefit from pause congestion handling rather than dropping. Instead of encapsulating SCSI into FC frames, iSCSI encapsulates SCSI into TCP segments. This means that if a TCP segment is lost, it will be retransmitted. So at first glance it would seem that iSCSI can handle loss fine.
But from a performance perspective, TCP suffers mightily when a segment is lost because of TCP congestion management techniques. When a segment is lost, TCP backs off on its transmission rate (specifically the number of segments in flight without acknowledgement), and then ramps back up again. By making the iSCSI traffic lossless, packets will be slowed down during congestion but the TCP congestion algorithm wouldn’t be triggered. As a result, many iSCSI vendors recommend turning on old-timey flow control to keep packet loss to a minimum.
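To see why a dropped segment hurts so much, here's a deliberately simplified congestion window simulation. It's a sketch, not a faithful model of any real TCP stack: the window sizes, the slow-start threshold, and the rounds where loss occurs are all assumptions I picked for illustration.

```python
def simulate_cwnd(round_trips: int, loss_rounds: set, initial_cwnd: int = 1,
                  ssthresh: int = 64) -> list:
    """Toy congestion window trace, in segments per round trip.

    Grows exponentially below ssthresh (slow start), by one segment per
    round trip above it, and halves whenever a loss is detected.
    """
    cwnd = initial_cwnd
    trace = []
    for rtt in range(round_trips):
        trace.append(cwnd)
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh            # back off after the loss
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)
        else:
            cwnd += 1
    return trace


lossless = simulate_cwnd(30, loss_rounds=set())
lossy = simulate_cwnd(30, loss_rounds={10, 20})
print("segments sent, no loss:  ", sum(lossless))
print("segments sent, two drops:", sum(lossy))
```

Each drop cuts the window in half, and the climb back is only one segment per round trip, so a couple of losses cost far more than the two retransmitted segments themselves.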
However, many switches today can’t actually do full losslessness. Take the venerable Catalyst 6500. It’s a switch that would be very common in data centers, and it is a frame murdering machine.
The problem is that while the Catalyst 6500 supports old-timey flow control (it doesn’t support PFC) on physical ports, there’s no mechanism that I’m aware of to prevent buffer overruns from one port to another inside the switch. Take the example of two ingress Gigabit Ethernet ports sending traffic to a single egress Gigabit Ethernet port. Both ingress ports are running at line rate. There’s no signaling (at least that I’m aware of, could be wrong) that would prevent the ingress ports from overwhelming the transmit buffer of the egress port.
Many frames enter, not all leave
This is like flying to Hawaii and not reserving a hotel room before you get on the plane. You could land and have no place to stay. Because there’s no way to ensure losslessness on a Catalyst 6500 (or many other types of switches from various vendors), the Catalyst 6500 is like Thunderdome. Many frames enter, not all leave.
Catalyst 6500 shown with a Sup2T
The new generation of DCB (Data Center Bridging) switches, however, use a concept known as VoQ (Virtual Output Queues). With VoQs, the ingress port will not send a frame to the egress port unless there’s room. If there isn’t room, the frame will stay in the ingress buffer until there’s room. If the ingress buffer is full, it can signal the sending port it’s connected to to PAUSE (either old-timey pause or PFC).
This is a technique that’s been in use in Fibre Channel switches from both Brocade and Cisco (as well as others) for a while now, and is now making its way into DCB Ethernet switches from various vendors. Cisco’s Nexus line, for example, makes use of VoQs, and so do Brocade’s VCS switches. Some type of lossless ability between internal ports is required in order to be a DCB switch, since FCoE requires losslessness.
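The VoQ behavior described above can be sketched in a few lines. This is a toy model with made-up buffer sizes and a trivially simple scheduler, not a description of how Nexus or VCS hardware actually arbitrates.

```python
from collections import deque


class VoqSwitch:
    """Toy switch: each ingress port keeps one queue per egress port."""

    def __init__(self, ports: int, egress_capacity: int, ingress_capacity: int):
        self.egress_capacity = egress_capacity
        self.ingress_capacity = ingress_capacity
        self.egress = [deque() for _ in range(ports)]
        # voq[i][e] holds frames that arrived on port i and are bound for port e
        self.voq = [[deque() for _ in range(ports)] for _ in range(ports)]

    def receive(self, ingress: int, egress: int, frame: str) -> bool:
        """Accept a frame, or return False to mean 'send PAUSE upstream'."""
        if sum(len(q) for q in self.voq[ingress]) >= self.ingress_capacity:
            return False
        self.voq[ingress][egress].append(frame)
        return True

    def schedule(self) -> None:
        """Move frames across the fabric only where the egress has room."""
        for per_egress in self.voq:
            for egress, queue in enumerate(per_egress):
                while queue and len(self.egress[egress]) < self.egress_capacity:
                    self.egress[egress].append(queue.popleft())

    def transmit(self, egress: int):
        """Send one frame out of an egress port, freeing a buffer slot."""
        return self.egress[egress].popleft() if self.egress[egress] else None


sw = VoqSwitch(ports=3, egress_capacity=2, ingress_capacity=2)
accepted = [sw.receive(0, 2, f"a{i}") for i in range(3)]  # third one triggers PAUSE
sw.receive(1, 2, "b0")
sw.schedule()            # egress 2 only has room for two frames
first_out = sw.transmit(2)
sw.schedule()            # now the waiting frame can cross
print(accepted, first_out, list(sw.egress[2]))
```

Nothing is ever dropped here: a frame either sits in the ingress queue until the egress has room, or the sender is told to wait.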
DCB switches require lossless backplanes/internal fabrics, support for PFC, ETS (Enhanced Transmission Selection, a way to reserve bandwidth on various CoS lanes), and DCBx (a way to communicate these capabilities to adjacent switches). This makes them capable of a lot of cool stuff that non-DCB switches can’t do, such as losslessness.
One thing to keep in mind, however, is when Layer 3 comes into play. My guess is that even in a DCB switch that can do Layer 3, losslessness can’t be extended beyond a Layer 2 boundary. That’s not an issue with FCoE, since it’s only Layer 2, but iSCSI can be routed.
November 1, 2012 2 Comments
Ever since I first had a device boot via SSD, I’ve been a huge fan and proponent. I often say SSDs enjoy the Charlton Heston effect: “You’ll pull my SSD out of my cold, dead hands.”
They’re just absolutely fantastic for desktop operating systems. Nothing you can do will make your desktop or laptop respond faster than adding an SSD for boot/applications. Even a system a couple years old with an SSD will absolutely run circles around a brand new system that’s still rocking the HDD.
And the prices? The prices are dropping faster than American Airlines’ reputation. Currently you can get great SSDs for less than $1 per gig. Right now the sweet spot is a 256 GB SSD, though the 480/512 GB are coming down as well.
Desktop operating systems are very I/O intensive, especially with respect to IOPS, and that’s where SSDs shine. Your average 5400 RPM laptop drive gives about 60 IOPS, while a decent SSD gives you about 20,000 (more for reads). So unless you’re going to strap 300+ drives to your laptop (man your battery life would suck), you’re not going to get the same performance as you would on an SSD. Not even close. And it doesn’t matter if you’re SATA 2 or SATA 3 on your motherboard (or even SATA 1), the SSD’s primary benefit of super-IOPs won’t be restricted by SATA bandwidth.
So right now there are two primary drawbacks: Costs a bit more and the storage is less than you would get with a HDD. But boy, do you get the IOPS.
However, lately I’ve heard a few people express hesitance (and even scorn) towards SSD. “When you have an SSD go tits up, then you’ll wish you had a hard drive” is something I’ve heard recently.
Three of the biggest issues I see are:
1: Fear of running out of writes: SSDs have a limited write lifespan. Each cell can only be written to a number of times, and when that limit is reached, the cell is read-only. Modern SSD controllers do tricks like wear leveling, but the limit is still there.
2: Data retrieval: If the SSD fails, there are no methods for retrieving data. There are lots of ways you can attempt to recover data from a failed disk of spinning rust (though nothing guaranteed), but no such options exist for SSDs that I’m aware of.
3: SSDs lie: SSDs do lie to you. They tell you that you wrote to a particular block that doesn’t actually correspond to a physical cell like it would a sector/track on a physical drive. This is because SSDs do wear-leveling, to ensure the longest possible lifespan of the SSD. Otherwise the blocks where the swap is stored would wear out far quicker than the rest of the drive. Our file systems (NTFS, Ext4, even ZFS) were all built on the abilities and limitations of spinning rust, and haven’t caught up to flash memory. As a result, the SSD controller has to lie to us, and pretend it’s a spinning disk.
Here’s a few things to keep in mind.
1: Yes, SSDs have a limited lifespan. The Crucial M4 has a limited write life of 36 TB, which is 20 GB a day for five years. You probably don’t write that much data to your SSD every day. And the worst that happens when your drive reaches the limit is that it becomes read-only. I don’t trust HDDs that are older than 4 or 5 years anyway.
2: True, if your SSD fails, there’s little chance of recovery (while there’s some chance of recovery if it’s a HDD). This highlights the need for a decent backup mechanism. Don’t let the chance that you could retrieve data from a HDD be your backup plan.
3: Yes, SSDs lie. So do HDDs.
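The arithmetic behind the first point is easy to check for your own write habits. A quick sketch, assuming decimal units (1 TB = 1000 GB) and a steady daily write rate, neither of which is guaranteed to match how a drive is actually rated:

```python
def years_until_worn_out(endurance_tb: float, writes_gb_per_day: float) -> float:
    """Years of writing at a steady daily rate before hitting the rated limit."""
    # Decimal units (1 TB = 1000 GB), which is an assumption about the rating.
    total_gb = endurance_tb * 1000
    return total_gb / writes_gb_per_day / 365


for daily_gb in (5, 20, 50):
    print(f"{daily_gb:>3} GB/day -> {years_until_worn_out(36, daily_gb):.1f} years")
```

The 5 and 50 GB/day rows are just hypothetical lighter and heavier users for comparison.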
I still use HDDs for media storage, backups, and archival. But apps and OS, that’s definitely going to sit on an SSD from now on. It’s just too awesome. And if that means I have to swap them out every 5 years? I’m fine with that.
May 21, 2012 2 Comments
In a recent post, I took a look at the Fibre Channel subjects of NPIV and NPV, both topics covered in the CCIE Data Center written exam (currently in beta, take yours now, $50!). The post generated a lot of comments. I mean, a lot. Over 50 so far (and still going). An epic battle (although very unInternet-like in that it was very civil and respectful) brewed over how Fibre Channel compares to Ethernet/IP. The comments look like the aftermath of the battle of Wolf 359.
Captain, the analogy regarding squirrels and time travel didn’t survive
One camp, led by Erik Smith from EMC (who co-wrote the best Fibre Channel book I’ve seen so far, and it’s free), compares the WWPNs to IP addresses, and FCIDs to MAC addresses. Some others, such as Ivan Pepelnjak and myself, compare WWPNs to MAC addresses, and FCIDs to IP addresses. There were many points and counter-points. Valid arguments were made supporting each position. Eventually, people agreed to disagree. So which one is right? They both are.
Wait, what? Two sides can’t be right, not on the Internet!
When comparing Fibre Channel to Ethernet/IP, it’s important to remember that they are different. In fact, significantly different. The only reason to relate Fibre Channel to Ethernet/IP is to introduce those who are familiar with Ethernet/IP to the world of Fibre Channel. Many (most? all?) people learn by building associations from known subjects (in our case Ethernet/IP) to lesser known subjects (in this case Fibre Channel).
Of course, any association includes its inherent inaccuracies. We purposefully sacrifice some accuracy in order to attain relatability. Specific details and inaccuracies are glossed over. To some, introducing any inaccuracy is sacrilege. To me, it’s being overly pedantic. Pedantic details are for the expert level. Using pedantic facts as an admonishment of an analogy misses the point entirely. With any analogy, there will always be inaccuracies, and there will always be many analogies to be made.
Personally, I still prefer the WWPN ~= MAC/FC_ID ~= IP approach, and will continue to use it when I teach. But the other approach I believe is completely valid as well. At that point, it’s just a matter of preference. Both roads lead to the same destination, and that is what’s really important.
Learning always happens in layers. Coat after coat is applied, increasing in accuracy and pedantic details as you go along. Analogies are a very useful and effective tool for learning any subject.
February 15, 2012 2 Comments
When building any standalone server (a server without a SAN or NAS for storage), one of the considerations is how to handle storage. This typically includes a conversation about RAID, and making sure the local storage has some protection.
With ESXi, this is a bit trickier than most operating systems, since ESXi doesn’t do software RAID like you can get with Linux or Windows, nor does it support the motherboard BIOS RAID you get with most motherboards these days (which isn’t hardware RAID, just another version of software RAID).
So if you want to RAID out your standalone ESXi box, you’re going to need to purchase a supported hardware RAID card. These cards aren’t the $40 ones on Newegg, either. They tend to be a few hundred bucks (to a few thousands, depending).
Most people who are serious about building a serious ESXi server dig around and try to find a RAID card that will work, either buying new, scrounging for parts, or hitting up eBay.
My suggestion, if you’re looking to put a RAID card in your standalone ESXi host, is to consider this:
Are you sure you need a RAID card?
The two primary reasons people do RAID are data integrity (lose a drive, etc.) and performance.
As far as data integrity goes, I find people tend to make the same mistake I used to: They put too much faith in RAID arrays as a method to keep data safe. One of the most important lessons I’ve ever learned in storage is that RAID is not a backup. It’s worth saying again:
RAID Is Not A Backup
I’ve yet to have RAID save my soy bacon, and in fact in my case it’s caused more problems than it’s solved. However, I’ve been saved many times by a good backup. My favorite form of backup that doesn’t involve a robot? A portable USB drive. They’re high capacity, they don’t require a DC power brick, and they’re easily stored.
Another reason to do RAID is performance. Traditional HDDs are, well, slow. They’re hampered by the fact they are physical devices. By combining multiple drives in a RAID configuration, you can get a higher number of IOPS (and throughput, but for virtual machines that’s typically not as important).
More drives, more IOPS.
A good hardware RAID card will also have a battery-backed up RAM cache, which while stupid fast, only works if you actually hit the cache.
But here’s the thing: If you need performance, you’re going to need a lot of hard drives. Like, a lot. Remember that SNL commercial from years ago? How many bowls of your regular bran cereal does it take to equal one bowl of Colon Blow Cereal? I’ve got an SSD that claims 80,000 IOPS. Assuming I get half that, I’d need about 500 hard drives in a RAID 0 array to get the same number of IOPS. And that’s without any redundancy. That’s a lot of PERC cards and a lot of drives.
So want performance? Why not ditch the PERC and spend that money on an SSD? Of course, SSDs aren’t as cheap as traditional HDDs on a per gigabyte basis, so you’ll just want to put the virtual disks that can really benefit from it on the SSD. Keep your bulk storage (such as file server volumes) on cheap SATA drives, and back them up regularly (which you should do with or without a RAID array).
Another idea might be to spend the RAID card money on a NAS device. You can get a 4 or 5 bay NAS device for the price of a new RAID card these days, and they can be used for multiple ESXi hosts as well as other uses. Plus, they handle their own RAID.
Ideally of course, you want your server with RAID storage, ECC memory, IPMI or other out of band management, SSD data stores, a SAN, a backup system with a robot, etc. But if you’re building a budget box, I’m thinking the RAID card can be skipped.
December 12, 2011 2 Comments
Right now, all of my personal computers (yeah, I have a lot) boot from SSD. I have a MacBook Pro, a MacBook Air, and a Windows 7 workstation, all booting from SSD. And the ESXi host I have will soon have an SSD datastore.
And let me reiterate what I’ve said before: I will never have a computer that boots from spinning rust again. The difference between a computer with an SSD and a computer with a HDD is astounding. You can take even a 3 year old laptop, put an SSD in there, and for the most part it feels way faster than the latest 17 inch beast running with a HDD.
Yeah yeah yeah, SSD from your cold, dead hands
So why are SSDs so bad-ass? Is it the transfer speeds? No, it’s the IOPS. The transfer speeds in SSDs are a couple of times better than a HDD, but the IOPS are orders of magnitude better. And for desktop operating systems (as well as databases), IOPS are where it’s at. Check out this graph (bottom of page) comparing an SSD to several HDDs, some of which run at 15,000 RPM.
As awesome and unicorny as that is, SSD storage still comes at a premium. Even with the spike in prices caused by the tragic flooding in Thailand, SSDs are still significantly more expensive per GB than HDDs. So it doesn’t make sense to make all of our storage SSD. There’s still a need for inexpensive, slow bulk storage, and that’s where HDDs shine.
But now that we have SSDs for speed, 7200 RPM is overkill for our other needs. I just checked my iTunes directory, and it’s 250 GB of data. There’s nothing that MP3 sound files, HD video files, backups, etc. need in terms of performance that would necessitate a 7200 RPM drive. A 5400 RPM drive will do just fine. You might notice the difference while copying files, but the difference won’t be that great when compared to a 7200 RPM drive. Neither is in any position to flood a SATA2 connection, let alone SATA3.
Even with those USB portable hard drives which have 5400 RPM drives in them, it’s still more than enough to flood USB 2.0.
And this got me thinking: How useful are 7200 RPM drives anymore? I remember taking a pair of hard drives back to Fry’s because I realized they were 5400 RPM drives (I wasn’t paying attention). Now, I don’t care about RPMs. Any speed will do for my needs.
Hard drives have been the albatross of computer performance for a while now. This is particularly true for desktop operating systems: They eat up IOPS like candy. A spinning disk is hobbled by the spindle. In data centers you can get around this by adding more and more spindles into some type of array, thereby increasing IOPS.
Enterprise storage is another matter. It’s not likely Enterprise SANs will give up spinning rust any time soon. Personally, I’m a huge fan of companies like Pure Storage and SolidFire that have all-SSD solutions. The IOPS you can get from these all-flash arrays are astounding.
November 3, 2011 13 Comments
One recurring theme from virtually every one of the Network Field Day 2 vendor presentations last week (as well as the OpenFlow symposium) was affectionately referred to as “The Problem”.
It was a theme because, as vendor after vendor gave a presentation, they essentially said the same thing when describing the problem they were going to solve. For us the delegates/bloggers, it quickly went from the problem to “The Problem”. We’d heard it over and over again so often that during the (5th?) iteration of the same problem we all started laughing like a group of Beavis and Butt-Heads during a vendor’s presentation, and we had to apologize profusely (it wasn’t their fault, after all).
In fact, I created a simple diagram with some crayons brought by another delegate to save everyone some time.
But with The Problem on repeat it became very clear that the majority of networking companies are all tackling the very same Problem. And imagine the VC funding that’s chasing the solution as well.
So what is “The Problem”? It’s a multi-faceted and interrelated set of issues:
Virtualization Has Messed Things Up, Big Time
The biggest problem of them all was caused by the rise of virtualization. Virtualization has disrupted much of the server world, but the impact that it’s had on the network is arguably orders of magnitude greater. Virtualization wants big, flat networks, just when we got to the point where we could route Layer 3 as fast as we could switch Layer 2. We’d just gotten to the point where we could get our networks small.
And it’s not just virtualization in general, much of its impact comes from the very simple act of vMotion. VMs want to keep their IPs the same when they move, so now we have to bend over backwards to get it done. Add to that the vSwitch sitting inside the hypervisor, and the limited functionality of that switch (and who the hell manages it anyway? Server team? Network team?).
4000 VLANs Ain’t Enough
If you’re a single enterprise running your own network, chances are 4000+ VLANs are sufficient (or perhaps not). In multi-tenant environments with thousands of customers, 4000+ VLANs quickly become a problem. There is a need for some type of VLAN multiplier, something like QinQ or VXLAN, which gives us 4096 times 4096 VLANs (16 million or so).
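The numbers fall straight out of the field widths: a 12-bit VLAN ID in an 802.1Q tag, two stacked tags for QinQ, and a 24-bit network identifier for VXLAN. A quick check (ignoring the couple of VLAN IDs that are reserved, which is a simplification on my part):

```python
VLAN_ID_BITS = 12    # 802.1Q VLAN ID field
VXLAN_VNI_BITS = 24  # VXLAN network identifier field

single_tag = 2 ** VLAN_ID_BITS   # one 802.1Q tag
qinq = single_tag * single_tag   # outer tag x inner tag
vxlan = 2 ** VXLAN_VNI_BITS      # one 24-bit VNI

print(f"802.1Q: {single_tag:,}")
print(f"QinQ:   {qinq:,}")
print(f"VXLAN:  {vxlan:,}")
```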
Spanning Tree Sucks
One of my first introductions to networking was accidentally causing a bridging loop on a 10 megabit Ethernet switch (with a 100 Mbit uplink) as a green Solaris admin. I’d accidentally double-connected a hub, and I noticed the utilization LED on the switch went from 0% to 100% when I plugged a certain cable in. I entertained myself with plugging in and unplugging the port to watch the utilization LED fluctuate (that is, until the network admin stormed in and asked what the hell was going on with his network).
And thus began my love affair with bridging loops. After the Brocade presentation where we built a TRILL-based Fabric very quickly, with active-active uplinks and nary a port in blocking mode, Ethan Banks became a convert to my anti-spanning tree cause.
OpenFlow offers an even more comprehensive (and potentially more impressive) solution as well. More on that later.
Layer 2 Switching Isn’t Scaling
The current method by which MAC addresses are learned in modern switches causes two problems: Only one viable path can be allowed at a time (only way to prevent loops is to prevent multiple paths by blocking ports), and large Layer 2 networks involve so many MAC addresses that it doesn’t scale.
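For anyone who hasn’t looked at how a switch learns, here’s a bare-bones sketch of flood-and-learn. The class and method names are mine, and real switches add aging timers, VLANs, and hardware table limits that this ignores.

```python
class LearningSwitch:
    """Toy Layer 2 switch: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def forward(self, in_port, src_mac, dst_mac):
        """Return the list of ports a frame is sent out of."""
        self.mac_table[src_mac] = in_port  # learn (or re-learn) the sender
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood everywhere except where it came from.
            return [p for p in self.ports if p != in_port]
        return [] if out_port == in_port else [out_port]


sw = LearningSwitch(ports=[1, 2, 3])
print(sw.forward(1, "aa", "bb"))  # "bb" not learned yet, so the frame is flooded
print(sw.forward(2, "bb", "aa"))  # "aa" was learned on port 1
print(len(sw.mac_table))          # one table entry per MAC address seen
```

Both problems are visible here: flooding is what turns a second path into a loop, and the table needs an entry for every MAC address in the Layer 2 domain.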
From QFabric, to TRILL, to OpenFlow (to half a dozen other solutions), Layer 2 transforms into something Layer 3-like. MAC addresses are routed just like IP addresses, and the MAC address becomes just another tuple (another recurring word) for a frame/packet/segment traveling from one end of your datacenter to another. In the simplest solution (probably TRILL?) MAC learning is done at the edge.
There’s A Lot of Shit To Configure
Automation is coming, and in a big way. Whether it’s a centralized controller environment, or magical software powered by unicorn tears, vendors are chomping at the bit to provide some sort of automation for all the shit we need to do in the network and server world. While certainly welcomed, it’s a tough nut to crack (as I’ve mentioned before in Automation Conundrum).
Data center automation is a little bit like the Gom Jabbar. They tried and failed you ask? They tried and died.
“Pain. And an EULA that you must agree to. Also, man-years of customization. So yeah, pain.”
Ethernet Rules Everything Around Me
It’s quite clear that Ethernet has won the networking wars. Not that this is any news to anyone who’s worked in a data center for the past ten years, but it has struck me that no other technology has been so much as even mentioned as one for the future. Bob Metcalfe had the prophetic quote that Stephen Foskett likes to use: “I don’t know what will come after Ethernet, but it will be called Ethernet.”
But there are limitations (Layer 2 MAC learning, virtualization, VLANs, storage) that need to be addressed for it to become what comes after Ethernet. Fibre Channel is holding ground, but isn’t exactly expanding, and some crazy bastards are trying to merge the two.
Most people agree that storage is going to end up on our network (converged networking), but there are as many opinions on how to achieve this network/storage convergence as there are nerd and pop culture references in my blog posts. Some companies are pro-iSCSI, others pro FC/NFS, and some like Greg Ferro have the purest of all hate: He hates SCSI.
So that’s “The Problem”. And for the most part, the articles on Networking Field Day, and the solutions the vendors propose will be framed around The Problem.
September 30, 2011 12 Comments
At Tech Field Day 8, we saw presentations from two vendors that had an all-flash SAN offering, taking on a storage problem that’s been brewing in data centers for a while now: the skewed performance/capacity scale.
While storage capacity has been increasing exponentially, storage performance hasn’t caught up nearly that fast. In fact, performance has been mostly stagnant, especially in the area where it counts: Latency and IOPS (I/O Operations Per Second).
In modern data centers, capacity isn’t so much of an issue with storage. Neither is the traditional throughput metric, such as megabytes per second. What really counts is IOPS and latency/seek time. Don’t get me wrong, some data center applications certainly have capacity requirements, as well as potential throughput requirements, but for the most part these are easily met by today’s technology.
IOPS and latency are super critical for virtual desktops (and desktops in general) and databases. If your computer is sluggish, it’s probably not a lack of RAM or CPU, by and large it’s a factor of IOPS (or lack thereof).
There are a few tricks that storage administrators and vendors have up their sleeve to increase IOPS and drop latency.
In a RAID array, you can scale IOPS linearly by just throwing more disks at the array. If you have a drive that does 100 IOPS, add a second drive for a RAID 0 (stripe) and you’ve got double the IOPS. Add a third and you’ve got 300 IOPS (and of course add more for redundancy).
Another trick that storage administrators have up their sleeve is the technique known as “short stroking”, where only a portion of the drive is used. On a spinning platter, the outer tracks pass under the head the fastest, giving the best performance. If you only format that outer portion, the physical drive head doesn’t have to travel as far. This can reduce seek time substantially.
Tiered storage can help with both latency and IOPS, where a combination of NVRAM, SSDs, and hard drives is used. “Hot” data is accessed from high-speed RAM cache, “warm” data is on a bank of SSDs, and “cold” data would be stored on cheaper SAS or (increasingly) consumer SATA drives.
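To get a feel for why the hit rate in each tier matters so much, here is a rough sketch that computes an average access latency as a weighted sum across tiers. Every number in it (the per-tier latencies and the two access mixes) is a made-up assumption for illustration, not a measurement from any real array.

```python
# Hypothetical per-tier access latencies in microseconds (illustrative only).
TIER_LATENCY_US = {
    "ram_cache": 100,
    "ssd": 500,
    "hdd": 8000,
}

def average_latency_us(access_mix):
    """Weighted average latency, given the fraction of I/O served by each tier."""
    total = sum(access_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(fraction * TIER_LATENCY_US[tier]
               for tier, fraction in access_mix.items())

# Two hypothetical workloads: one that mostly hits cache, one that misses more.
mostly_hot = {"ram_cache": 0.90, "ssd": 0.08, "hdd": 0.02}
more_misses = {"ram_cache": 0.70, "ssd": 0.15, "hdd": 0.15}

hot_latency = average_latency_us(mostly_hot)
miss_latency = average_latency_us(more_misses)

print(f"mostly hot:  {hot_latency:.0f} us average")
print(f"more misses: {miss_latency:.0f} us average")
```

With these assumed figures, the spinning disks dominate the average even when they serve only a small slice of the I/O, so a modest rise in cache misses drags the whole array down.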
And still our demand for IOPS is insatiable, and the tricks in some cases aren’t keeping up. Short stroking only goes so far, and cache misses can really impact performance for tiered storage. While IOPS scale linearly with drive count, hitting the IOPS we need can sometimes mean racks full of spinning rust, while only using a tenth of the actual capacity. That’s a lot of wasted space and wasted power.
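Here is a back-of-the-envelope sketch of how that waste comes about. The 100 IOPS per drive matches the RAID example earlier in this post; the target IOPS, the per-drive capacity, and the capacity the application actually needs are hypothetical values picked for illustration.

```python
import math

def drives_needed(target_iops, iops_per_drive):
    """Spindles required, assuming IOPS scale linearly with drive count."""
    return math.ceil(target_iops / iops_per_drive)

def capacity_utilization(needed_tb, drive_count, tb_per_drive):
    """Fraction of the raw capacity that the workload actually uses."""
    return needed_tb / (drive_count * tb_per_drive)

# Hypothetical workload and drive figures (assumptions, not vendor data).
TARGET_IOPS = 20_000
IOPS_PER_DRIVE = 100   # same figure as the RAID example in this post
TB_PER_DRIVE = 0.5
NEEDED_TB = 10.0

drive_count = drives_needed(TARGET_IOPS, IOPS_PER_DRIVE)
utilization = capacity_utilization(NEEDED_TB, drive_count, TB_PER_DRIVE)

print(f"{drive_count} drives needed, {utilization:.0%} of raw capacity used")
```

Under these assumptions, the IOPS target forces you to buy 200 drives (100 TB raw) to serve 10 TB of data. The rest of the space is there only because you needed the spindles.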
And want to hear a depressing fact?
A high-end enterprise SAS 15,000 RPM drive (which spins faster than most jet engines) gives you about 150 IOPS in performance (depending on the workload of course). A good consumer grade SSD from Newegg gives you around 85,000 IOPS. That means you would need almost 600 drives to equal the performance of one consumer grade SSD.
That’s enough to cause anyone to have a Bill O’Reilly moment.
600 drives? Fuck it, we’ll do it with all flash!
No one is going to put their entire database or virtual desktop infrastructure on a single flash drive of course. And that’s where vendors like Pure Storage and SolidFire come into play. (You can see Pure Storage’s presentation at Tech Field Day 8 here. SolidFire’s can be seen here.)
The overall premise with the we’ll-do-it-all-in-flash play is that you can take a lot of consumer grade flash drives, use the shitload of IOPS that they bring, and combine it with a lot of storage controller CPU power for deduplication and compression. With that combination, they can offer an all-flash based array at the same price per gig as traditional arrays composed of spinning rust (disk drives).
How many IOPS are we talking about? SolidFire’s SF3010 claims 50,000 IOPS per 1 RU node. That would replace over 300 traditional drives, which I don’t think you can put in 1 RU. Pure Storage claims 300,000 IOPS in 8 RU of space. With a traditional array, you’d need about 2,000 drives, also unlikely to fit in 8 RU. Also, imagine the power savings, with only 250 watts needed for SolidFire’s node, and 1,300 watts for the Pure Storage cluster. And both allow you to scale up by adding more nodes.
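The drive-count comparisons above are simple division, sketched below. The IOPS and wattage figures are the ones quoted in this post (vendor claims, not something I measured); the dictionary keys and function names are just labels I made up for the example.

```python
import math

HDD_IOPS = 150  # 15,000 RPM SAS drive estimate quoted in this post

# Claimed IOPS figures as quoted in this post.
claimed_iops = {
    "consumer SSD": 85_000,
    "SolidFire SF3010 node (1 RU)": 50_000,
    "Pure Storage array (8 RU)": 300_000,
}

# Power figures as quoted in this post, in watts.
claimed_watts = {
    "SolidFire SF3010 node (1 RU)": 250,
    "Pure Storage array (8 RU)": 1300,
}

def equivalent_drives(iops, per_drive=HDD_IOPS):
    """How many 150-IOPS spindles it takes to match a given IOPS figure."""
    return math.ceil(iops / per_drive)

equivalents = {name: equivalent_drives(iops) for name, iops in claimed_iops.items()}
iops_per_watt = {name: claimed_iops[name] / watts for name, watts in claimed_watts.items()}

for name, drives in equivalents.items():
    print(f"{name}: about {drives} spinning drives")
for name, ratio in iops_per_watt.items():
    print(f"{name}: roughly {ratio:.0f} IOPS per watt")
```

This ignores RAID overhead, read/write mix, and everything else that makes real workloads messy, so treat it as a sanity check on the orders of magnitude rather than a sizing tool.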
You wire them into your SAN the traditional ways, as well. The Pure Storage solution has options for 10 Gbit iSCSI and 8 Gbit Fibre Channel, while the SolidFire solution is iSCSI only. (Sadly, neither supports FCoE or FCoTR.)
For organizations that are doing virtual desktops or databases, an all-flash storage array with the power savings and monster IOPS must look more tantalizing than a starship full of green-skinned girls does to Captain Kirk.
There is a bit of a controversy in that many of the all-flash vendors will tell you capacity numbers with deduplication and compression taken into account. At the same time, if the performance is better than spinning rust even with the compression/dedupe, then who cares?
So SSD it is. And as anyone who has an SSD in their laptop or desktop will tell you, that shit is choice. Seriously, I get all Charlton Heston about my SSD.
It’s not all roses and unicorn-powered SSDs. There are two issues with the all-flash solution thus far. One is that they don’t have a name like NetApp, EMC, or Fujitsu, so there is a bit of a trust issue there. The other issue is that many have some negative preconceptions about flash, such as a high failure rate (due to a series of bad firmwares from vendors) and the limited write cycles of memory cells (true, but mitigable). Pure Storage claims to have never had a drive fail on them (Amy called them flash drive whisperers).
Still though, check them (and any other all-SSD vendor) out. This is clearly the future in terms of high performance storage where IOPS are needed. Spinning rust will probably rule the capacity play for a while, but you have to imagine its days are numbered.
September 19, 2011 7 Comments
I was at Arista on Friday for Tech Field Day 8, and when FCoE was brought up (always a good way to get a lively discussion going), Andre Pech from Arista (who did a fantastic job as a presenter) brought up an article written by Douglas Gourlay, another Arista employee, entitled “Why FCoE is Dead, But Not Buried Yet”.
FCoE: “I feel happy!”
It’s an interesting article, because much of the player-hating seems to be directed at TRILL, not FCoE, and as J Metz has said time and time again, you don’t need TRILL to do FCoE if you do FCoE the way Cisco does (by using Fibre Channel Forwarders in each FCoE switch). Arista, not having any Fibre Channel skills, can’t do it this way. If they were to do FCoE, Arista (like Juniper) would need to do it the sparse-mode/FIP-snooping FCoE way, which would need a non-STP way of handling multi-pathing such as TRILL or SPB.
Jayshree Ullal, the CEO of Arista, hated on TRILL and spoke highly of VXLAN and NVGRE (Arista is on the standards body for both). I think part of that is that, like Cisco, not all of their switches will be able to support TRILL, since TRILL requires new Ethernet silicon.
Even the CEO of Arista acknowledged that FCoE worked great at the edge, where you plug a server with an FCoE CNA into an FCoE switch, and the traffic is sent along to native Ethernet and native Fibre Channel networks from there (what I call single-hop or no-hop FCoE). This doesn’t require any additional FCoE infrastructure in your environment, just the edge switch. The Cisco UCS Fabric Interconnects are a great example of this no-hop architecture.
I don’t think FCoE is quite dead, but I have to imagine that it’s not going as well as vendors like Cisco have hoped. At least, it’s not been the success that some vendors have imagined. And I think there are two major contributors to FCoE’s failure to launch, and both of those reasons are more Layer 8 than Layer 2.
Old Man of the Data Center
Reason number one is also the reason why we won’t see TRILL/Fabric Path deployed widely: It’s this guy:
Don’t let him trap you into hearing him tell stories about being a FDDI bridge, whatever FDDI is
The Catalyst 6500 series switch. This is “The Old Man of the Data Center”. And he’s everywhere. The switch is a bit long in the tooth, and although capacity is much higher on the Nexus 7000s (and even the 5000s in some cases), the Catalyst 6500 still has a huge install base.
And it won’t ever do FCoE.
And it (probably) won’t ever do TRILL/Fabric Path (spanning-tree fo-evah!)
The 6500s aren’t getting replaced in significant numbers from what I can see. Especially with the release of the Sup 2T supervisor for the 6500es, the 6500s aren’t going anywhere anytime soon. I can only speculate as to why Cisco is pursuing the 6500 so much, but I think it comes down to two reasons:
- The idea of “Don’t let the customer replace the chassis, lest they replace it with a competitor”
- Cisco is afraid of eating its young. Apple is the opposite: love ‘em or hate ‘em, they weren’t afraid to cannibalize a highly lucrative business (iPods) with an unproven (but now proven) product (iPhone). Cisco doesn’t have the guts to cannibalize the 6500 sales.
So reason number two? I think Cisco jumped the gun. They’ve been pushing FCoE for a while, but they weren’t quite ready. It wasn’t until July 2011 that Cisco released NX-OS 5.2, which is what’s required to do multi-hop FCoE in the Nexus 7000s and MDS 9000. They’ve had the ability to do multi-hop FCoE in the Nexus 5000s for a bit longer, but not much. Yet they’ve been talking about multi-hop for longer than it was possible to actually implement. Cisco has had a multi-hop FCoE reference architecture posted since March 2011 on their website, showing a beautifully designed multi-hop FCoE network with 5000s, 7000s, and MDS 9000s, that for months wasn’t possible to implement. Even today, if you wanted to implement multi-hop FCoE with Cisco gear (or anyone else), you’d be a very, very early adopter.
So no, I don’t think FCoE is dead. No-hop FCoE is certainly successful (even Arista’s CEO acknowledged as much), and I don’t think even multi-hop FCoE is dead, but it certainly hasn’t caught on (yet). Will multi-hop FCoE catch on? I’m not sure. We’ll have to see.