Ode To MRTG
December 7, 2011
I was wasting time (sorry, keeping up with current events) on Twitter when I saw a tweet from someone with a familiar name, though I couldn’t quite place where I knew it from: Tobi Oetiker (@oetiker). Then it came to me. He’s the author of the fantastic MRTG, among other tools.
MRTG was my favorite trending utility back in the day. “But Tony, weren’t you a condescending Unix administrator back then, and isn’t MRTG a networking tool?” Yes, yes I was. But MRTG isn’t just for trending network links. You can use it to graph bandwidth in and out of servers, as well as other metrics like CPU utilization, memory utilization, and number of processes. I had a whole set of standard metrics I would graph with MRTG, depending on the device.
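For metrics that aren’t exposed over SNMP, MRTG can also run an external program and read four lines from it: two values, an uptime string, and a target name. Here’s a minimal sketch of such a script, not anything I actually ran back then. The function names, the default host name, and the choice to scale load average by 100 (so it can be reported as a whole number) are all assumptions for illustration.

```python
import os


def mrtg_lines(first, second, uptime, name):
    """Build the four lines an MRTG external script is expected to print.

    Line 1: first value, line 2: second value,
    line 3: uptime string, line 4: target name.
    """
    return "\n".join([str(int(round(first))), str(int(round(second))), uptime, name])


def load_average_report(hostname="example-host"):
    # hostname is a placeholder; os.getloadavg() is Unix-only and returns
    # the 1, 5 and 15 minute load averages as floats.
    one_min, five_min, _ = os.getloadavg()
    # Scale by 100 so a load of 1.25 is reported as 125 (an assumption
    # made here so the values are whole numbers).
    return mrtg_lines(one_min * 100, five_min * 100, "unknown", hostname)


if __name__ == "__main__":
    print(load_average_report())
```

In an MRTG config, a script like this would be referenced from a Target line wrapped in backticks, with the two values taking the place of the usual “in” and “out” traffic counters.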
Connection rate, open connections, and bandwidth for an F5 load balancer back when “Friends” was still on the air
With MRTG combined with net-snmp (or, in Windows’ case, the built-in SNMP service), I could graph just about anything on the servers I was responsible for. This saved my ass so many times. Here are a couple of examples:
Customer: “We were down for 5 hours!”
Me: “No, actually your server was down for 5 minutes. Here’s the graph.”
Another customer: “Your network is slow!”
Me: “Our network graphs show very low latency and plenty of capacity. In addition, here’s a graph showing CPU utilization on your servers spiking to 100% for several hours at a time. It’s either time to expand your capacity, or perhaps look at your application to see why it’s using up so many resources.”
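To make the first exchange concrete, here’s a rough sketch of how stored poll results can settle a “how long was it down?” argument. The sample format, a list of (timestamp, is_up) pairs, is hypothetical and is not MRTG’s log format; it just stands in for whatever history your trending tool keeps.

```python
def longest_outage_seconds(samples):
    """Return the longest outage, in seconds, seen in the samples.

    samples: list of (unix_timestamp, is_up) tuples, sorted by time.
    An outage runs from the first failed poll to the next successful
    poll (or to the last sample if it never recovers).
    """
    longest = 0
    outage_start = None
    for timestamp, is_up in samples:
        if not is_up and outage_start is None:
            outage_start = timestamp  # outage begins
        elif is_up and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # outage ends
    if outage_start is not None and samples:
        # Still down at the end of the data.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Polls every 300 seconds; one failed poll in the middle.
polls = [(0, True), (300, False), (600, True), (900, True)]
print(longest_outage_seconds(polls) / 60, "minutes")
```

The answer is only as precise as the polling interval, which is one more argument for collecting often.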
In the late 90s, I set up a huge server farm for a major music television network. As part of my automated installs, I included MRTG monitoring for every server’s switch port, NIC, CPU, and memory, as well as other server-related metrics. I also graphed the F5 load balancer’s metrics for all of the VIPs (bandwidth, connection rate). Feeling proud of myself, I showed the graphs to one of the customer’s technical executives, thinking he’d look at them and say “oh, that’s nice.”
Instead, he called me several times a day for a month asking me (very good) questions about what all the data meant. He absolutely loved it, and I never built a server farm without it (or something like it).
Plenty of tools can show you graphs, but MRTG and tools like it trend not just when you’re looking, but when you’re not. When you’re sleeping, it’s collecting data. When you’re out to lunch, it’s collecting data. When you’re listening to the Beastie Boys or whoever the kids are listening to these days, it’s collecting data. Data that you can pull up at a later date. MRTG was fairly simple, but extremely powerful.
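What does a poller actually do on each pass? For bandwidth, SNMP interface counters such as ifInOctets only ever count up, so the tool has to turn two successive readings into a rate. Below is a sketch of that arithmetic, assuming a 32-bit counter that wraps at most once between polls. The function name and parameters are mine, not MRTG’s.

```python
def counter_rate(previous, current, seconds, counter_bits=32):
    """Convert two cumulative counter readings into a per-second rate.

    previous, current: raw counter values from two consecutive polls.
    seconds: time elapsed between the polls.
    counter_bits: counter width; assumes at most one wrap between polls.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = current - previous
    if delta < 0:
        # The counter rolled over its maximum and started again from zero.
        delta += 2 ** counter_bits
    return delta / seconds


# Two readings taken 300 seconds (five minutes) apart.
print(counter_rate(1000, 301000, 300))       # 1000.0 bytes per second
print(counter_rate(2**32 - 100, 200, 300))   # 1.0, after a counter wrap
```

Run that on a schedule, store the result, and you have a trend whether or not anyone is watching.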
MRTG taught me several important lessons with respect to system monitoring. Perhaps the most important is that monitoring is really two very different disciplines: trending and alerting. A mistake a lot of operations teams make is confusing the two. Probably the biggest difference between them is that with trending, you can never do too much. With alerting, it’s very easy to over-alert.
How many times have you, in either a server or network administrator role, been the victim of “alert creep”, where alarm after alarm is configured in your network monitoring tool, sending out emails and traps, until you’re so inundated with noise that you can’t tell the difference between the system crying wolf and a real issue?
It’s easy to over-alert. However, it’s very difficult to over-trend. And honestly, trending data is far more useful to me than 99% of alerting. Usually a customer is my best alerting mechanism; they almost always seem to know well before my monitoring system does. And having historical trending data helps me get to the bottom of things much more quickly.
Many have improved upon the art of trending with tools like Observium and even RRDTool (also written by Tobi Oetiker). Many more tried but succeeded only in making overly complicated messes that ignored the strength of MRTG, which was its simplicity: graphing and keeping various metrics, and providing a simple way to get at them when needed. MRTG was the first killer app for not only network administrators but server administrators too. And it proved how important the old adage is:
If you didn’t write it down, it didn’t happen.