I’ve lost track of the number of times in my career that someone has said “How did you do that?”, “Wow, that’s amazing!” or “I never would have figured that out”. My answer is typically that it involved a little bit of black magic – but then I follow it up with an actual explanation. For those less technical, some things really can seem like a bit of magic. Solved an outage in minutes? Knew about an impending issue before it happened? Yeah – that can be quite magical.
So I have a few posts that I plan on scattering here and there, which will cover some tips on how to become a networking magician. I will aim to provide detail behind some of the expert intuition and skills which can amaze and confuse others. Let’s get started with Black Magic Tip #1:
Monitor your network – No, really, just monitor your network. I don’t mean “Oh, there is a ping alert for that switch” – I mean pay attention. How will you ever know what an anomaly looks like, if you don’t have an internal baseline of what your network should look like? This tip really takes time, but I’ve found that it pays off in the long run.
I use a couple of open-source tools, like Observium and SmokePing, to track metrics on my networks. I spend 5-10 minutes each morning skimming through the pretty graphs to get an idea of how we are performing today. Every week or two, I will spend a bit more time on a deeper dive into the metrics. However, the important thing here is not the time spent, but the fact that I look at these. In the back of my mind, I keep a mental note of the general averages for bandwidth, latency, packet loss, etc.
Once in a while, I might look at a graph and notice that something is a bit off. Defining the word ‘off’ in this sense is difficult. Maybe a router interface that averages 20Mb/s spiked to over 40Mb/s through the night. Maybe traffic was actually far lower than the average. Sometimes I might see a slight increase or drop in latency between a pair of data centers. Some of these things could mean absolutely nothing – but in many cases they are an indicator of something else.
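To make ‘off’ a little more concrete, here is a minimal Python sketch of one way to compare a new reading against a baseline. It is only an illustration, not something Observium or SmokePing does for you: the function name `is_off`, the sample numbers, and the threshold of three standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_off(baseline, latest, threshold=3.0):
    """Return True when `latest` sits more than `threshold` standard
    deviations away from the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    avg = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value counts as off.
        return latest != avg
    return abs(latest - avg) / spread > threshold

# Hypothetical nightly averages for one router interface, in Mb/s.
history = [19.5, 20.4, 21.0, 19.8, 20.2, 20.6, 19.9]

print(is_off(history, 41.0))  # a spike to roughly double the usual rate
print(is_off(history, 20.3))  # an ordinary night
print(is_off(history, 5.0))   # traffic far lower than the average
```

The check is deliberately symmetric, so a sudden drop is flagged just like a spike. A flag like this is only a prompt to go and look at the graph; it does not replace knowing what your network normally looks like.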
As an example: a few weeks ago, I noticed that the average latency between two data centers had increased slightly, and SmokePing was reporting occasional packet loss of up to 5%. I also track historical traceroute tests, so when I reviewed those, I found that the upstream carrier’s route had changed about 2-3 hops out. No big issues – but I made a note of these findings. A few days later, we began experiencing a spike in packet loss between those two data centers. Rather than being caught completely off-guard, I already had all of the information I needed to work with the upstream carrier. Issue resolved – quickly, simply, and without wasting time during a network degradation event.
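Comparing saved traceroutes is easy to automate. The sketch below assumes each traceroute has already been reduced to an ordered list of hop addresses; the function name `changed_hops` and the addresses (taken from the reserved documentation ranges) are hypothetical and not from a real incident.

```python
from itertools import zip_longest

def changed_hops(previous, current):
    """Return (hop_number, old_address, new_address) for every hop that
    differs between two traceroutes. A missing hop is reported as None."""
    changes = []
    for number, (old, new) in enumerate(zip_longest(previous, current), start=1):
        if old != new:
            changes.append((number, old, new))
    return changes

# Hypothetical hop lists for the same destination, a week apart.
last_week = ["192.0.2.1", "192.0.2.254", "198.51.100.9", "203.0.113.80"]
today = ["192.0.2.1", "192.0.2.254", "203.0.113.17", "203.0.113.80"]

for number, old, new in changed_hops(last_week, today):
    print(f"hop {number}: {old} -> {new}")
```

Keeping the output of a comparison like this alongside the latency and loss graphs gives you something specific to hand to the carrier when you open a ticket.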
Let me just reiterate that I don’t expect everyone out there to stare at bandwidth graphs all day long – that’s not going to get you anywhere. However, we do need to spend a little bit of time giving our network the attention that it deserves, even if it’s just a quick check-up every day. Once you have a good idea of how things typically operate, it becomes much simpler to pinpoint issues and get ahead of them – which means they get resolved without wasting time.
Ever had someone claim you’ve performed magic? Tell me about your experiences in the comments!