If there is one thing you learn very quickly in networking, it’s that everything is always the fault of the network. Two systems cannot communicate? Yeah, that’s always a network problem. Something is inaccessible? Probably the network. What about that broken toaster? Definitely a network issue.
When you are starting off as a less experienced network engineer, this can easily get overwhelming: a ton of other teams blame the network infrastructure for problems that might be entirely unrelated. As a result, a good part of your job will likely be dedicated to proving that the root cause of the problem isn’t the network.
I threw together a few tips that should give you a head start on defending yourself against the angry sysadmins out there:
Get a good handle on the problem – Above all else, it’s extremely difficult to troubleshoot something without a good description of the problem – so get that first. Which systems are having the problem? Get host names or IP addresses. When did the problem start? Did it ever work? What behavior is being seen? Can it easily be replicated, so that you can watch logs in real-time? If not, get the last date/time that the issue occurred.
Know your infrastructure – Being a good network admin means understanding traffic flows and routing through your infrastructure. When someone says two systems are having a problem communicating, you should already have a good idea of what network components reside in the space between them. Are these on the same network segment? Is there a firewall (or multiple) in between? Does this traffic utilize a proxy or load balancer?
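To make the "same network segment?" question concrete, here is a minimal Python sketch using the standard `ipaddress` module. The function name `same_segment`, the addresses, and the prefix lengths are all made up for illustration, so plug in whatever your interfaces actually use.

```python
import ipaddress

def same_segment(host_a: str, host_b: str, prefix_len: int) -> bool:
    """Return True if host_b falls inside host_a's network.

    prefix_len is assumed to be the prefix length configured on
    host_a's interface (for example 24 for a /24).
    """
    # strict=False lets us pass a host address instead of the network address.
    network = ipaddress.ip_network(f"{host_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(host_b) in network

# Hypothetical addresses, purely for illustration.
print(same_segment("10.1.20.15", "10.1.20.200", 24))  # True: same /24
print(same_segment("10.1.20.15", "10.2.30.40", 24))   # False: different networks
```

If this returns False, the traffic has to be routed, which is your cue to go look for the gateways, firewalls, or load balancers sitting in the path.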
Check logs – Once you know which systems are in the way, check through the logs for those systems. Particularly with firewalls, do your best to filter the logs down to just the traffic between the two systems. Seeing ports blocked or traffic being dropped? A lot of firewall platforms will include enough detail in the traffic logs to quickly identify the issue, if it is in fact a network problem.
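If your firewall can export its traffic logs, a few lines of Python can do the filtering for you. Treat the sketch below as an illustration only: the CSV layout (`timestamp,src,dst,dport,action`), the sample rows, and the function name `traffic_between` are all invented here, and every firewall platform has its own export format.

```python
import csv
import io

# Made-up sample export; real firewall logs will look different.
SAMPLE_LOG = """\
timestamp,src,dst,dport,action
2024-01-01T10:00:01,10.1.20.15,10.2.30.40,443,allow
2024-01-01T10:00:02,10.1.20.99,10.2.30.40,443,allow
2024-01-01T10:00:03,10.1.20.15,10.2.30.40,1433,deny
2024-01-01T10:00:04,10.2.30.40,10.1.20.15,51515,allow
"""

def traffic_between(log_file, host_a, host_b):
    """Return the log rows exchanged between the two hosts, in either direction."""
    pair = {host_a, host_b}
    return [row for row in csv.DictReader(log_file)
            if {row["src"], row["dst"]} == pair]

rows = traffic_between(io.StringIO(SAMPLE_LOG), "10.1.20.15", "10.2.30.40")
denied = [row for row in rows if row["action"] == "deny"]

for row in denied:
    print(f'{row["timestamp"]} {row["src"]} -> {row["dst"]}:{row["dport"]} denied')
```

Matching on the pair of addresses rather than a fixed source and destination keeps the return traffic in view, which matters when the drop is happening on the way back.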
The basics are still important: Check TCP flags – This one has honestly saved me more often than not. Two systems aren’t establishing a connection, and the sysadmin says they are just receiving a “connection timeout” error. Check through the firewall logs: yes, we see the typical TCP handshake, but then the remote system sends a TCP RST back to the client. In most cases, this means the connection is actually succeeding from a network perspective. The target system is getting something that it doesn’t like from the client, so the application kills the session. The same goes for a client system sending the RST.
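You can also see the difference from the client side without touching the firewall. Here is a rough Python sketch using only the standard `socket` module; the function name `probe_tcp` and the result strings are my own invention. A completed handshake means the network path delivered the connection, a refusal means something answered the SYN with an RST, and a timeout means nothing came back at all.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # SYN, SYN-ACK, ACK all made it through: the path works.
            return "handshake completed"
    except ConnectionRefusedError:
        # An RST came back in reply to our SYN.
        return "refused"
    except socket.timeout:
        # No reply at all, often a sign of a silent drop along the path.
        return "timed out"
    except OSError as exc:
        # Anything else (no route, DNS failure, and so on).
        return f"other error: {exc}"
```

Take the result as a hint rather than proof: a firewall configured to reject traffic can send an RST on a host's behalf, so a refusal still deserves a look at the logs. And if the handshake completes but the application still fails, you are most likely looking at the post-handshake RST described above.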
Wireshark – A lot of people see this as the nuclear option. All else has failed, so we have to resort to a packet capture. I used to think this way too until about a year or two ago. The vast amount of information within a full packet capture can easily be overwhelming – but once you get a handle on how to read it, it can also be incredibly useful. Raw packet captures don’t lie – and all the information you need is within those details. Start a capture, reproduce the issue, then analyze the results.
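Part of what makes a capture less scary is realizing how little magic is involved. As an illustration (not a replacement for Wireshark), here is a small Python sketch that pulls the ports and flags out of a raw TCP header with the standard `struct` module. The function name `decode_tcp_header` and the sample header are made up for this example; the field layout follows the standard TCP header.

```python
import struct

# Flag bits in the 16-bit "data offset / flags" field of the TCP header.
FLAG_BITS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]

def decode_tcp_header(header: bytes) -> dict:
    """Decode the ports, sequence numbers, and flags from a raw TCP header."""
    if len(header) < 14:
        raise ValueError("need at least the first 14 bytes of the TCP header")
    # Network byte order: src port, dst port, seq, ack, offset+flags.
    src_port, dst_port, seq, ack, offset_and_flags = struct.unpack(
        "!HHIIH", header[:14]
    )
    flags = [name for bit, name in FLAG_BITS if offset_and_flags & bit]
    return {"src_port": src_port, "dst_port": dst_port,
            "seq": seq, "ack": ack, "flags": flags}

# Hand-built sample: a server on port 443 sending RST+ACK to a client.
sample = struct.pack("!HHIIHHHH", 443, 51515, 1000, 2001,
                     (5 << 12) | 0x04 | 0x10, 0, 0, 0)
decoded = decode_tcp_header(sample)
print(decoded["src_port"], "->", decoded["dst_port"], decoded["flags"])
# 443 -> 51515 ['RST', 'ACK']
```

In Wireshark itself, the display filter `tcp.flags.reset == 1` narrows a capture down to the same kind of packet, which is usually a faster place to start than scrolling.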
Be patient, and explain your defense – Not everyone is a network admin, and a lot of IT professionals don’t necessarily have a good grasp of how networking truly works. So once you’ve gathered your defense, be ready to explain it in a way that the other party will understand clearly. There is a huge difference between saying “I see TCP RST packets” and saying “Looks like the connection is succeeding, but the server-side system is resetting the connection”. Some people won’t want to admit that the problem actually lies with their system either, so be patient and work with them while they figure it out.
Bonus: Know the application – In some of my previous jobs, I was responsible for all systems and applications in the environment, even though I was primarily focused on networking. This experience has helped a ton, because even today I can still speak to how some applications work. I have installed and configured VMware ESX, Windows Server, backup and replication products, and much more. So when an application administrator says they are seeing a particular issue with something, I can troubleshoot more easily, since I have a basic understanding of the applications they’re working with and how those applications communicate. This certainly isn’t a required skill, but it does help speed up troubleshooting efforts and minimize confusion around what’s going on with the application.
Have any other tips you would like to share? Throw them in the comments below!