I was helping someone troubleshoot a network problem last week. As it turned out, the problem was very physical: a light fixture had fallen down and loosened some connections to a switch. Both the existence and the location of that switch in relation to the rest of the network had vanished from institutional memory so much of the testing we were doing didn't make sense and was hard to interpret. In fact, it was so puzzling that I drove on-site with more sophisticated testing equipment, only to find that someone had noticed the fallen fixture before I arrived. Problem identified and solved, but it did prompt me to look for a general "Network Troubleshooting" page here and I was a bit surprised not to find one.
Oh, sure, there is a lot here about network problems, troubleshooting and solutions. I also have a chapter in my Unix/Linux Troubleshooting E-book but even that isn't quite what I wanted. This post attempts to provide that resource. It is a general overview with links to more detailed information.
If you are going to do network troubleshooting, you need basic knowledge and basic tools. If you don't understand basic TCP/IP communications and routing, that's where you need to start. Here are some articles here that can help:
If you aren't sure of your understanding, you might at least review those articles. You might also want to look at these - if you don't understand them, you need to back up and do some review:
- Detective: Gateway Down
- Troubleshooting network connections with arp
- Who said that?
- DNS Troubleshooting
- Slow Firefox DNS Mac OS X Leopard
- Testing for network connectivity in a script
- DNS Puzzler
Experience and Expectations
You also need a little real world experience. For example, I was helping someone with a strange problem over the phone. We checked a ping between the two machines - the response time was 300 milliseconds. That's not particularly fast, but it certainly could be possible. However, these two machines were on the same local network - the response time should have been a fraction of a millisecond, not nearly a third of a second! The funny thing about that was that it was so far off that neither of us caught it instantly - because we were expecting something like .319 ms, seeing 300 ms didn't immediately seem wrong.
Slowness, intermittent problems vs. loss of connectivity
Obviously these are very different problems. However, very slow connections can be misinterpreted as no connection - the client or user may give up before the underlying network actually has. A 300 ms delay on a local network could contribute to that.
If you had no idea that 300 ms is ridiculous, you would have missed something important here. You obviously (I hope it's obvious) have to know the local infrastructure too - is this wireless, 10Mbs, 100 Mbs or gigabit? If you don't know, you can't tell if ping responses are at least in the ballpark.
If ping does seem odd, what does traceroute (Windows "tracerte") show? It might show an unexpected packet path (that's what was happening with the 300 ms pings).
Ping is ICMP, not TCP. Ping can work when TCP protocols do not and vice versa. You may want to test transferring files (especially if transferring files is what brought up the idea that we have a network problem). Again, you have to have some idea of how fast data should travel. It's hard to be precise (too many variables), but tools like this download calculator or charts such as found at Explaining Network Speeds can be helpful.
You might have a physical problem. Test for a bad port on a switch by swapping patch cables or swapping out the whole switch. Wires in the wall can be chewed, burned, stretched or just badly done to start with. This is where you need some inexpensive testing equipment.
That tester on the left cost me a few hundred dollars, but you don't need to get quite that fancy. You need to be able to verify that the ends are properly wired and it's good to have at least minimal length testing capability. You don't need great accuracy; if you know a cable should be about 20 feet long, if your tester says it's 150, you have a problem.
A tone generator is very inexpensive and very helpful for tracing wiring.
If you don't have a tester, a visual inspection can sometimes spot gross errors. If you take a properly made CAT-5 ethernet cable and hold both ends of it so the plugs are side by side, all the colors should run the same left to right. If they run in opposite directions, that's a crossover cable - not what you usually want unless you are connecting two computers to each other.
Incomplete or very wrong wiring is all too common, even from professional installers.
Because we often have mixed 10/100/1000 networks, it's possible for two devices to misnegotiate their connection. See I don't understand all this half/full duplex negotiation stuff for an introduction and overview.
In addition to all of the reasons mentioned so far, we have to add bad NIC hardware and faulty drivers. In addition to outright "bad", I'd also like to mention "cheap": it astonishes me when I see a server loaded with ram, sporting the biggest and fastest cpu and drives, protected with dual power supplies, extra fans and anything else you can think of - and the NIC card is a $10 piece of junk! I've solved many a network problem by replacing junk.
Take advantage of the blinking lights. Every network device has them. If the lights are out or it has orange or red lights, it is possible that either the device is bad, not connected properly, or is not receiving a signal from the network.
Every system, Windows or Linux, has tools that can help you diagnose network problems. Here are the basics you should know about:
Ping: (Windows and Unix/Linux). Note that by default, Unix/Linux ping doesn't stop - you need to either tell it only to do so many pings or interrupt it (usually CTRL-C but often Delete on SCO Unix).
Netstat: (W and U/L). Note that Windows netstat has a "-b" flag which will display the executable responsible for each connection. If you add "-v", it "will display sequence of components involved in creating the connection or listening port for all executables. Before switches were common, we'd be looking for collisions and any other errors and wouldn't be at all surprised to see at least a few. Today, we wouldn't expect to see anything
Netsh:(W) http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/netsh.mspx. This is both a command line tool and a (weak) GUI tool that can be very helpful if you don't have Unix tools available. See Introduction to Netsh .
ifconfig (U/L), ipconfig (W) (also see Mac OS X ipconfig). Win 95/98: winipcfg.
traceroute (U/L), tracerte (W) (also see Windows "pathping").
nslookup(W, UL), dig, host (U/L). See Dig and Host.
lsof (U/L) Specifically, "lsof -i:25" would show you processes using TCP port 25. The -b switch to Windows netstat gives similar information. Although Mark Russinovich's Windows Process Explorer is a more general purpose tool, but it can be useful for network issues; see, for example, The Case of the Slow Logons.
telnet (W, U/L). First, don't confuse using telnet to test networks with running a telnet daemon. Telnet is available even when you are NOT accepting telnet connections (but see the Vista/Win 7 note below). You can use telnet to test network problems:
Special note on Vista and Win 7: Microsoft disables the telnet client on these operating systems. To correct that stupidity:
- Open Control Panel -> Programs.
- Click "Turn windows features on or off" . On the list that appears, check the box "Telnet Client".
- Click OK. You now have "telnet"
Special note on testing ssh, scp, telnet, rcp, ftp and similar protocols: I see people fighting this type of problem often. The very first thing you should do is test to see if it works locally. If you can't ssh to a server that you do have keyboard access to, the first thing you should test is "ssh localhost" ( or "ftp localhost", etc.). If you can't do that, there's no point in looking for firewall or security configuration issues.
If the protocol doesn't have diagnostics like "ssh -v", using "telnet" to the port can tell you whether it is blocked outright or something within the protocol is blocking you. For example. if "telnet somebox 21" returns
Trying x.x.x.x Connect to somebox Escape character is '^]'. 220 (vsFTPd 2.0.5)
you are NOT being blocked by a firewall.
Network printer issues can be approached the same way. LPD/LPR runs on port 515, IPP is port 631, HP is 9100, and some other devices are listed at Print Server Port Numbers
Don't confuse DNS resolver issues with actual network problems. Don't confuse application problems with network problems.
Special note on DNS caching:
ipconfig /flushdns (Windows)
dscacheutil -flushcache (OS X)
/etc/init.d/nscd restart (Linux)
(Note: Unix systems running "bind" are also cacheing DNS)
There are conflicting desires here: some things time out too quickly, some don't time out quickly enough. See Keep in touch (tcp keepalives etc.) for a discussion/
All of the above is basic and simple. Unfortunately, not all networking problems can be found that easily. Sometimes, you need to get down and examine network packets. See, for example, Lan sniffing with a DualComm port mirroring switch, tcpdump and Windump.
As I said at the beginning, this is meant to be a page that will guide you to the things you need to understand to be successful. Troubleshooting is mostly common sense, though apparently it's not all that common. Most troubleshooting failures come from ignorance: you don't know how the arp cache works, you don't undertand DNS, basic subnets, routing or something equally basic. It's rare for problems to require esoteric knowledge - it happens, but it is rare.
I hope this helped get you started with whatever newtworking problem is vexing you. Please feel free to leave comments with suggestions or questions.
You might also be interested in another article I wrote about basic TCP/IP troubleshotting.
(OLDER) <- More Stuff -> (NEWER) (NEWEST)
Printer Friendly Version