When some users suddenly can't get to the Internet, the cause is usually simple to nail down: a firewall is denying them, they have the wrong gateway, their DNS is wrong, or a bad switch is in their path. There are only so many reasons. The solution to this one turned out to be obvious after the fact, but it caused some hair-tearing for a while.
Here's what happened: 15 PCs were relocated to another office. There was no new wiring, no new switches - nothing changed except that the PCs were physically moved. But on Wednesday morning, these users started experiencing connectivity issues. The rest of the company had no problems; it was just these relocated users.
The symptoms were strange: sometimes they'd work for a while and then be unable to get access. It was baffling.
Pings to or from the firewall were erratic. Sometimes there would be no packet loss, sometimes 30%, sometimes 50%, sometimes 100%. My first thought was that the firewall port might be having NIC negotiation issues, so we set it and the switch port it uses to a fixed speed. Amusingly, that seemed to fix the problem for several hours. In retrospect, that was probably coincidence.
But: another odd symptom was that when we'd do test pings from the firewall to machines having trouble, they'd sometimes start to work for a bit.
As "nothing had changed", I was sure we had to be looking at a flaky port somewhere. But it couldn't be the firewall, as no other users had this problem. Something else was wrong, but I couldn't see it. They called in HP to look at their ProCurve switches and that finally gave us the answer: one of the moved PCs had the firewall's IP configured as a secondary IP address!
I'll let their IT guy tell the story:
"We have a combination of workgroup and domain users here, and the domain machines are set to use the DC for DNS. All of the Workgroup users use the Kerio Firewall for DNS. When multiple users first started calling me on Wednesday morning complaining that they were unable to get online, I suspected that perhaps the DC DNS was acting up, so I decided to quickly try adding the Kerio firewall IP to the user's DNS on their NICs."
"Great idea, except that in my haste, it appears that I mistakenly added the Kerio firewall's IP NOT as a secondary DNS, but as a SECONDARY IP address on a user's NIC! (That's bad, right? .....RIGHT!!) (I thrashed myself repeatedly with the jackass boot, and rightfully so!)"
So, the users physically close to that machine could get its MAC address when they arp-ed for the firewall gateway. Why did my setting the port negotiation seem to fix it? I don't know: maybe that user had shut off that machine out of frustration and gone to a long lunch!
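To make the idea concrete, here is a toy model of two machines answering ARP for the same gateway IP. It is only a sketch under loud assumptions: the MAC addresses are made up, the "rogue reply wins with some probability" rule is my simplification, and real ARP caching depends on the operating system, cache timeouts, and the switches in between. It is not a reconstruction of what happened on this network.

```python
import random

# Hypothetical MAC addresses, for illustration only.
FIREWALL_MAC = "00:11:22:33:44:55"
ROGUE_MAC = "66:77:88:99:aa:bb"


def resolve_gateway(rogue_wins_probability, rng):
    """Return the MAC a client caches after ARPing for the gateway IP.

    Assumption: both the firewall and the misconfigured PC answer, and the
    rogue answer ends up in the cache with the given probability.
    """
    if rng.random() < rogue_wins_probability:
        return ROGUE_MAC
    return FIREWALL_MAC


def simulate_loss(pings, arp_refresh_every, rogue_wins_probability, seed=0):
    """Return the fraction of pings sent to the wrong MAC.

    The client re-resolves the gateway every `arp_refresh_every` pings and
    keeps using whatever MAC it cached until the next refresh.
    """
    rng = random.Random(seed)
    cached = None
    lost = 0
    for i in range(pings):
        if i % arp_refresh_every == 0:
            cached = resolve_gateway(rogue_wins_probability, rng)
        if cached != FIREWALL_MAC:
            lost += 1  # frames for the gateway went to the wrong machine
    return lost / pings
```

Calling `simulate_loss(100, 10, 0.3)` gives a loss figure somewhere between nothing and everything, depending on how the refreshes fall. In this simplified model, at least, a duplicate address does not have to be all or nothing.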
This is all fixed now, though there are two things I still don't understand. First, what caused the original problem, before the IT guy made that mistake? Second, why was I sometimes seeing 30% packet drops when pinging one of these machines from the firewall? I would think it would be all or nothing, not partial. Wait - maybe I was unknowingly pinging the machine that had this bad secondary address? Could that cause this sort of symptom? I can't easily duplicate this in my office, so I just don't know. You'd need to set up something like this (pardon my clumsy drawing):
[Drawing of the test setup]
So, in this case it was the initial problem - whatever that was - that caused a "fix" that really broke things. That really clouded the issue for everyone. We should have listed arp tables on the problem machines, but then again the IT guy had probably done that before he went after DNS changes, so why would he do it again? This was just a bad chain of circumstances that caused great confusion.
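Checking ARP tables across several machines is easy to script. The sketch below is one way to do it, not a tool we used: it takes the text output of `arp -a` collected from each machine (how you collect it is up to you) and reports any IP address that shows up with more than one MAC. The function names are my own, and the parsing assumes each entry puts a dotted IP and a six-pair MAC (with `:` or `-` separators and leading zeros) on the same line, which is not true of every system.

```python
import re
from collections import defaultdict

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}")


def parse_arp_table(text):
    """Map each IP found in `arp -a` style output to its normalized MAC."""
    table = {}
    for line in text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if ip and mac:  # skip headers and lines without a full entry
            table[ip.group(0)] = mac.group(0).lower().replace("-", ":")
    return table


def find_conflicts(tables):
    """Find IPs that different machines resolve to different MACs.

    `tables` maps a machine name to its raw ARP output. The result maps
    each conflicting IP to {mac: [machines that cached that mac]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, text in tables.items():
        for ip, mac in parse_arp_table(text).items():
            seen[ip][mac].append(host)
    return {
        ip: {mac: hosts for mac, hosts in macs.items()}
        for ip, macs in seen.items()
        if len(macs) > 1
    }
```

If the gateway's IP comes back with two different MACs, one from a working machine and one from a problem machine, you know to go hunting for the second MAC rather than for a flaky port.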
More Articles by Anthony Lawrence © 2015-04-07 Anthony Lawrence