APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

New Dell machine kills server

A new machine kills an old server (network problem). The customer said the machine had its correct IP address, but just wasn't accessing network resources.




A few days ago I had an email from a longtime customer telling me that she had been trying to get a new Dell configured but it just couldn't seem to see the network. Her email said the machine was getting an IP address, but just wasn't accessing network resources. She said she'd had enough for the day, but would call me in the morning. I assumed this would be some variation of a Windows authentication problem and didn't give it any more thought.

The next day's email brought a new message: mysteriously the machine had fixed itself overnight. Everything was fine, have a nice day, and so on. OK, great: a lot of problems do fix themselves, though I thought this was a little bit odd considering the symptoms described. Oh well, I had plenty of other work to do, including a programming project that has been incomplete for several weeks. I had just logged into that customer's machine to get reoriented with that code when the phone rang.

It was the customer with the misbehaving Dell. A new problem, she said: all the remote desktop clients are down. She had rebooted the Terminal Server, but clients still could not connect. I confirmed that by trying to connect from my Mac - no dice. I could ssh in to their Linux box though, so it wasn't their internet connection. Time to dig deeper.

After logging into Linux, I tried pinging the Terminal Server. No response. Unreachable. Dead. I told my customer that. "But it thinks everything is fine", she said, "except that no packets are flowing."

Ok, maybe we have a bad switch port. I had her unplug the cable. The server immediately noticed that the cable was unplugged, but plugging it into a different port didn't help. I had her try a different switch entirely; no change. Hmmm.

Well, this server has another NIC that we don't currently use. I had her unconfigure the current card and transfer the IP address to the other card. We switched the cable, but no change. Unreachable. Dead. Maybe a bad cable? Unlikely since it noticed plug/unplug events, but worth a try. I was about to suggest that when the customer said "It must have something to do with the new Dell".

Honestly, that seemed unlikely unless she had tried to configure that machine with the same IP address or the same network name. But I knew she hadn't. I asked her if that machine was running. She said no, but it was still plugged into the network. What the heck, unplug it, I said, not expecting any change from that action. To my surprise, the moment she unplugged it, the server responded to a ping. Plug it back in, no response. Unplug, all was fine. I tried the Remote Desktop; it came right up. Consistently repeatable, no question about it: the problem was this new Dell - the machine that wasn't even running! She unplugged it again because users needed to get work done. As our work was solving this problem, we booted the new Dell (leaving it unplugged from the network) to see what we could see.

My immediate suspicion was that this card was a one in a million incorrect MAC address. Hardware addresses are supposed to be unique but screwups can happen, so I wanted to know what that new machine thought its NIC hardware address was. I knew what the Terminal Server's address was from "arp -an", so I just needed to get it from the new box.
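Comparing the two addresses by eye is easy to get wrong, because Linux prints MAC addresses with colons in lower case while Windows prints them with dashes in upper case. Here is a minimal sketch in Python of that comparison. The sample output lines and the addresses in them are made up for illustration, and the helper names (`normalize_mac`, `macs_in`) are hypothetical; nothing here is captured from the customer's network.

```python
import re

# Matches a MAC written with ':' or '-' separators, e.g. 00:11:22:aa:bb:cc
MAC_RE = re.compile(r"\b[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5}\b")


def normalize_mac(mac):
    """Lower-case a MAC and use ':' separators so both forms compare equal."""
    return mac.lower().replace("-", ":")


def macs_in(text):
    """Return the set of normalized MAC addresses found in command output."""
    return {normalize_mac(m) for m in MAC_RE.findall(text)}


# Hypothetical sample output; these addresses are invented for the example.
arp_output = "? (192.168.1.10) at 00:0b:db:12:34:56 [ether] on eth0\n"
ipconfig_output = "        Physical Address. . . . . . . . . : 00-13-72-AB-CD-EF\n"

# Any address present in both outputs would be a duplicate MAC.
clash = macs_in(arp_output) & macs_in(ipconfig_output)
print("duplicate MAC:", sorted(clash) if clash else "none")
```

In practice you would paste the real "arp -an" output from the Linux box and the real "ipconfig /all" output from the new machine into the two strings.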

Stupid #$!@% Windows! If the cable is unplugged, you can't get XP to give you the status of the connection. Device Manager doesn't bother to tell you that data at all, so that's unhelpful. Fortunately you can still get to a command line and "ipconfig /all" will give you the physical address. Idiots.

Anyway, that wasn't the problem. This machine really does have a unique and proper MAC address. So that's not it. I suppose it could be putting out incorrect voltage on the line and that is leaking to disrupt the server if its wiring is close by, but experimenting with that by moving the machine is just going to interrupt more work so we decided to let it be. I told her she could go buy another NIC, but that this could be a motherboard problem that might manifest itself somewhere else later, so my best advice was to get Dell to replace it. She agreed, though the employee who had been suffering with an old Windows 95 machine for years wasn't happy to see her new toy disappear so suddenly. But her old machine regained its place, and the network remained happy.

When all else fails, start unplugging. After last week's bad storm here in the Northeast, I had a similar case where a server wouldn't come up because it insisted that it saw a duplicate name on the network. The customer checked every machine; there were no conflicts. I then had her unplug all network cables except those of the three servers. Rebooting the troubled server still gave the same message. We unplugged the other two servers. No change. In desperation, I had her unplug the router also. Still no change. At this point, there was nothing connected to the switch but this server. I had her move the cable to another switch, but the reboot still complained. Obviously there was something wrong with the card: it was seeing itself! We swapped in a new NIC, and the problem went away.

Bad NICs can do very strange things.





4 comments




© Anthony Lawrence







Wed Feb 1 15:03:05 2006: 1601   BigDumbDinosaur


I told her she could go buy another NIC, but that this could be a motherboard problem that might manifest itself somewhere else later, so my best advice was to get Dell to replace it.

This is actually a fairly common problem with Dell boxes and is caused by a defect in the transceiver that generates and receives the LAN signals. We usually "fix" the problem by disabling the onboard LAN hardware (which is low quality to begin with) and installing a 3Com 3C905 type NIC. The owner is happy because s/he sees a substantial improvement in the machine's network performance, doesn't have the box out of service for several weeks waiting on Dell to R&R the motherboard and doesn't spend a small fortune to get the machine fixed. There's little likelihood of an outright motherboard failure, although if the machine is under warranty there's no reason to put more of your own money into it to fix a factory defect.

BTW, your client got what she "deserved" by buying Dell. I'm sure the low price was the deciding factor. You couldn't give me a Dell, let alone convince me to actually pay for one. We regularly get calls from Dell owners whose machines have simply quit working due to hard drive and power supply failures. In all cases, they are out of warranty but not by much. An inferior product, built from inferior parts, using an inferior processor, loaded with an inferior operating system (Win XP) and backed by inferior service. Repeat after me: there is no such thing as a good, cheap computer.



Wed Mar 8 14:37:02 2006: 1747   anonymous


We had this problem and we found a solution. It seems the BMC features of the Dell servers provide SNMP and remote control features even when the machine is powered off. To do this they keep the network interfaces alive. These interfaces in some cases have a different set of MAC addresses than the "real" NIC interfaces.

We found this using arping under Linux: the first reply would come back with the correct MAC, and subsequent replies would have the MAC increased by 2.

In our case we had 3 servers which always replied with the correct MAC. Looking at the BMC controller showed its MAC matched the MAC of the NIC. In another case, on 3 identical servers, the BMC controller MAC addresses were exactly 2 higher than the real NICs. This was throwing our router into confusion and causing connections to drop.
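The check described above (one IP address answered by more than one MAC) can be sketched in Python. This is only an illustration: it assumes arping reply lines of the form "Unicast reply from IP [MAC]  time", as printed by the iputils version of arping, and the sample text, addresses, and function names are hypothetical rather than taken from the servers in question.

```python
import re

# Assumed reply format: "Unicast reply from 10.0.0.5 [00:11:43:AA:BB:10]  0.7ms"
REPLY_RE = re.compile(r"reply from (\S+) \[([0-9A-Fa-f:]{17})\]")


def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer so MACs can be subtracted."""
    return int(mac.replace(":", ""), 16)


def macs_by_ip(arping_output):
    """Map each replying IP to the set of MAC addresses that answered for it."""
    seen = {}
    for ip, mac in REPLY_RE.findall(arping_output):
        seen.setdefault(ip, set()).add(mac.lower())
    return seen


def conflicts(arping_output):
    """Return {ip: sorted MACs} for every IP answered by more than one MAC."""
    return {
        ip: sorted(macs, key=mac_to_int)
        for ip, macs in macs_by_ip(arping_output).items()
        if len(macs) > 1
    }


# Hypothetical capture; the addresses are invented for the example.
sample = """\
ARPING 10.0.0.5 from 10.0.0.1 eth0
Unicast reply from 10.0.0.5 [00:11:43:AA:BB:10]  0.7ms
Unicast reply from 10.0.0.5 [00:11:43:AA:BB:12]  0.8ms
Unicast reply from 10.0.0.5 [00:11:43:AA:BB:12]  0.8ms
"""

for ip, macs in conflicts(sample).items():
    offset = mac_to_int(macs[-1]) - mac_to_int(macs[0])
    print(f"{ip} answered by {macs} (highest - lowest = {offset})")
```

An IP that shows up in the result of `conflicts` is being claimed by two interfaces, which is the symptom described here.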

The fix was to set the IP address of the BMC controller (in the BIOS) to something different from that of the actual machine. The other option of course was to disable the BMC controller (SNMP).

We wasted months trying to troubleshoot machines which would simply disappear and reappear on the network at random... calling tech support was no help. We figured this one out on our own.







Wed Mar 8 14:38:47 2006: 1748   TonyLawrence

Great! Thanks for sharing!



Sun Apr 5 02:11:22 2009: 5995   oldmacminiman

What a bizarre circumstance ... never heard of adding a machine to the network, and have it take down the Terminal Server.

Dells, it seems, have become far crappier in the past few years. Every now and then a good build comes along, I guess ... I have a five-year-old Dell Inspiron 5100 laptop that's still running strong ...
