(Finnish translation by Jukka Paulin also available)
The very first part of troubleshooting is identifying the problem.
That's not always easy even for skilled professionals. It's
definitely not easy for the typical computer user, so when you get
the call (we're assuming that you are the professional who gets
called with problems), what you are told may not match reality.
This isn't to imply that users are stupid, or ignorant, or careless
(though some are all of those things), but simply that they may
misinterpret symptoms and miss seeing the real problem.
Professionals do the same thing. In my career I've had more than one telephone call where someone describes themselves as a competent Windows administrator but apologizes for not "knowing Unix". Sometimes we end up having an easy conversation where the problem really is simply that they need a little (sometimes a very little) Unix guidance to help them fix their issue. Sometimes it's a little more involved: they've hit a tough nut and they would have needed years of experience to have any hope of fixing things.
Sometimes it's not like that at all. More than once the immediate problem was a dead, non-booting machine. I don't mean that Linux or Unix was trying to load and failing along the way; I mean that you could push the power button and the lights would come on and that was it. Nothing more. No BIOS display, no disk spin-up, no beeps, nothing. Just dead. And yet here we have a supposedly competent Windows support person asking me what to do. What does that have to do with me? It's not a Unix issue - we haven't got that far yet. It might become a Unix issue: if the hard drive has been damaged by whatever caused the stubborn nothingness being seen now, we might need a data recovery firm with knowledge of Unix/Linux file systems. Even if it's just a missing boot sector, repairing that certainly requires OS-specific knowledge. But right now? This is a low-level hardware issue. Maybe a failed motherboard, a failed power supply, or missing or unplugged cables. Whatever is going on, right now it has nothing to do with Linux or Unix.
If you are dealing with a user who is not computer savvy, remember that they may not understand things that seem obvious to you. For example, the user may understand that the hard drive stores the operating system and files, but may not realize that the initial BIOS information that flashes by at boot doesn't come from there. So while you would think very differently about a machine that displays BIOS information but does not continue versus a machine that displays no BIOS data at all, the user of that machine might not. You need to interpret problem reports with an eye toward the reporter's knowledge.
But you know that. You also know that if it is your suspicion that somebody did something they shouldn't have, the user may not be willing to admit to it. You are going to take everything with a big helping of salt, and decide for yourself what the problem is. After all, you are the professional.
OK. But professionals also misinterpret things. You probably know this too: what you think you know can hurt you more than what you don't know. Do I assume too much? Maybe so: I know I make mistakes like that, and I've sure seen other people do the same thing, but you could be different. If so, you can either skip the rest of this post or read it with relish while you savor your superiority.
I can remember the first bad troubleshooting mistake I made. It cost me a good customer - not because they were angry with me, but because I had them switch to hardware and software I did not support. I advised them to switch because I thought the OS and hardware they were running on had reached their performance limits. They were running a product called Glovia on a SCO Unix 80386 box. There were only about twenty users, but the background code had been getting more and more complicated over the years, and the system was slowing down badly. I tried increasing swap, adding more memory, and everything else that I could think of, but it kept getting worse. As their Glovia programmer was constantly adding new features, I assumed that these new routines were simply overtaxing the system. Heck, I could see it in the sar reports: both the cpu and disk i/o were under excessive load. Basically, I just gave up and agreed with the advice they had received from Glovia and their programmer: upgrade to a big HP/UX system. They did, and performance returned to acceptable levels, but because I didn't know much about HP/UX, another consultant took over my position. I felt good about it overall: I had done the right thing, and I had more clients than I needed anyway. All parties pleased, time to move on.
But I was very, very wrong. I assumed the increasing load was from the heavy new tasks being added weekly, so I just didn't look far enough. I had done some "ps" runs, but had missed seeing something very important. The clutter of Glovia processes blinded me: I didn't see the big lumbering elephant in the crowd of dancing lambs. What I missed was an MMDF process called "deliver". I probably missed it because I was looking for processes that were gaining time right now: I'd take two "ps" snapshots and "diff" them (this procedure is covered in more detail elsewhere). The processes that popped out had used cpu time between the snapshots. If I had been lucky, "deliver" would have been in that list, but my timing was unfortunate: although "deliver" was using a lot of cpu overall, it didn't happen to be using any at the times I happened to look.
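The snapshot-and-diff idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it expects text in the shape produced by a modern `ps -eo pid,time,comm` (not whatever the SCO box printed), the function names are made up for this sketch, and the sample snapshots, including the `glovia_worker` command name, are invented to show the blind spot rather than taken from the real system.

```python
def parse_cpu_time(field):
    """Convert a ps TIME field ('1-02:03:04', '02:03:04' or '03:04') to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in field.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def parse_snapshot(text):
    """Map pid -> (cpu_seconds, command) from 'ps -eo pid,time,comm' style text."""
    snapshot = {}
    for line in text.strip().splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything unparseable
        pid, cpu, command = fields
        snapshot[int(pid)] = (parse_cpu_time(cpu), command.strip())
    return snapshot


def cpu_gainers(before, after):
    """List (delta_seconds, pid, command) for processes that gained cpu time."""
    gainers = []
    for pid, (cpu_after, command) in after.items():
        cpu_before = before.get(pid, (0, command))[0]  # new pids start from 0
        delta = cpu_after - cpu_before
        if delta > 0:
            gainers.append((delta, pid, command))
    return sorted(gainers, reverse=True)


# Invented sample data: "deliver" is idle between the two snapshots.
SNAP_A = """
  PID     TIME COMMAND
  101 00:12:40 glovia_worker
  102 00:03:05 glovia_worker
  250 09:41:17 deliver
"""
SNAP_B = """
  PID     TIME COMMAND
  101 00:12:52 glovia_worker
  102 00:03:06 glovia_worker
  250 09:41:17 deliver
"""

for delta, pid, command in cpu_gainers(parse_snapshot(SNAP_A), parse_snapshot(SNAP_B)):
    print(f"{command} (pid {pid}) gained {delta}s of cpu")
```

With this sample, only the two application processes are reported. The process with by far the most accumulated time never appears, because a diff shows only what moved during the window you happened to pick.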
I know this because I accidentally saved some printouts from that system. For some reason I had tucked them in my briefcase and forgotten all about them. When I found them a few years later while searching for something else, I happened to take a quick glance, recognized where they were from, and immediately had an awful feeling in the pit of my stomach. I got that feeling because I noticed "deliver" and saw that it had a lot of accumulated time. In the intervening years, I had seen that at other SCO Unix jobs, and I knew what it meant. It meant that there were thousands, perhaps tens of thousands, of mail messages backed up on the system. That "deliver" process was trying desperately to run through them to see if they could now be delivered. It would do a lot of disk i/o and consume a lot of cpu in the process, and then it would go away until it was scheduled to run again. Eventually there were so many messages that it was almost always running - except when I ran my snapshots, of course. Just my luck, I guess - or more likely I just didn't run enough of them because I saw all those Glovia processes and "knew" they had to be the problem.
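What the printouts revealed was accumulated time, not time gained between two looks. A complementary check, sketched here in Python, is to rank commands by total cpu time in a single listing. As before, this assumes text shaped like modern `ps -eo pid,time,comm` output; the helper names and the sample listing (including the `glovia_worker` command name) are hypothetical.

```python
from collections import defaultdict


def cpu_seconds(field):
    """Convert a ps TIME field ('1-02:03:04', '02:03:04' or '03:04') to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in field.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def accumulated_by_command(ps_text):
    """Return (command, total_cpu_seconds) pairs, heaviest command first."""
    totals = defaultdict(int)
    for line in ps_text.strip().splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything unparseable
        totals[fields[2].strip()] += cpu_seconds(fields[1])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented sample listing, not data from the real system.
SAMPLE = """
  PID     TIME COMMAND
  101 00:12:52 glovia_worker
  102 00:03:06 glovia_worker
  250 09:41:17 deliver
"""

for command, total in accumulated_by_command(SAMPLE):
    print(f"{total:>8}s  {command}")
```

Sorting by the total puts "deliver" at the top of this sample even though a snapshot diff would have skipped it. Neither view is sufficient alone: the diff shows what is busy now, the total shows what has been busy over time.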
Why didn't the customer notice backed-up email? Because it was root's messages that were being delayed (due to a lock file on root's mail folder) and nobody cared about root's mail. Users' mail of course was getting slower and slower, but I took that as a symptom rather than as something closer to the cause. My loss: I saw what I expected to see, I didn't see anything else, and I gave away a good account years before I had to.