The Art of Troubleshooting
by Jemshad O K
Yes, troubleshooting is an art! The keys to mastering this art are knowing the system inside out, using the right tools and, of course, googling. Troubleshooting is not something that can be spoon-fed or taught as a set of precise steps. It has to evolve from logical thinking and a thorough knowledge of the system.
This article is no capsule which you can swallow and then say "I know troubleshooting". It is just what I think will help you in debugging a problem. What is written below is mostly from my experience over the past few years.
The first step in debugging any problem is knowing the problem well. Once we have found the problem, it's time to fix it. But how? This is where using the right tools, and knowing how to use them effectively, becomes important. In the *nix world, system administration means working with a lot of commands and piping them together effectively. Knowing the tools is not the only requirement, though. We should also know where to point them - the source of data. And if everything else fails, let's hope we are not the only ones facing the problem, and google it! We will cover each of these in some detail, with examples wherever possible.
As I said earlier, understanding the problem is the key aspect of troubleshooting. Without knowing the actual problem, we will just roam around with no result. Suppose we run a script and it fails with an error message. That error message is not the problem, and we should not be chasing it. Instead, we should find out what caused the script to print it. The next section covers some tools which will help you here.
Data we have:
- A script
- An error message we get when running the script
Now our first job is to find out why the script resulted in an error. 'grep' is a very good command to start with: search the script for the text of the error message. So we do:
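A minimal sketch of that search. The throwaway script and the 'username not found' message are stand-ins made up for illustration; substitute your own script and error text.

```shell
# Throwaway stand-in for the failing script (hypothetical content).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
if ! id "$1" >/dev/null 2>&1; then
    echo "username not found"
    exit 1
fi
EOF

# -n prints the line number of each match; -F treats the message as a
# fixed string, so characters like '.' or '*' are not read as a regex.
match=$(grep -nF 'username not found' "$script" || true)
echo "$match"
rm -f "$script"
```

A match with a line number tells us exactly where in the script to start reading.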
If we get a result, our job is fairly simple: go through the script, find which condition is failing and fix it. In practice, though, this kind of grep usually won't return anything, which means the error message does not come directly from the script. This is why knowing the problem matters. Suppose the error message was something like 'username not found', but checking shows that the username does exist. Then a nonexistent username is not our 'problem'; it's something else which we have yet to find.
Now take the second case: grep didn't give any output. Then we know that the script is calling some other script or binary which may be giving the error. If it's a small script, locating the call is easy. But if we have a large script with a lot of branches, we have to introduce checkpoints (put some print lines here and there to find the exact location which gives the error).
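The checkpoint idea can be sketched as follows. The helper name checkpoint and the two "steps" are hypothetical; the second step fails on purpose so that the last marker printed shows where things went wrong.

```shell
# checkpoint is a hypothetical helper: it writes a marker to stderr so the
# markers do not mix with the script's normal output.
checkpoint() {
    echo "CHECKPOINT: $1" >&2
}

# Stand-in for a long script; the second step fails deliberately.
run_steps() {
    checkpoint "before step one"
    true
    checkpoint "before step two"
    ls /nonexistent/path/for/demo >/dev/null 2>&1 || return 1
    checkpoint "after step two"
}

trace=$(run_steps 2>&1) || true
last=$(printf '%s\n' "$trace" | tail -n 1)
echo "last checkpoint reached: $last"
```

The error lies between the last checkpoint that was printed and the first one that was not.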
After pinpointing the portion of the script that gives the error, if it comes from another script called by the original, we follow the same steps as before. In the case of a binary, to make sure it is really the one giving the error, we proceed with a second beautiful command - 'strings'.
strings(1) prints the strings of printable characters in files.
Example: suppose our error message was 'user joe does not exist' and the program is /bin/su. At times we have to take only the fixed part of the error message and drop the variable part. So in the above case, we remove 'joe' and search for 'does not exist' only.
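A sketch of that check. To keep it runnable anywhere, it builds a small stand-in file with a message template between non-printable bytes. The template text is an assumption for illustration, not the real contents of su. On a real system, point bin at the program itself (for example /bin/su).

```shell
# Stand-in "binary": non-printable bytes around a message template.
bin=$(mktemp)
printf '\001\002\003user %%s does not exist\000\004\005' > "$bin"

found=""
if command -v strings >/dev/null 2>&1; then
    # Search only for the fixed part of the message; the user name varies.
    found=$(strings "$bin" | grep -F 'does not exist' || true)
fi
echo "$found"
rm -f "$bin"
```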
If the message shows up in the output, we have confirmed that the error comes from the binary we guessed. To proceed from here, we need a few more powerful tools.
We now move on to a few more complex, but very helpful, commands. Of these, strace is the one I prefer most, and it has helped me a lot. In my opinion, every sysadmin should know how to use strace.
strace is used to trace system calls and signals. With a bit of practice, strace can be used effectively to find what is going wrong and where.
Sample output of strace command. I am pasting the relevant portion only.
As we all know, ordinary users are not allowed to view the /etc/shadow file, and strace shows exactly that. One thing to note here is to run strace as the correct user. Some applications, though they start running as root, switch to another user before doing their work. So we must run strace as that same user. If it's a user without a shell, we can do that by first switching to that user with a shell specified.
With su, giving -s /bin/bash and the user name nobody puts you in a shell as user nobody, with bash as the shell. Now execute strace from this shell. Errors may show up this way that do not appear when we run as root, since root is the superuser and is allowed to do almost anything without restriction.
An EACCES (Permission denied) result on an open call shows that the user does not have permission to open the file even for reading. For a missing file or directory, strace reports ENOENT (No such file or directory) instead.
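Both kinds of failure can be picked out of a saved trace by errno name: EACCES for permission denied, ENOENT for a missing file or directory. A minimal sketch, assuming the trace was saved with strace's -o option; failed_calls is a helper name made up here.

```shell
# failed_calls is a hypothetical helper. It reads a trace saved with
# `strace -o FILE ...` and prints only the calls that failed with
# "permission denied" (EACCES) or "no such file or directory" (ENOENT).
failed_calls() {
    grep -E 'EACCES|ENOENT' "$1"
}
```

For example, run `strace -o /tmp/trace.log some_command` and then `failed_calls /tmp/trace.log`, where both the log path and some_command are placeholders. Expect some harmless ENOENT lines too, since programs routinely probe for optional files.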
Another important use of strace is finding out which configuration file a binary is using. When we install something from source (./configure; make; make install), the files may get installed to various locations and may not always be traceable. strace helps here as well.
In the following example, we find out the configuration file for sshd.
and in the output, we see
In this case, it's an rpm installation and the path is the default one. But this helps with source installations.
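A sketch of how the opened files could be pulled out of a saved trace. It assumes the trace was written with strace -o (optionally with -f, which prefixes each line with a pid); opened_files is a made-up helper name.

```shell
# opened_files is a hypothetical helper. From a trace saved with
# `strace -o FILE ...` it lists every path that was opened successfully.
opened_files() {
    grep -E '^([0-9]+ +)?open(at)?\(' "$1" |   # keep open()/openat() calls
        grep -v ' = -1 ' |                      # drop the ones that failed
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' |   # extract the quoted path
        sort -u
}
```

For example, `opened_files /tmp/sshd.trace | grep -i conf` narrows the list to likely configuration files, where /tmp/sshd.trace is a placeholder for wherever the trace was saved.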
strace can also be used to trace an already started program.
Note that for programs with multiple instances (forked-off children), this pid must be that of the main instance.
As you can see, the first instance in this apache is the one running as root with pid 2481. So, to debug that, do:
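A sketch of finding that main instance from a script instead of by eye. The helper name root_instances is made up, and httpd is assumed to be the process name; it may be apache2 or something else on your system.

```shell
# root_instances is a hypothetical helper: from "pid user command" lines
# on standard input, print the pids of NAME that are running as root.
root_instances() {
    awk -v name="$1" '$2 == "root" && $3 == name { print $1 }'
}

# The trailing '=' after each column name suppresses the ps header line.
ps -eo pid=,user=,comm= | root_instances httpd
```

The pid it prints is the one to hand to strace's -p option, for example `strace -p 2481` for the instance mentioned above.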
In the above example, we saw how to use strace on a running program. But in the case of apache, though there is one main instance running, apache forks off a lot of other instances as well. The -f option of strace deals with them. If one of the forked child processes is giving the error, strace with -f catches it.
In the first case, we use strace on an already running binary, and -f traces the child processes created by the fork() system call as and when they are created.
The second case is similar, except that here strace is used to start the binary.
strace -o output.log writes the trace to a log file. This can be helpful when dealing with programs that run for a while before giving an error.
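A small runnable sketch combining -f and -o, assuming strace is installed and permitted to trace (it is skipped otherwise). The traced command is a stand-in: a shell that starts a child which tries to read a file that does not exist.

```shell
log=$(mktemp)
hits=""
if command -v strace >/dev/null 2>&1; then
    # -f follows the child that sh creates to run cat;
    # -o keeps the trace out of the program's own output.
    strace -f -o "$log" sh -c 'cat /nonexistent/demo/file; true' >/dev/null 2>&1 || true
    # The failing open happens in the cat child, so it is in the trace
    # only because of -f.
    hits=$(grep -E '^([0-9]+ +)?open(at)?\(.*demo/file.*ENOENT' "$log" || true)
fi
echo "$hits"
rm -f "$log"
```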
gdb is a very powerful debugger which is mostly used for debugging the core files produced by programs which segfault. These days, though, most programs won't dump core, either because our shells are set up that way (ulimit can change the behavior for core files) or because the program is designed not to. Even then, gdb can give clues about the problem we are troubleshooting. To start a program under gdb, do
Note that the command line arguments are not specified here. We specify them later.
At this prompt, we give run followed by arguments to be passed to the binary.
This runs the program straight through. To stop it and wait at a given point, we should set a breakpoint before running it. Suppose we know the name of a function up to which the program executed normally (from the output of strace); then we do
gdb runs the program up to that function and waits there. We can then issue commands to step through the program line by line. Please note that when we specify the break function before the program starts, it should be part of the main binary; functions in shared libraries are not loaded until the program runs.
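The same sequence (load the program, set a breakpoint, run with arguments) can also be driven without the interactive prompt. A sketch, assuming gdb is installed; backtrace_at is a made-up helper name, and it prints a backtrace at the breakpoint rather than stepping line by line.

```shell
# backtrace_at is a hypothetical helper: run a program under gdb, stop at
# FUNCTION and print a backtrace.
# Usage: backtrace_at FUNCTION PROGRAM [ARGS...]
backtrace_at() {
    func=$1
    shift
    # -batch exits once the commands are done; each -ex is one gdb command;
    # everything after --args is the program and its arguments.
    gdb -batch -ex "break $func" -ex run -ex bt --args "$@"
}
```

For example, `backtrace_at main /bin/ls /tmp` stops at main in ls and prints where it is.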
Here is a typical situation we have to debug: chrooting to a virtual host on an Ensim server.
As you can see, both bin/bash (the site's chrooted bin) and /bin/bash are there. Still, the chroot program gives an error saying bash is not there.
These are the lines from strace
Of course, this is not of much use, because it also says /bin/bash was not found when it is there. So we move to gdb.
One point worth noting here is this line:
The part we should watch carefully is 'cannot execute /bin/bash'. Only after that does it say 'No such file or directory'.
Here also we don't have much luck, but we know it's something related to library files. So we use the tool 'ldd' to find out more about the libraries and their linkage with the binaries.
We all know that Linux binaries are mostly dynamically linked and depend on a lot of shared libraries to work. The ldd command lists the library files on which a binary depends.
We used ldd on bin/bash and not /bin/bash, since chroot will be using bin/bash (remember, we are inside the directory to be chrooted). The command listed all the library files on which bin/bash depends.
Now we check the sanity of these library files one by one.
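One way to automate that check is sketched below. It assumes ldd prints lines in its usual "name => /path (address)" form; missing_libs and its two arguments are names made up here.

```shell
# missing_libs is a hypothetical helper.
# Usage: missing_libs ROOT BINARY
#   ROOT   - directory that will become / after chroot ("" for the real root)
#   BINARY - path of the binary as seen inside ROOT, e.g. /bin/bash
missing_libs() {
    root=$1
    binary=$2
    ldd "$root$binary" 2>/dev/null |
        # every field that starts with "/" is a library path
        awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }' |
        while read -r lib; do
            # the library must exist at the same path under ROOT
            [ -e "$root$lib" ] || echo "missing: $root$lib"
        done
}
```

From inside the site's directory, `missing_libs "$PWD" /bin/bash` prints every library that bin/bash needs but that is absent under that directory. No output means nothing is missing.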
Though the needed binary is there, its shared libraries are missing! If they were missing from the server-wide /lib or /usr/lib, we would need to find the corresponding packages and install them. In Ensim's case, they are all linked to their TEMPLATE copies. So make the required hard links now.
After making the hard links and also the required soft links (library files have a lot of links pointing to the latest version), we try chroot once more.
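A sketch of the two kinds of links in a scratch directory. The directory layout and the library name libdemo.so.2.0 are invented for illustration and are not Ensim's actual paths.

```shell
# Scratch area standing in for the template and site trees (hypothetical).
work=$(mktemp -d)
mkdir -p "$work/template/lib" "$work/site/lib"
echo "library contents" > "$work/template/lib/libdemo.so.2.0"

# Hard link: the site's file shares the inode of the template copy
# (both must live on the same filesystem).
ln "$work/template/lib/libdemo.so.2.0" "$work/site/lib/libdemo.so.2.0"

# Soft link: the short name points at the versioned file. The target is
# relative, so it still resolves after chroot.
ln -s libdemo.so.2.0 "$work/site/lib/libdemo.so.2"

# -i shows inode numbers; the hard-linked file has a link count of 2.
ls -li "$work/site/lib"
# Remove "$work" when done experimenting.
```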
Now all is perfect!
Up to this point, it was all about doing it ourselves. As we all know, no one is perfect or complete. If we have done all we can and still don't have the solution, then hope for the best - we may not be the only ones with the same problem.
The Internet is a vast collection of information, and search engines like Google help us locate the needed information in a short time. I personally consider Google the best search engine. Still, there are some problems. Google is not an intelligent robot; it can't guess what is in your head or what you are looking for. You must present your question in the 'most obvious way' without losing the context information. That is where 'effective googling' comes in.
Google has a lot of keywords that can fine-tune our search. For example, to restrict the search to a single site, we can specify site:site_name along with the search keywords. To search for a phrase of more than one word, we can enclose it in quotes. Google normally drops common words like 'and', 'or', 'when' etc. from the search pattern. To force such a word to be included, put a '+' just before it (+word_to_be_included).
This link may be helpful.
At times, searching with the exact error message won't help. Then we have to work out what the 'real problem' is and search for that.
Though articles like this can give guidelines, experience does matter here. The more experience you have working with Linux machines, the faster you will get to the solution. The difference between the experienced and the rest is the direction of their thinking. Knowing more about the system means you can easily guess the problem areas and concentrate your thinking there. That said, there are instances where none of the above steps work; then we are probably the first to hit such an error. In that case, someone who regularly deals with the machine has to backtrack through the recent tasks done on it. Any newly installed software, configuration changes or anything like that could be the root cause.
Jemshad OK worked for close to 3 years at Bobcares.com, a tech support company for web hosts and ISPs. He now works at Yahoo.
© 2011-03-19 Jemshad OK