The Art of Troubleshooting

by Jemshad O K

1 Introduction

Yes, troubleshooting is an art! The keys to mastering this art are knowing the system inside out, using the right tools and, of course, googling. Troubleshooting cannot be spoon-fed or taught as a precise sequence of steps. It has to evolve from logical thinking and thorough knowledge of the system.

This article is not a capsule that you can swallow and then say "I know troubleshooting". It is simply what I think will help you in debugging a problem. What is written below comes mostly from my experience over the past few years.

The first step in debugging any problem is knowing the problem well. Once we have found the problem, it's time to fix it. But how? That is where using the right tools, and knowing how to use them effectively, becomes important. In the *nix world, system administration means working with a lot of commands and piping them together effectively. Knowing the tools is not the only requirement, though. We should also know where to point them - the source of data. And if everything else fails, let's hope we are not the only ones facing the problem. So what do we do? Google it! We will cover each of these in a bit of detail, and I will give examples wherever possible.

2 Understanding the problem

As I said earlier, understanding the problem is the key aspect of troubleshooting. Without knowing the actual problem, we will just wander around with no result. Suppose we run a script and it fails with an error message. That error message is not the problem, and we should not be chasing it. Instead, we should find out what caused the script to print it. The rest of this section covers some tools that help here.

2.1 grep

Data we have:

  1. A script
  2. An error message we get when running the script

Our first job is to find out why the script failed. 'grep' is a very good command to start with. So we do:

$ grep 'Error Message' scriptname

If we get a result, our job is fairly simple - go through the script, find which condition is failing and fix it. In practice, though, this kind of grep usually gives no result, which means the error message does not come directly from the script. This is why knowing the problem matters. Suppose the error message was something like 'username not found', but a check shows that the username does exist. Then a nonexistent username is not our 'problem'; it is something else that we have yet to find.

Now take the second case: grep gave no output. Then we know that the script is calling some other script or binary that may be printing the error. If it's a small script, locating the call is easy. But if we have a large script with a lot of branches, we have to introduce checkpoints (put some print lines here and there) to find the exact location that gives the error.
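The checkpoint idea can be sketched with a small, hypothetical script. The `checkpoint` helper, the temporary file and the failing `ls` call are all made up for illustration; the only point is that the last checkpoint printed before the error brackets the failing command.

```shell
# Hypothetical demo: build a tiny script with checkpoints, run it, and see
# which checkpoint was the last one reached before the error appeared.
demo=$(mktemp)
cat > "$demo" <<'EOF'
checkpoint() { echo "CHECKPOINT $1" >&2; }   # print a marker on stderr

checkpoint 1
true                                  # stands in for a step that works
checkpoint 2
ls /nonexistent/path/for/demo         # stands in for the failing command
checkpoint 3
EOF

# Capture stderr only, where both the markers and the error message end up.
trace=$(sh "$demo" 2>&1 >/dev/null)
echo "$trace"

# Remember the latest checkpoint seen; stop at the first non-checkpoint line.
last_before_error=$(echo "$trace" | awk '/^CHECKPOINT/ { cp = $2; next } { print cp; exit }')
echo "error occurred after checkpoint $last_before_error"
rm -f "$demo"
```

For shell scripts, running with `sh -x scriptname` prints each command as it is executed, which gives a similar effect without editing the script.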

After pinpointing the portion of the script that gives the error, if it comes from another script called by the original, we follow the same steps as before. If it comes from a binary, we confirm that the binary really is the source of the message with a second beautiful command - 'strings'.

$ strings binary_name | grep 'Error Message'

strings(1) prints the sequences of printable characters found in files.

Example: suppose our error message was 'user joe does not exist' and the program is /bin/su. Often we have to search for only the constant part of the error message, leaving out the variable part. So here we drop 'joe' and search for 'does not exist' only.

$ strings /bin/su | grep 'does not exist'

user %s does not exist


This confirms that the error message comes from the binary we guessed. To proceed from here, we need some more powerful tools.
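When a script calls several helpers, the same idea extends to asking which of them contains the constant part of the message. The sketch below uses grep with -l (list matching files) and -F (fixed string). The helper files are throwaway stand-ins created only for the demonstration, and for real binaries the strings pipeline shown above remains the more reliable check.

```shell
# Stand-in "helpers" created only for this demonstration.
dir=$(mktemp -d)
printf 'echo "user %%s does not exist"\n' > "$dir/helper_a"
printf 'echo "all good"\n'                > "$dir/helper_b"

# Constant part of the message, with the variable part ('joe') removed.
fragment='does not exist'

# -l prints only the names of files that contain the fragment,
# -F treats the fragment as a literal string rather than a regex.
matches=$(grep -lF -- "$fragment" "$dir"/helper_a "$dir"/helper_b)
echo "$matches"
rm -rf "$dir"
```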

3 Using the right tools

We now move on to a few more complex, but very helpful, commands. Of these, strace is the one I prefer most, and it has helped me a lot. In my opinion, every sysadmin should know how to use strace.

3.1 Using strace to debug a binary

Running a program under strace

strace is used to trace system calls and signals. With a bit of practice, strace can be used effectively to find what is going wrong and where.

Usage: strace command arguments

Here is some sample output. I am pasting only the relevant portion.

$ strace less /etc/shadow (This is just an example to show what the output looks like)

wait4(6711, [WIFEXITED(s) && WEXITSTATUS(s) == 127], 0, NULL) = 6711

stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0

stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0

open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)

write(2, "/etc/shadow: Permission denied\n", 31/etc/shadow: Permission denied

As we all know, ordinary users are not allowed to read the /etc/shadow file, and strace shows exactly that. One thing to note is that strace must be run as the correct user. Some applications start as root but switch to another user before doing their work. So we must run strace as that same user. If it's a user without a login shell, we can first switch to that user with an explicit shell.

$ su - nobody -s /bin/bash

This puts you in a bash shell as the user nobody. Now run strace from this shell. Running like this may show errors that do not appear when we run as root, because root is the superuser and is allowed to do almost anything without restriction.

open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)

shows that the user does not have permission to open the file, even for reading. For a missing file or directory, strace shows something like this:

open("path/to/file", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
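Real strace output is long, so it helps to filter it for failed calls. The sketch below assumes the trace has been saved to a file (here a temporary file holding lines quoted above plus one made-up successful call) and simply keeps the lines whose return value is -1.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0
open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
open("path/to/file", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/etc/passwd", O_RDONLY) = 3
EOF

# Failed system calls end in "= -1 ERRNAME (description)".
failed=$(grep -E '= -1 E[A-Z0-9]+ ' "$log")
echo "$failed"

# Count how often each error code occurs.
summary=$(echo "$failed" | sed -E 's/.*= -1 (E[A-Z0-9]+) .*/\1/' | sort | uniq -c)
echo "$summary"
rm -f "$log"
```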

Another important use of strace is to find out which configuration file a binary is using. When we install something from source (./configure; make; make install), the files may be installed in various locations that are not always easy to track down. strace helps here too.

In the following example, we find out the configuration file for sshd.

$ strace /usr/sbin/sshd

and in the output, we see

open("/etc/ssh/sshd_config", O_RDONLY|O_LARGEFILE) = 3

In this case it's an rpm installation and the path is the default one. But the same approach helps with source installations that use other paths.
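To pull the candidate configuration files out of a long trace, one option is to keep only the open() calls that succeeded and print their paths. This is a minimal sketch, assuming the trace was saved to a file; every sample line except the sshd_config one is invented for the demonstration. Newer systems tend to log openat() instead of open(), so the pattern accepts both.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/usr/local/etc/sshd_config", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/etc/ssh/sshd_config", O_RDONLY|O_LARGEFILE) = 3
openat(AT_FDCWD, "/etc/ssh/moduli", O_RDONLY) = 4
EOF

# Keep open()/openat() calls that returned a file descriptor (not -1),
# then print the path, which is the first double-quoted field.
opened=$(grep -E '^open(at)?\(' "$log" | grep -v -- '= -1 ' | awk -F'"' '{ print $2 }')
echo "$opened"

# Narrow the list down to names that look like configuration files.
config=$(echo "$opened" | grep -E '(\.conf|_config)$')
echo "$config"
rm -f "$log"
```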


Using strace on a running program ( strace -p )

strace can also be used to trace an already started program.

Usage: strace -p pid (where pid is the process id of the program)

Note that for programs with multiple instances (forked-off children), this pid should be that of the main instance.

$ ps -ef | grep http

root 2481 1 0 Sep22 ? 00:00:00 /usr/local/apache/bin/httpd -DSSL

apache 2490 2481 0 Sep22 ? 00:01:48 /usr/local/apache/bin/httpd -DSSL

apache 2491 2481 0 Sep22 ? 00:01:36 /usr/local/apache/bin/httpd -DSSL

As you can see, the main instance of this apache is the one running as root, with pid 2481. So, to debug it, do:

$ strace -p 2481
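Picking the main instance out of the ps listing can also be scripted. The sketch below works on a saved copy of the listing shown above and assumes the `ps -ef` column order (UID, PID, PPID, ...). It treats the httpd process whose parent is pid 1 as the main instance, an assumption that holds for this listing but not necessarily for every daemon.

```shell
listing=$(mktemp)
cat > "$listing" <<'EOF'
root 2481 1 0 Sep22 ? 00:00:00 /usr/local/apache/bin/httpd -DSSL
apache 2490 2481 0 Sep22 ? 00:01:48 /usr/local/apache/bin/httpd -DSSL
apache 2491 2481 0 Sep22 ? 00:01:36 /usr/local/apache/bin/httpd -DSSL
EOF

# Column 2 is the PID and column 3 the parent PID in `ps -ef` output.
main_pid=$(awk '/httpd/ && $3 == 1 { print $2 }' "$listing")
echo "main instance: $main_pid"

# On a live system one would then attach with: strace -p "$main_pid"
rm -f "$listing"
```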

Using strace to chase forked child processes ( strace -f )

In the above example, we saw how to use strace on a running program. But in the case of apache, although there is one main instance running, apache forks off a lot of other instances as well. The -f option of strace deals with them. If one of the forked child processes is giving the error, strace with -f catches it.

$ strace -f -p pid

$ strace -f /path/to/program

In the first case, we attach strace to an already running program, and -f traces the child processes created by the fork() system call as they are created.

The second case is similar, except that strace itself starts the program.

strace -o output.log writes the output to a log file. This is helpful when dealing with programs that run for a while before giving the error.
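When -f and -o are combined, each line in the log file starts with the pid of the process that made the call, so failures can be grouped by process. The sketch below assumes that layout; the sample log lines are invented for the demonstration and only reuse the pids from the ps listing above.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
2481 accept(16, {sa_family=AF_INET, ...}, [16]) = 3
2490 open("/var/www/html/index.html", O_RDONLY) = 4
2491 open("/var/www/html/missing.html", O_RDONLY) = -1 ENOENT (No such file or directory)
2491 stat64("/var/www/html/.htaccess", 0xbfffe0a0) = -1 ENOENT (No such file or directory)
EOF

# Field 1 is the pid; count the failed calls (" = -1 ") for each process.
per_pid=$(awk '/ = -1 / { fails[$1]++ } END { for (p in fails) print p, fails[p] }' "$log")
echo "$per_pid"
rm -f "$log"
```

A child with many failed calls is a good first candidate for a closer look with strace -p.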


3.2 Using gdb for in-depth debugging

gdb is a very powerful debugger, mostly used for debugging the core files produced by programs that segfault. These days, however, most programs won't dump core, either because our shells are set up that way (ulimit can change the behavior for core files) or because the program is designed not to. Even without a core file, gdb can give clues about the problem we are trying to troubleshoot. To start a program under gdb, do

$ gdb -e /complete/path/to/program

Note that the command-line arguments are not specified here. We specify them later.

$ gdb -e /bin/su

GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)

Copyright 2003 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.

Type "show copying" to see the conditions.

There is absolutely no warranty for GDB. Type "show warranty" for details.

This GDB was configured as "i386-redhat-linux-gnu".


At the gdb prompt, we give run followed by the arguments to be passed to the binary.

(gdb) run arg1 arg2

This runs the program without stopping. To make it stop and wait at a point, we should set a breakpoint before running it. Suppose we know the name of a function up to which the program executed normally (from the output of strace); then we do

(gdb) break function (or break line_number)


(gdb) run arguments

gdb runs the program up to that function and waits there. We can then issue commands to step through the program line by line. Please note that break function works most reliably when the function is part of the main binary; a function that lives in a shared library may not be resolvable until the program has started and the library has been loaded.

Here is a typical situation we have to debug: chrooting to a virtual host on an Ensim server.

[root@host fst]# pwd

/home/virtual/site3/fst

[root@host fst]# chroot .

chroot: cannot execute /bin/bash: No such file or directory

[root@host fst]# ls -l bin/bash /bin/bash

-rwxr-xr-x 1 root root 541096 Apr 12 2002 /bin/bash

-rwxr-xr-x 19 root root 541096 Jan 18 2004 bin/bash

[root@host fst]#

As you can see, both bin/bash (the site's chrooted bin) and /bin/bash are there. Still, the chroot program fails, saying bash is not there.

These are the lines from strace

execve("/bin/bash", ["/bin/bash", "-i"], [/* 18 vars */]) = -1 ENOENT (No such file or directory)

write(2, "/usr/sbin/chroot: ", 18/usr/sbin/chroot: ) = 18

write(2, "cannot execute /bin/bash", 24cannot execute /bin/bash) = 24

write(2, ": No such file or directory", 27: No such file or directory) = 27

Of course, this is not much use, because it too says /bin/bash is not found when it is there. So we move to gdb.

One point worth noting here is this line:

write(2, "cannot execute /bin/bash", 24cannot execute /bin/bash) = 24

The part to watch carefully is 'cannot execute /bin/bash'. Only after that does it say 'No such file or directory'.

(gdb) run .

Starting program: /usr/sbin/chroot .

(no debugging symbols found)...Breakpoint 1 at 0x400ee570

(no debugging symbols found)...

Breakpoint 1, 0x400ee570 in chroot () from /lib/

(gdb) next

Single stepping until exit from function chroot,

which has no line number information.

0x08048cca in chroot ()

(gdb) next

Single stepping until exit from function chroot,

which has no line number information.

/usr/sbin/chroot: cannot execute /bin/bash: No such file or directory

Program exited with code 01.

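Here too we don't have much luck. But execve() also reports 'No such file or directory' when the dynamic loader that a binary needs cannot be found, so we know it's something related to library files. We therefore use the tool 'ldd' to find out more about the libraries and their linkage with the binary.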


3.3 Using ldd for finding library dependencies of a binary

We all know that most linux binaries are dynamically linked and depend on a number of shared libraries. The ldd command lists the library files on which a binary depends.

[root@host fst]# ldd bin/bash

=> /lib/ (0x40018000)

=> /lib/ (0x4001d000)

=> /lib/ (0x40020000)

/lib/ => /lib/ (0x40000000)

We ran ldd on bin/bash and not /bin/bash because chroot will be using bin/bash (remember, we are inside the directory to be chrooted into). The command listed all the library files on which bin/bash depends.

Now we check the sanity of these library files one by one.

[root@host fst]# ls -l lib/

ls: lib/ No such file or directory

[root@host fst]# ls -l lib/

ls: lib/ No such file or directory

[root@host fst]# ls -l lib/

ls: lib/ No such file or directory

[root@host fst]# ls -l lib/

ls: lib/ No such file or directory

Though the needed binary is there, its shared libraries are missing! If they were missing from the server-wide /lib or /usr/lib, we would need to find the packages that provide them and install those. In Ensim's case, the files are all hard-linked to their TEMPLATE copies. So we make the required hard links now.
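The one-by-one library check can be scripted. The sketch below reads ldd-style lines on standard input and reports every library path that is absent under a given chroot directory. The function name `missing_in_chroot`, the library names and the temporary directory are all hypothetical, chosen only so the example can run anywhere; on the real server one would pipe in the output of `ldd bin/bash` and pass `.` as the directory.

```shell
# Print each library path from ldd-style input that does not exist
# under the directory given as $1.
missing_in_chroot() {
    root=$1
    # ldd lines look like "name => /path (0xaddr)" or "/path (0xaddr)";
    # pick the first field that is an absolute path.
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }' |
    while read -r lib; do
        [ -e "$root$lib" ] || echo "$lib"
    done
}

# Demonstration with a throwaway chroot that holds only one of two libraries.
jail=$(mktemp -d)
mkdir -p "$jail/lib"
: > "$jail/lib/libpresent.so.1"

missing=$(printf '%s\n' \
    'libpresent.so.1 => /lib/libpresent.so.1 (0x40018000)' \
    'libabsent.so.2 => /lib/libabsent.so.2 (0x4001d000)' |
    missing_in_chroot "$jail")
echo "missing: $missing"
rm -rf "$jail"
```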

After making the hard links, and also the required soft links (library files have a lot of links pointing to the latest version), we try chroot once more.

[root@host fst]# chroot .


Now, all is perfect!

Up to this point, it was all about doing it ourselves. But no one is perfect or complete. If we have done everything we can and still have no solution, then hope for the best - we may not be the only person with this problem.

4 Google is your friend

The Internet is a vast collection of information, and search engines like google help us locate what we need within a short time. I personally consider google the best search engine. Still, there are some problems. Google is not an intelligent robot; it can't guess what is in your head or what you are looking for. You must present your question in the 'most obvious way' without losing the context information. That is where 'effective googling' becomes important.

Google has a lot of operators that can fine-tune our search. For example, to restrict results to a single site, we can add site:domain_name to the query. To search for an exact phrase of more than one word, we can enclose it in quotes. Google normally drops common words like 'and', 'or', 'when' etc. from the search pattern. At the time of writing, a '+' just before the word (+word_to_be_included) forced it into the search; Google has since retired that operator, and putting the word in quotes serves the same purpose.


At times, searching with the exact error message won't help. Then we have to work out what the 'real problem' is and search for that.

5 Conclusion

Though articles like this can give guidelines, experience matters here. The more experience you have working with linux machines, the faster you will reach a solution. The difference between experienced people and others is the direction of their thinking. Knowing more about the system means you can more easily guess the problem areas and concentrate your thinking there. That said, there are cases where none of the above steps work; then we are probably the first to hit that error. In such a case, someone who works with the machine regularly has to backtrack through the tasks recently done on it. Newly installed software, configuration changes or anything similar could be the root cause.

Author Bio

Jemshad OK worked for close to 3 years in a tech support company for web hosts and ISPs. He now works at Yahoo.
