"There are no problems, only income opportunities" - Tony Lawrence
This is a general troubleshooting guide. You may
have stumbled across it searching for "troubleshooting guide" or
"troubleshooting tips" for some specific problem; this is a general
guide for Linux and Unix troubleshooting.
Sometimes computer problems can feel like Charles Dickens' dying
woman- you know something is wrong, but you have no idea where or
how to start looking for it. This article offers help for that.
Some of this is general enough to be operating system non-specific,
but Unix/Linux is going to get most of the attention.
This article mostly avoids specifics: if your printer is
garbling output after 20 pages, that's a flow control problem and
isn't discussed here (it is discussed in the Serial Printers article). What this is
about is general troubleshooting and the tools and techniques that
can help you track down any problem.
Up to date?
If you are installing new hardware, or having trouble with
anything new, do you have the right drivers for any hardware being
added? Just because the OS HAS a driver doesn't mean that it is a
GOOD driver- for example, SCO OSR5 had a driver for Intel Pro100
cards, but if you wanted something that WORKED, you had to go to
SCO's FTP site and
download a good driver.
On a similar note, do you have the current recommended patches
and updates? Sometimes this is all it takes to fix your problem.
It's certainly worth checking.
The first rule of troubleshooting is to do no harm. Another way
to put that is to say "don't trample all over the evidence". If you
aren't careful and methodical in your approach, you may destroy
clues that could help you narrow down the source of your problem.
For example, Unix and Linux keep a "last accessed" date for files
and directories. You can see that date by using "ls -u filename" or
"ls -du directoryname". That date can be a critical clue as we'll
see in a moment, but it's so easy to lose it. Try this experiment:
cd /tmp
mkdir testdate
ls -lud testdate
# wait a minute, then run it again:
ls -lud testdate
# still the same date
ls -l testdate
# nothing to see in there, but
ls -lud testdate
# now the access date has changed
The same change will take place to an ordinary file if you cat
it, copy it or access it in any way (because it's the access time
that changes whenever the file's data is read).
(Confused? The "ls -lud" reads the /tmp directory to get info
about "testdate". The "ls -l" looks into "testdate" itself, so that's
what updates the access date of "testdate".)
Needing or wanting to know when something was last accessed
comes up all the time, but just a few examples might give you some
idea:
A directory was supposed to be on the backup but
isn't. Did the backup program access that directory?
A misbehaving program is supposed to use a
certain file during its startup. Did it?
What files does a program try to use? Knowing
this can sometimes help you track down where a program is failing
when it is too dumb to tell you itself.
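For the backup question, "find" can answer it quickly. This is a sketch- the path and age are illustrative, and remember that many Linux systems mount filesystems with options like "noatime" or "relatime" that suppress or limit access-time updates:

```shell
# List files under /etc (illustrative path) accessed within the last day.
find /etc -type f -atime -1 2>/dev/null
```

If the backup ran last night and nothing in the directory shows a recent access time, the backup probably never read it.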
Unix systems keep two other dates: modify, and inode change. The
modified date is what you see when you use "ls -l" and is the date
that the file has been changed (or created, if nobody has changed
it since). The inode change date (ls -lc) reflects the time that
permissions or ownership have changed (but note that anything that
affects the modified time also affects this).
Some systems have a command line "stat" command that shows all
three times (and more useful info) at once. Here's the output from
"stat" on a Red Hat Linux system:
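What you'd see looks something like this- the file name and dates here are made up for illustration; run stat on one of your own files to see the real thing:

```shell
touch /tmp/stat-demo
stat /tmp/stat-demo
# Output resembles:
#   File: /tmp/stat-demo
#   Size: 0    Blocks: 0    IO Block: 4096  regular empty file
# Access: Fri Aug 11 14:01:10 2000
# Modify: Fri Aug 11 14:01:10 2000
# Change: Fri Aug 11 14:01:10 2000
```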
Note that "Modify" and "Change" are the same. That's because
"Change" will also reflect the time that a file was modified.
However, if we used chmod or chown on this file, the "Change" date
would reflect that and "Modify" would not.
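You can watch that happen yourself (the file name is arbitrary):

```shell
# Create a file, wait a moment, then chmod it: the change time (ls -lc)
# moves forward, but the modify time (ls -l) stays put.
touch /tmp/ctime-demo
sleep 2
chmod 600 /tmp/ctime-demo
ls -l /tmp/ctime-demo    # modify time: still the creation time
ls -lc /tmp/ctime-demo   # change time: updated by the chmod
```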
You can get a similar listing on SCO, but you need to know the
inode (ls -i), the filesystem the file is on, and then run fsdb on
that file system. It's not usually worth the trouble, but it does
also tell you where the file's disk blocks are to be found:
# cd /tmp
# l -i y
33184 -rw-r--r-- 1 root sys 143 Aug 11 14:01 y
(33184 is the inode number of the file "y")
# fsdb /dev/root
/dev/root(): HTFS File System
FSIZE = 1922407, ISIZE = 480608
i#: 33184 md: f---rw-r--r-- ln: 1 uid: 0 gid: 3 sz: 143
a0:337145 a1: 0 a2: 0 a3: 0 a4: 0 a5: 0 a6: 0
a7: 0 a8: 0 a9: 0 a10: 0 a11: 0 a12: 0
at: Tue Aug 15 11:06:35 2000
mt: Fri Aug 11 14:01:10 2000
ct: Fri Aug 11 14:01:10 2000
This tells me the three times, and also that the entire file is
located in one block (block 337145). If it were larger, and the
next block was not 337146, I'd also know that the file is
fragmented (once you get above a9, the rules change; see BFIND.C for a brief introduction to indirect blocks).
The inode time can be very illuminating: suppose a problem
started yesterday, and you can see that "ls -c" shows an inode
change then but no "ls -l" change: it might be that ownership or
permissions have been changed and are causing your problem.
Of course, if you don't know what the ownership or permissions
should be, this may not help a lot. Some systems have a database of
file ownership and permissions and can report discrepancies and
even return the files to their proper state. SCO Unix, for example,
can use the "custom" command to verify software. Old Xenix systems
had "fixperm". Linux systems using RPM can verify packages; other
Linux package managers have similar capabilities. Add-on security
packages like COPS and others can
also be useful: although their concern is the security aspects of
files changing, their watchfulness can be useful in troubleshooting,
too.
If you need to know what files a process is using, the "lsof"
command (standard with Linux, available from Skunkware for SCO) can help.
Here's an example from a Linux system:
# lsof -p 1748
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
httpd 1748 root cwd DIR 8,5 4096 2 /
httpd 1748 root rtd DIR 8,5 4096 2 /
httpd 1748 root txt REG 8,6 242860 361425 /usr/sbin/httpd
httpd 1748 root mem REG 8,5 494250 216911 /lib/ld-2.2.4.so
httpd 1748 root mem REG 8,6 10007 264042 /usr/lib/apache/mod_vhost_alias.so
httpd 1748 root mem REG 8,6 8169 263597 /usr/lib/apache/mod_env.so
httpd 1748 root mem REG 8,6 17794 263604 /usr/lib/apache/mod_log_config.so
httpd 1748 root mem REG 8,6 7562 263603 /usr/lib/apache/mod_log_agent.so
httpd 1748 root mem REG 8,6 8558 263605 /usr/lib/apache/mod_log_referer.so
httpd 1748 root mem REG 8,6 8142 263596 /usr/lib/apache/mod_dir.so
httpd 1748 root mem REG 8,6 370 117962 /usr/lib/locale/en_US/LC_IDENTIFICATION
httpd 1748 root mem REG 8,5 531205 110159 /lib/i686/libpthread-0.9.so
(many lines deleted)
This tool can also list open network connections. For example,
if you need to know what processes are listening on port 137, you
could do:
# lsof -i :137
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
nmbd 903 root 5u IPv4 1255 UDP *:netbios-ns
nmbd 903 root 7u IPv4 1260 UDP apl:netbios-ns
Another piece of evidence that can be very helpful is when
someone has logged in. The "last" command will give you that
information, including how long they were on the system. But "last"
gets its data from /etc/wtmp, so you may want to get information
about wtmp before running last (keep in mind that wtmp is affected
by logins and logouts, so unless you were logged in BEFORE whatever
problem you are looking for started, your login has changed wtmp's
dates).
How long the system itself has been up is available from "w" or
"uptime". Other evidence you may want to collect before beginning a
trouble search includes:
System logs may be very useful. However, before you look at
them, get the "ls -l", "ls -c" and "ls -u" from the logs- this
tells you when the log was last used, etc.- if there's no current
information and there should be, those dates are important
evidence.
You need to know what to look for in the logs. SCO makes it
easier by using words like "CONFIG" "NOTICE" and "WARNING" that you
can grep for, but on other OSes you may have to just look manually
until you can figure out what sort of key words they would use.
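On a Linux system, a first pass might look like this- the keyword list and the log path are guesses you will refine for your own system:

```shell
# Scan the main system log for common trouble words (the path varies by
# distribution; it may be /var/log/messages or /var/log/syslog).
grep -iE 'error|warn|fail|panic' /var/log/messages 2>/dev/null | tail -20
```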
Another often overlooked clue is "what processes are using this
file?". You get the answer to that with "fuser", which will return
a list of the processes using a file ("lsof" is another very useful
tool in this context).
Software problems can be tough. If you have "trace" (Linux
"strace", Mac OS X "ktrace" and "kdump"), knowing what files a
program tries to read and write can be very useful. A first attempt
is to run the program like this:
trace badprog 2> /tmp/badprog.trace
Then examine /tmp/badprog.trace, particularly noting access,
open, read and write calls. These will all have the general format
of the call, its arguments, and a return value, something like this
(the details will vary):
open("/etc/default/lang", O_RDONLY) = 4
read(4, "...", 1024) = 437
The return value of the call is your clue- if it's positive,
it's usually OK. For example, the lines above mean that
/etc/default/lang was opened correctly (we would have got a
negative number instead of 4 if it failed) and then 437 bytes were
read from it- see the "read(4.."? That 4 means it's reading
/etc/default/lang. Later on, if it opens "/usr/lib/lang/C/C/C/collate"
and the return value is "5", the next read is from that file
(because it's "read(5.."). As you can see, you really don't need to
be a programmer or even understand much about this to be able to
(possibly) find a problem here (do watch for "close", though- if
you aren't paying attention you won't know what file is being read
because the numbers get reused).
Sometimes it's helpful to know WHAT files got modified by an
application. To find that out, this can help:
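One common approach is to touch a timestamp file, run the application, and then ask "find" for anything newer. This is just a sketch- the echo stands in for running your suspect program, and you'd normally search the directories the program might write to:

```shell
touch /tmp/stamp                   # reference timestamp
sleep 1
echo test > /tmp/demo-output       # stand-in for running the suspect program
find /tmp -maxdepth 1 -newer /tmp/stamp -type f 2>/dev/null
```

Everything the program modified after the touch shows up in the find output.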
The KISS ("keep it simple, stupid") principle is always good to
follow. Removing unnecessary hardware, making the network smaller,
etc. are good examples. Dirk Hart posted this in response to a
problem that involved one of those nasty sub-nets that make our
lives difficult:
In this situation, I would use a netmask of 255.255.255.0 on the '486
(and on another pc to avoid confusion) and determine that the card is
indeed working on the 486. _Then_ I would worry about binary arithmetic.
'It don't mean a thing if it ain't got that ping'
Testing Specific Systems
You may have a very good idea of what you need to concentrate
on. If a printer isn't working, bad disk blocks aren't a likely
suspect. However, the problem may be less obvious: data corruption
in the middle of a file could be bad disk blocks, bad memory, bad
cpu, bad disk controller, a bad DMA controller.. how do you narrow
this stuff down? Well, honestly, sometimes it can be hellishly
difficult, particularly if you are pressed for time. If you don't
know where to start, and you aren't sure if the problem is hardware
or software, checking the hardware is probably the best place to
start. Unfortunately, some of the things you'll need to do for that
require that the system be in single user mode.
Single user mode does NOT mean that just one person is
logged in. Specifically, it means that the system has been brought
to init state 1, usually by typing "init 1" while logged in as root
at the console.
Things you can check while still multi-user
Log files. On some systems, "dmesg" gives you a lot of
information, on others (SCO, for instance) it doesn't give you much
and you have to look at the logs yourself. You'll find these logs
in /var/log or /usr/adm (SCO). The "syslog" is often particularly
useful. If printing is the problem, the printlogs are in
/usr/spool/lp/logs (SCO), or the "lf" entry in /etc/printcap points
you to them.
Network statistics. If the network isn't working right, all
kinds of other things may be in trouble. To be good at this, you
really need to understand networking more than I want to try to
cover here, but I will cover some basic points:
If you can ping 127.0.0.1, your tcp/ip is working, but that says
NOTHING about your NIC. As pinging your own ip address gets
re-routed to 127.0.0.1, pinging your own ip address won't prove
anything about your NIC either.
Always try netstat and arp commands with the "-n" flag to
avoid name resolution lookups- this keeps everything local and
avoids confusing delays.
You can see what network ports are in use with "netstat -an"
(lsof is also useful for this). Let's just take a quick look:
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp 0 0 188.8.131.52.1254 184.108.40.206.119 ESTABLISHED
tcp 0 0 220.127.116.11.1252 18.104.22.168.119 CLOSE_WAIT
tcp 0 0 22.214.171.124.1216 126.96.36.199.80 CLOSE_WAIT
tcp 0 0 10.1.36.3.1085 10.1.36.100.23 ESTABLISHED
tcp 0 0 *.80 *.* LISTEN
What can I tell from this? First, I'm reading news from
188.8.131.52. How do I know that? See the "119" that follows that
address? If I "grep 119 /etc/services", I get:
nntp 119/tcp readnews untp # USENET News Transfer Protocol
I also was reading news from another site, but I'm not
presently- that's why that line says "CLOSE_WAIT" rather than
"ESTABLISHED". I had a connection to a web page (the "80"- grep it
from /etc/services if you don't recognize it) and I have a telnet
session (23) open to 10.1.36.100. This machine also has a
web-server, and it's LISTENing on port 80. Of course, none of this
means anything unless you know what's SUPPOSED to be going on. So,
if the web server isn't working, and you "netstat -an | grep
LISTEN" and don't see *.80 in the list, that's why it isn't
working. Now WHY isn't it working? If you try to start it and it
fails, its logs will likely tell you what the problem is. If not,
then maybe trace or strace can give you a clue.
A system that is running, but slowly, can be hard to figure out.
The slowness is either coming from the CPU or the hard drive. The
"sar" program (available for Linux, too) will let you figure out
which pretty quickly. First try "sar 5 5". That gives you
information about your CPU. Then, "sar -d 5 5" will tell you about
your hard drives. The only problem is that, unless it's flat
obvious (CPU at 0% idle or hard drives showing 1,000 msec in the
avserv column), you don't know if what you are seeing is normal or
abnormal for this machine running its typical load. If sar has been
set up and employed properly, you'll have historical data (in
/var/adm/sa on SCO) to compare, but otherwise you have to just use
your best judgement, and if you haven't had much experience on
similar systems, you just won't know. See Sarcheck Review (May 1999) for more on this.
If CPU is the problem, a simple "ps -e | sort -r +2 | head" can
tell you a lot, particularly if you put it in a loop like this:
while :
do
ps -e | sort -r +2 | head
sleep 10
done
If that shows a process gaining a substantial amount of time
during the 10 second sleep, that process is using a lot of time- if
it gained 5 seconds, for example, it is using 50% of your cpu! You
might also use "ps -ef" and look at the "C" column (just before the
time)- that's total CPU usage.
If you are slow on a network connection, "netstat -in" and (on
SCO) "llistat -l" can show you problems. Don't let DNS issues
confuse you: if, for example, a telnet session is slow to connect
but then works fine, that's DNS- either the connecting machine
can't resolve the server's name or vice-versa. Remember that: the
server is going to try to resolve the name of the client, and if it
can't, there will be a delay- possibly a long delay.
Sometimes it helps to know just what equipment is in the
machine. SCO has the "hw -v" command and "hwconfig", Windows has
its Control Panel -> System -> Device Manager, and Linux has
lsdev, lspci, pnpdump and, of course, dmesg.
If you are having troubles, particularly during an installation,
take out any hardware you don't need right now. You can always put
it in later.
If you can identify a good-sized file that is not being used by
anyone (check with fuser or lsof), running repeated "sum -r"s on
that file should always produce the same result. If it does not,
and you are certain no-one else is modifying the file, then the
hard disk, memory, disk controller, dma controller or cpu could be
suspect. Gee, doesn't that narrow it down? Well, maybe we can get a
little closer.
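The basic consistency test is just repetition- the echo here stands in for your own large, idle file:

```shell
# Checksum the same file several times; on healthy hardware every run
# must print the identical sum.
echo "some test data" > /tmp/sumtest   # stand-in for a real large file
for i in 1 2 3
do
  sum -r /tmp/sumtest
done
```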
The first time you sum the file, the information should be read
from the hard drive directly unless it is already in cache (to
effectively flush the cache, you need to know how big it is so you
can sum, or just cat to /dev/null, some other, larger file which
will overwrite any trace of this file). On SCO, the cache size can
be found by
grep "i/o bufs" /usr/adm/messages
On linux "dmesg | grep cache" will get you what you need. If you
pick your file size to be slightly less than cache, you've got a
good shot at getting it all into memory (assuming no one else is
using the machine just then).
The second time you sum, some portion or even all of the file
will be read from memory, so if the sums were constant after
flushing cache, but not otherwise, memory would be suspect.
You might get a clue about DMA by testing other devices that use
it- the floppy disk is a good candidate: if data can be reliably
written and read from floppy (use sum -r to check it) but hard
drive data is changing, then it's probably NOT memory or DMA at
fault.
Things you need to be in single-user mode for
The first thing you will want to do is run fsck. Except for
root, file systems should be unmounted when running this on them-
you could umount /home while multiuser, of course, but since you
are going to be checking all of them, you may as well be
single-user. Be sure to check the man page; SCO, for example,
doesn't do a full fsck on their modern versions unless you specify
"-ofull" (see Filesystems for
a more complete discussion of fsck).
What does fsck tell you? Well, if you have hard disk problems,
fsck is going to trip over them, so that can be useful. Because
fsck uses a fair amount of memory as it's working, memory problems
will probably affect it -unfortunately, the effect may be
unpleasant, so if you suspect that memory could be an issue, try
creating a floppy file system first ( just "mkfs /dev/fd0" is
sufficient ) and fsck it- that won't make you 100% sure that fsck
isn't going to ruin your life for reasons beyond its control, but
it gives you a little more confidence. I wouldn't fret over this-
if you've had bad memory problems, things on your disk are likely
not healthy already.
After fsck, you want to check the hard drive. SCO has "badtrk",
Linux has "badblocks". Read the man pages carefully- you do NOT
want to accidentally destroy your system, and these things can.
If your problems began after linking a new kernel, maybe you
should boot with the previous version. On SCO, you automatically
get "unix.old"; just type that at the Boot prompt. On Linux, you
get the old kernel if you bothered to arrange for that in lilo.conf
(or whatever your boot manager uses). Try hitting TAB at the boot
prompt to see if you have other choices.
Maybe you still can't figure it out. Some file is getting
trashed for unknown reasons and you can't get a handle on it.
Perhaps you can set up some cron or background process to try to
catch whatever is going on in the act. Something like this:
while :
do
sum -r suspect_file
fuser suspect_file   # or lsof suspect_file
sleep 60
done > /tmp/mylog &
You can get more sophisticated; if I have to leave this for days
or weeks I send the sum to temporary files, compare them and delete
them if nothing has changed; or I'll collect the fuser output and
run it through "ps" so I can see who or what was responsible.
Seeing the invisible
Sometimes you need to see what's really going on without
interpretation. For dumb terminals, you can often turn on a
"monitor" mode in the terminal's setup that will then display the
hex characters of any control characters rather than acting on
them. For a telnet session, you can escape to the telnet prompt
(generally CTRL-] ) and type "set netdata", then return to your
session; the data flowing by will now be displayed in hex.
Another way to capture everything is to use "script". For
example, "script /tmp/mystuff". Then do whatever you are doing, and
when you are done, CTRL-D to end script. The /tmp/mystuff contains
everything that happened, including control characters; you can
examine it with vi or hd.
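On Linux, "script" can also capture a single command non-interactively with its "-c" flag (a util-linux feature; the command and file names here are illustrative):

```shell
# Capture everything the command writes, control characters included.
script -q -c "ls -l /tmp" /tmp/mystuff
# Inspect the raw bytes (od -c is a portable stand-in for hd):
od -c /tmp/mystuff | head
```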