APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed
RSS Feeds RSS Feeds











(OLDER) <- More Stuff -> (NEWER) (NEWEST)
Printer Friendly Version
->
-> Unix and Linux TroubleShooting Tips


Unix and Linux Troubleshooting Tips




"I think there's a pain somewhere in the room"- Hard Times by Charles Dickens

It don't mean a thing if it ain't got that ping- Dirk Hart

A clever person solves a problem. A wise person avoids it. -- Einstein (from Software Engineering Proverbs)

There are no problems, only income opportunities-Tony Lawrence

This is a general trouble shooting guide. You may have stumbled across it searching for "troubleshooting guide" or "troubleshooting tips" for some specific problem; this is a general guide for Linux and Unix troubleshooting.

Sometimes computer problems can feel like Charles Dickens' dying woman- you know something is wrong, but you have no idea where or how to start looking for it. This article offers help for that. Some of this is general enough to be operating system non-specific, but Unix/Linux is going to get most of the attention.

This article mostly avoids specifics: if your printer is garbling output after 20 pages, that's a flow control problem and isn't discussed here (it is discussed in the Serial Printers article). What this is about is general troubleshooting and the tools and techniques that can help you track down any problem.

Up to date?

If you are installing new hardware, or having trouble with anything new, do you have the right drivers for any hardware being added? Just because the OS HAS a driver doesn't mean that it is a GOOD driver- for example, SCO OSR5 had a driver for Intel Pro100 cards, but if you wanted something that WORKED, you had to go to SCO's FTP site and download a good driver.

On a similar note, do you have the current recommended patches and updates? Sometimes this is all it takes to fix your problem. It's certainly worth checking.

Evidence

The first rule of troubleshooting is to do no harm. Another way to put that is to say "don't trample all over the evidence". If you aren't careful and methodical in your approach, you may destroy clues that could help you narrow down the source of your problem. For example, Unix and Linux keep a "last accessed" date for files and directories. You can see that date by using "ls -u filename" or "ls -du directoryname". That date can be a critical clue as we'll see in a moment, but it's so easy to lose it. Try this experiment: cd /tmp and make a new directory "testdate". Do

cd /tmp
mkdir testdate
touch testdate/a
ls -lud testdate
sleep 60
ls -lud testdate
# still the same date
ls -l testdate
# nothing to see in there, but
ls -lud testdate
# now the access date has changed
 

The same change will take place to an ordinary file if you cat it, copy it or access it in any way (because it's the access date!).

(Confused? The "ls -lud" reads the /tmp directory to get info about "testdate". The "ls -l" looks into "testdate" so that's access.)

Needing or wanting to know when something was last accessed comes up all the time, but just a few examples might give you some ideas:

A directory was supposed to be on the backup but isn't. Did the backup program access that directory?

A misbehaving program is supposed to use a certain file during its startup. Did it?

What files does a program try to use? Knowing this can sometimes help you track down where a program is failing when it is too dumb to tell you itself.

Unix systems keep two other dates: modify, and inode change. The modified date is what you see when you use "ls -l" and is the date that the file has been changed (or created, if nobody has changed it since). The inode change date (ls -lc) reflects the time that permissions or ownership have changed (but note that anything that affects the modified time also affects this).

Some systems have a command line "stat" command that shows all three times (and more useful info) at once. Here's the output from "stat" on a Red Hat Linux system:

[tony@linux tony]$ stat .bashrc
  File: ".bashrc"
  Size: 124          Filetype: Regular File
  Mode: (0644/-rw-r--r--)         Uid: (  500/    tony)  Gid: (  500/    tony)
Device:  3,6   Inode: 19999     Links: 1    
Access: Fri Aug 18 07:07:17 2000(00000.00:00:13)
Modify: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Change: Mon Oct 25 03:51:14 1999(00298.03:16:16)
 

Note that "Modify" and "Change" are the same. That's becaue "Change" will also reflect the time that a file was modified. However, if we used chmod or chown on this file, the "Change" date would reflect that and "Modify" would not.

You can get a similar listing on SCO, but you need to know the inode (ls -i), the filesystem the file is on, and then run fsdb on that file system. It's not usually worth the trouble, but it does also tell you where the file's disk blocks are to be found:

# cd /tmp
# l -i y
33184 -rw-r--r--   1 root     sys          143 Aug 11 14:01 y
(33184 is the inode number of the file "y")
# fsdb /dev/root
/dev/root(): HTFS File System
FSIZE = 1922407, ISIZE = 480608
33184i
i#: 33184  md: f---rw-r--r--  ln:    1  uid:    0  gid:    3  sz:      143
a0:337145  a1:     0  a2:     0  a3:     0  a4:     0  a5:     0  a6:     0  
a7:     0  a8:     0  a9:     0  a10:     0  a11:     0  a12:     0  
at: Tue Aug 15 11:06:35 2000
mt: Fri Aug 11 14:01:10 2000
ct: Fri Aug 11 14:01:10 2000
 

This tells me the three times, and also that the entire file is located in one block (block 337145). If it were larger, and the next block was not 337146, I'd also know that the file is fragmented (once you get above a9, the rules change; see BFIND.C for a brief introduction to that).

The inode time can be very illuminating: suppose a problem started yesterday, and you can see that "ls -c" shows an inode change then but no "ls -l" change: it might be that ownership or permissions have been changed and are causing your problem.

Of course, if you don't know what the ownership or permissions should be, this may not help a lot. Some systems have a database of file ownership and permissions and can report discrepancies and even return the files to their proper state. SCO Unix, for example, can use the "custom" command to verify software. Old Xenix systems had "fixperm". Linux systems using RPM can verify packages; other Linux package managers have similar capabilities. Add-on security packages like COPS and others can also be useful: although their concern is the security aspects of files changing, their watchfulness can be useful in troubleshooting contexts also.

If you need to know what files a process is using, the "lsof" command (standard with Linux, available from Skunkware for SCO) can help. Here's an example from a Linux system:

# lsof -p 1748

COMMAND  PID USER   FD   TYPE DEVICE    SIZE   NODE NAME
httpd   1748 root  cwd    DIR    8,5    4096      2 /
httpd   1748 root  rtd    DIR    8,5    4096      2 /
httpd   1748 root  txt    REG    8,6  242860 361425 /usr/sbin/httpd
httpd   1748 root  mem    REG    8,5  494250 216911 /lib/ld-2.2.4.so
httpd   1748 root  mem    REG    8,6   10007 264042 /usr/lib/apache/mod_vhost_alias.so
httpd   1748 root  mem    REG    8,6    8169 263597 /usr/lib/apache/mod_env.so
httpd   1748 root  mem    REG    8,6   17794 263604 /usr/lib/apache/mod_log_config.so
httpd   1748 root  mem    REG    8,6    7562 263603 /usr/lib/apache/mod_log_agent.so
httpd   1748 root  mem    REG    8,6    8558 263605 /usr/lib/apache/mod_log_referer.so
httpd   1748 root  mem    REG    8,6    8142 263596 /usr/lib/apache/mod_dir.so
httpd   1748 root  mem    REG    8,6     370 117962 /usr/lib/locale/en_US/LC_IDENTIFICATION
httpd   1748 root  mem    REG    8,5  531205 110159 /lib/i686/libpthread-0.9.so
(many lines deleted)
 

This tool can also list open network connections. For example, if you need to know what process are listening on port 137 , you would use:

# lsof -i :137
COMMAND PID USER   FD   TYPE DEVICE SIZE NODE NAME
nmbd    903 root    5u  IPv4   1255       UDP *:netbios-ns 
nmbd    903 root    7u  IPv4   1260       UDP apl:netbios-ns 
 

Another piece of evidence that can be very helpful is when someone has logged in. The "last" comand will give you that information, including how long they were on the system. But "last" gets its data from /etc/wtmp, so you may want to get information about wtmp before running last (keep in mind that wtmp is affected by logins and logouts, so unless you were logged in BEFORE whatever problem you are looking for started, your login has changed wtmp).

How long the system itself has been up is available from "w" or "uptime". Other evidence you may want to collect before beginning a trouble search includes:

  • df -v
  • du -s
  • memsize

System logs may be very useful. However, before you look at them, get the "ls -l", "ls -c" and "ls -u" from the logs- this tells you when the log was last used, etc.- if there's no current information and there should be, those dates are important clues..

You need to know what to look for in the logs. SCO makes it easier by using words like "CONFIG" "NOTICE" and "WARNING" that you can grep for, but on other OS'es you may have to just look manually until you can figure out what sort of key words they would use.

Another often overlooked clue is "what processes are using this file?". You get the answer to that with "fuser", which will return a list of the processes using a file ("lsof" is another very useful tool in this context).

Software problems can be tough. If you have "trace" (Linux "strace", Mac OS X "ktrace" and "kdump"), knowing what files a program tries to read and write can be very useful. A first attempt is to run the program like this:

trace badprog 2> /tmp/badprog.trace
 

Then examine /tmp/badprog.trace, particularly noting access, open, read and write calls. These will all have the general format like this:

_open ("/etc/default/lang", 0x0) = 4
read (4, "#\t@(#) lang.src 58.1 96/10/09 \n#\n#\tCopyr".., 1024) = 437
...
_open ("/usr/lib/lang/C/C/C/collate", 0x0) = 5
read (5, "".., 1) = 1
 

The return value of the call is your clue- if it's positive, it's usually OK.For example, the lines above mean that /etc/default/lang was opened correctly ( we would have got a negative number instead of 4 if it failed) and then 437 bytes were read from it- see the "read(4.." ? That 4 means it's reading /etc/default/lang. Later on it opens "/usr/lib/lang/C/C/C/collate" and the return value is "5", the next read is from that file (because it's "read(5..". As you can see, you really don't need to be a programmer or even understand much about this to be able to (possibly) find a problem here (do watch for "close", though- if you aren't paying attention you won't know what file is being read because the numbers get reused).

Sometimes it's helpful to know WHAT files got modified by an application. To find that out, this can help:

touch /tmp/rightnow
runapplication
find / -newer /tmp/rightnow
 

Keep it simple, stupid

The kiss ("keep it simple stupid") principle is always good to follow. Removing unnecessary hardware, making the network smaller, etc. are good examples. Dirk Hart posted this in response to a problem that involved one of those nasty sub-nets that make our brains hurt:


In this situation, I would use a netmask of 255.255.255.0 on the '486
(and on another pc to avoid confusion) and determine that the card is
indeed working on the 486. _Then_ I would worry about binary arithmetic.

'It don't mean a thing if it ain't got that ping'

Regards, Dirk
 

Testing Specific Systems

You may have a very good idea of what you need to concentrate on. If a printer isn't working, bad disk blocks aren't a likely suspect. However, the problem may be less obvious: data corruption in the middle of a file could be bad disk blocks, bad memory, bad cpu, bad disk controller, a bad DMA controller.. how do you narrow this stuff down? Well, honestly, sometimes it can be hellishly difficult, particularly if you are pressed for time. If you don't know where to start, and you aren't sure if the problem is hardware or software, checking the hardware is probably the best place to start. Unfortunately, some of the things you'll need to for that require that the system be in single user mode.

Single user mode does NOT mean that just one person is logged in. Specifically, it means that the system has been brought to init state 1, usually by typing "init 1" while logged in as root at the console.

Things you can check while still multi-user

Log files. On some sytems, "dmesg" gives you a lot of information, on others (SCO, for instance) it doesn't give you much and you have to look at the logs yourself. You'll find these logs in /var/log or /usr/adm (SCO). The "syslog" is often particularly useful. If printing is the problem, the printlogs are in /usr/spool/lp/logs (SCO), or the "lf" entry in /etc/printcap points you to them.

Network statistics. If the network isn't working right, all kinds of other things may be in trouble. To be good at this, you really need to understand networking more than I want to try to cover here, but I will cover some basic points:

  • If you can ping 127.0.0.1, your tcp/ip is working but that says NOTHING about your NIC card. As pinging your own ip address will get re-routed to 127.0.0.1, pinging your own ip address won't prove anything about your NIC either.
  • Always try netstat and arp commands with the "-n" flags to avoid name resolution lookups- this keeps everything local and aids your troubleshooting.

You can see what network ports are in use with "netstat -an" (lsof is also useful for this). Let's just take a quick look :

Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  64.13.44.12.1254       209.167.40.69.119      ESTABLISHED
tcp        0      0  64.13.44.12.1252       63.209.14.66.119       CLOSE_WAIT
tcp        0      0  64.13.44.12.1216       216.71.1.37.80         CLOSE_WAIT
tcp        0      0  10.1.36.3.1085         10.1.36.100.23         ESTABLISHED
tcp        0      0  *.80                  *.*                    LISTEN
 

What can I tell from this? First, I'm reading news from 209.167.40.69. How do I know that? See the "119" that follows that address? If I "grep 119 /etc/services, I get:

nntp            119/tcp         readnews untp   # USENET News Transfer Protocol
 

I also was reading news from another site, but I'm not presently- that's why that line says "CLOSE_WAIT" rather than "ESTABLISHED". I had a connection to a web page (the "80"- grep it from /etc/services if you don't recognize it) and I have a telnet session (23) open to 10.1.36.100. This machine also has a web-server, and it's LISTENing on port 80. Of course, none of this means anything unless you know what's SUPPOSED to be going on. So, if the web server isn't working, and you "netstat -an | grep LISTEN" and don't see *.80 in the list, that's why it isn't working. Now WHY isn't it working? If you try to start it and it fails, its logs will likely tell you what the problem is. If not, then maybe trace or strace can give you a clue.

A system that is running, but slowly, can be hard to figure out. The slowness is either coming from the CPU or the hard drive. The "sar" program (available for Linux, too) will let you figure out which pretty quickly. First try"sar 5 5". That gives you information about your CPU. Then, "sar -d 5 5" will tell you about your hard drives. The only problem is that, unless it's flat obvious (CPU at 0% idle or hard drives showing 1,000 msec in the avserv column), you don't know if what you are seeing is normal or abnormal for this machine running its typical load. If sar has been set up and employed properly, you'll have historical data (in /var/adm/sa on SCO) to compare, but otherwise you have to just use your best judgement, and if you haven't had much experience on similar systems, you just won't know. See Sarcheck Review(May 1999) for more information.

If CPU is the problem, a simple "ps -e | sort -r +2 | head" can tell you a lot, particularly if you put it in a loop like this:

while true
do
ps -e | sort -r +2 | head
sleep 10
done
 

If that shows a process gaining a substantial amount of time during the 10 second sleep, that process is using a lot of time- if it gained 5 seconds, for example, it is using 50% of your cpu! You might also use "ps -ef" and look at the "C" column (just before the time)- that's total CPU usage.

Skunkware has "iohog", "memhog", and "cpuhog". These can help you pinpoint misbehaving processes. Download them all from ftp://ftp2.sco.com/pub/skunkware/osr5/vols/

If you are slow on a network connection, "netstat -in" and (on SCO) "llistat -l" can show you problems. Don't let DNS issues confuse you: if, for example, a telnet session is slow to connect but then works fine, that's DNS- either the connecting machine can't resolve the server's name or vice-versa. Remember that: the server is going to try to resolve the name of the client, and if it can't, there will be a delay- possibly a long delay.

Sometimes it helps to know just what equipment is in the machine. SCO has the "hw -v" command and "hwconfig", Windows has its Control Panel -> System -> Device Manager, and linux has lsdev, lspci, pnpdump and, of course, dmesg.

See Why is my system slow for more on this.

If you are having troubles, particularly during an installation, take out any hardware you don't need right now. You can always put it in later.

If you can identify a good-sized file that is not being used by anyone (check with fuser or lsof), running repeated "sum -r"'s on that file should always produce the same result. If it does not, and you are certain no-one else is modifying the file, then the hard disk, memory, disk controller, dma controller or cpu could be suspect. Gee, doesn't that narrow it down? Well, maybe we can get a little better:

The first time you sum the file, the information should be read from the hard drive directly unless it is already in cache( to effectively flush the cache, you need to know how big it is so you can sum, or just cat to /dev/null, some other, larger file which will overwrite any trace of this file). On SCO, the cache size can be found by

grep "i/o bufs" /usr/adm/messages
 

On linux "dmesg | grep cache" will get you what you need. If you pick your file size to be slightly less than cache, you've got a good shot at getting it all into memory (assuming no one else is using the machine just then).

The second time you sum, some portion or even all of the file will be read from memory, so if the sum's were constant after flushing cache, but not otherwise, memory would be suspect.

You might get a clue about DMA by testing other devices that use it- the floppy disk is a good candidate: if data can be reliably written and read from floppy (use sum -r to check it) but hard drive data is changing, then it's probably NOT memory or motherboard problems.

Things you need to be in single-user mode for

The first thing you will want to do is run fsck. Except for root, file systems should be unmounted when running this on them- you could umount /home while multiuser, of course, but since you are going to be checking all of them, you may as well be single-user. Be sure to check the man page; SCO, for example, doesn't do a full fsck on their modern versions unless you specify "-ofull" (see Filesystems for a more complete discussion of fsck).

What does fsck tell you? Well, if you have hard disk problems, fsck is going to trip over them, so that can be useful. Because fsck uses a fair amount of memory as it's working, memory problems will probably affect it -unfortunately, the effect may be unpleasant, so if you suspect that memory could be an issue, try creating a floppy file system first ( just "mkfs /dev/fd0" is sufficient ) and fsck it- that won't make you 100% sure that fsck isn't going to ruin your life for reasons beyond its control, but it gives you a little more confidence. I wouldn't fret over this- if you've had bad memory problems, things on your disk are likely not healthy already.

After fsck, you want to check the hard drive. SCO has "badtrk", Linux has "badblocks". Read the man pages carefully- you do NOT want to accidentally destroy your system, and these things can.

Rebooting

If your problems began after linking a new kernel, maybe you should boot with the previous version. On SCO, you automatically get "unix.old"; just type that at the Boot prompt. On Linux, you get the old kernel if you bothered to arrange for that in lilo.conf (or whatever your boot manger uses). Try hitting TAB at the boot prompt to see if you have other choices.

Logging

Maybe you still can't figure it out. Some file is getting trashed for unknown reasons and you can't get a handle on it. Perhaps you can set up some cron or background process to try to catch whatever is going on in the act. Something like this:

while true
do
sleep 300
sum -r suspect_file
fuser (or lsof) suspect file
done > /tmp/mylog &
 

You can get more sophisticated; if I have to leave this for days or weeks I send the sum to temporary files, compare them and delete them if nothing has changed; or I'll collect the fuser output and run it through "ps" so I can see who or what was responsible.

Seeing the invisible

Sometimes you need to see what's really going on without interpretation. For dumb terminals, you can often turn on a "monitor" mode in the terminal's setup that will then display the hex characters of any control characters rather than acting on them. For a telnet session, you can escape to the telnet prompt (generally CTRL-] ) and type "set netdata", then escape again and "set prettydump".

Another way to capture everything is to use "script". For example, "script /tmp/mystuff". Then do whatever you are doing, and when you are done, CTRL-D to end script. The /tmp/mystuff contains everything that happened, including control characters; you can examine it with vi or hd.

My Linux and Unix Troubleshooting E-Book covers much more of this.




If this page was useful to you, please help others find it:  





12 comments




More Articles by - Find me on Google+



Click here to add your comments
- no registration needed!




Tue May 31 11:25:50 2005: 596   anonymous


bad



Tue May 31 12:19:29 2005: 597   TonyLawrence

gravatar
I doubt anyone finds "bad" helpful. If you have specific complaints, list them: we'll try to improve. If you can do better, we'd be happy to publish your article.





Wed Jun 1 13:53:12 2005: 604   BigDumbDinosaur


Maybe by "bad" he meant "good," as in a "bad-ass motorcycle." <Grin>



Wed Feb 28 23:20:00 2007: 2883   anonymous


very helpful.



Thu Jan 8 21:58:43 2009: 5110   xcpe


Very Good article. I found your information very useful and insightful.







Tue Feb 23 12:49:18 2010: 8116   Avinash

gravatar


It was definitely helpful for me, thanks :)



Fri Jun 11 00:06:33 2010: 8696   Veera

gravatar


awesome information...must useful stuff. but i think its obsolete one...i need updated one. thanks to A.P.Lawrence.



Fri Jun 11 02:21:58 2010: 8697   TonyLawrence

gravatar


Other than perhaps mentioning grub instead of LILO, no, it's really not out of date, basic troubleshooting is pretty much the same.



Tue Jul 27 21:44:09 2010: 8863   anonymous

gravatar


Good One. Can you update this vesion



Tue Jul 27 22:21:08 2010: 8864   TonyLawrence

gravatar


My ebook http://aplawrence.com/troubleshootingbook.html is more current and has much more detail.



Fri Dec 24 16:53:10 2010: 9178   anonymous

gravatar


very useful thanks a lot




Don't miss responses! Subscribe to Comments by RSS or by Email

Click here to add your comments


If you want a picture to show with your comment, go get a Gravatar

Kerio Samepage


Have you tried Searching this site?

Unix/Linux/Mac OS X support by phone, email or on-site: Support Rates

This is a Unix/Linux resource website. It contains technical articles about Unix, Linux and general computing related subjects, opinion, news, help files, how-to's, tutorials and more. We appreciate comments and article submissions.

Publishing your articles here

Jump to Comments



Many of the products and books I review are things I purchased for my own use. Some were given to me specifically for the purpose of reviewing them. I resell or can earn commissions from the sale of some of these items. Links within these pages may be affiliate links that pay me for referring you to them. That's mostly insignificant amounts of money; whenever it is not I have made my relationship plain. I also may own stock in companies mentioned here. If you have any question, please do feel free to contact me.

I am a Kerio reseller. Articles here related to Kerio products reflect my honest opinion, but I do have an obvious interest in selling those products also.

Specific links that take you to pages that allow you to purchase the item I reviewed are very likely to pay me a commission. Many of the books I review were given to me by the publishers specifically for the purpose of writing a review. These gifts and referral fees do not affect my opinions; I often give bad reviews anyway.

We use Google third-party advertising companies to serve ads when you visit our website. These companies may use information (not including your name, address, email address, or telephone number) about your visits to this and other websites in order to provide advertisements about goods and services of interest to you. If you would like more information about this practice and to know your choices about not having this information used by these companies, click here.

pavatar.jpg

This post tagged:

       - Administration
       - Basics
       - Install/Upgrade
       - Performance
       - Troubleshooting















My Troubleshooting E-Book will show you how to solve tough problems on Linux and Unix systems!


book graphic unix and linux troubleshooting guide



Buy Kerio from a dealer
who knows tech:
I sell and support

Kerio Connect Mail server, Control, Workspace and Operator licenses and subscription renewals



Click and enter your name and phone number to call me about Kerio® products right now (Flash required)