Of all computer problems, the unresponsive hang is the most annoying and most difficult to trace. There's no crash, no panic: everything just stops dead. The keyboard is useless, telnets just time out - you have no choice but to power cycle the machine.
Well, maybe. If you are running Linux, and if you have Magic SysRq enabled, you might be able to do more. Even SCO has something similar: scodb gives access to a kernel debugger if available. I don't know of anything like that for Mac OS X; there is ddb, but that requires attaching a serial terminal and a recompiled kernel.
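Whether Magic SysRq will help depends on it having been enabled before the hang, so it is worth checking ahead of time. Here is a minimal Python sketch, assuming a Linux box that exposes the setting at /proc/sys/kernel/sysrq; the function names and the short table of bits are my own choices for illustration, and the table lists only the functions most useful for getting out of a hang.

```python
from pathlib import Path

# Assumed location of the setting on Linux.
SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")

# A few of the bitmask values used by the kernel.sysrq setting.
# Only the functions most relevant to recovering from a hang are listed.
SYSRQ_BITS = {
    16: "sync filesystems",
    32: "remount filesystems read-only",
    128: "reboot/poweroff",
}


def describe_sysrq(value):
    """Turn the numeric kernel.sysrq value into a short description."""
    if value == 0:
        return "disabled"
    if value == 1:
        return "all functions enabled"
    allowed = [name for bit, name in SYSRQ_BITS.items() if value & bit]
    return "partially enabled: " + (", ".join(allowed) or "none of sync/remount/reboot")


def read_sysrq(path=SYSRQ_PATH):
    """Read the current setting; raises OSError where the file does not exist."""
    return int(path.read_text().strip())
```

On a Linux machine, `print(describe_sysrq(read_sysrq()))` reports the current state. If it says "disabled", the key combination will do nothing when the machine hangs.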
But let's say none of that is helpful. In that case, the first thing you want to know is "how dead is it?". Is the keyboard totally dead? If it has a Caps Lock light, does the light toggle when you press the key? If not, you may have a motherboard or keyboard problem. Can you press Ctrl-Alt-F3 (Linux and SCO) to switch screens? If so, the OS is still at least partially alive. Can you telnet or ssh to the box? Can you ping it? Do Samba, NFS and other network services still work? These give you clues as to the state of the networking stack.
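The network checks can be run from another machine in one pass. The following is a rough Python sketch rather than a finished tool: the port numbers are the usual defaults for each service and are an assumption about your setup, and `probe_tcp` and `survey` are names I made up for this example. It does not ping; that is easiest done with the ordinary ping command.

```python
import socket

# Usual default ports; adjust for the services the box actually runs.
DEFAULT_SERVICES = {"ssh": 22, "telnet": 23, "samba": 445, "nfs": 2049}


def probe_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def survey(host, services=DEFAULT_SERVICES, timeout=3.0):
    """Probe each named service and return a {name: reachable} mapping."""
    return {name: probe_tcp(host, port, timeout) for name, port in services.items()}
```

Calling `survey("192.0.2.10")` (a placeholder address) returns something like a dictionary of service names mapped to True or False. Keep in mind that the kernel completes the TCP handshake on behalf of a listening daemon, so a True here shows that the kernel's networking is alive, not that the daemon itself will ever answer.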
Ok, you've given up. There's nothing that can be done but a power cycle. Here's another chance to possibly learn something: does a reset exhibit different behaviour than a complete power off? If it takes a power off to get the machine responsive again, how long does it have to be off? Short rest periods might indicate capacitor or register problems: giving the machine a little more time to "bleed off" cures the problem. A need for a longer period off might mean heat problems - are fans malfunctioning or are the insides coated with insulating dust?
No? OK, then maybe something in software is doing this. A "tail" of the system logs may give a clue as to what was happening just before the hang (enable syslog's MARK option if you aren't sure that logging stays running; the periodic mark entries show when it stopped), as may tools like sar. A build-up of unusual system activity prior to the hang might point to its cause. If the hangs repeat, leaving a "ps" running at intervals to log activity can help you zero in on the culprit the next time it happens.
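The periodic "ps" log could look something like this in Python. It is only a sketch: the ps column list assumes a ps that accepts `-eo` with these keywords (procps on Linux does), and the function names, log path and one-minute interval are arbitrary choices of mine. The important detail is the fsync: without it, the last snapshots before a hang may still be sitting in memory when you hit the power switch.

```python
import os
import subprocess
import time

# Assumed ps invocation; adjust the columns to taste.
PS_COMMAND = ("ps", "-eo", "pid,ppid,stat,pcpu,pmem,etime,args")


def log_snapshot(log_path, command=PS_COMMAND):
    """Append one timestamped process listing to log_path and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} ===\n")
        log.write(result.stdout)
        log.flush()
        # Push the data through the OS cache so it survives a power cycle.
        os.fsync(log.fileno())


def run_forever(log_path, interval_seconds=60):
    """Take a snapshot every interval_seconds until the process is killed."""
    while True:
        log_snapshot(log_path)
        time.sleep(interval_seconds)
```

Start it with something like `run_forever("/var/tmp/ps-history.log")` and, after the next hang, read the last few snapshots in the file. The log grows without limit, so prune or rotate it if the hangs are days apart.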
After all that, though, these things are almost (*almost*) always hardware, and more often than not it is power related: bad power supplies are the most common cause I've seen. After that come disk controllers and then motherboards, but nowadays I don't feel it's worth spending a lot of time chasing this sort of thing: move the system to new hardware as quickly as possible. If you then want to spend time investigating possibilities on the old hardware, at least you won't be interfering with normal business. However, given the cost of hardware vs. the cost of labor, even that may not make sense: accept that the whole thing was mysterious, do whatever you need to do to protect any confidential data, and move on. Maybe some parts can be recycled or maybe the machine can move down to less important use, but the cost of messing around with it in its original role just doesn't make sense.
Fri Mar 16 03:54:49 2007: Subject: MarkH
Hello AP Lawrence,
My SuSE Linux 10.1 system would hang every day. I tried adding the following three parameters to the kernel line in the bootloader (via YaST):
nosmp noapic nolapic
Like magic, the problem went away. And, there does not appear to be any impact on performance that I can see. I do not know why this solution works. Nor have I tested it longer than a week so do not know whether it will hang in the future. I hope not.
Hope this helps,
Mark
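For anyone trying the same workaround, it is worth confirming after the reboot that the kernel really was started with the parameters mentioned in this thread (nosmp, noapic, nolapic). A small Python sketch, assuming Linux, where the running kernel's command line can be read from /proc/cmdline; the function names are made up for this example:

```python
from pathlib import Path


def boot_flags_present(cmdline, wanted=("nosmp", "noapic", "nolapic")):
    """Report which of the wanted flags appear as whole words in cmdline."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in wanted}


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    return Path(path).read_text()
```

On the affected machine, `boot_flags_present(read_cmdline())` shows at a glance whether a flag was dropped or mistyped in the bootloader entry.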
Fri Mar 16 14:37:58 2007: Subject: BigDumbDinosaur
My SuSE Linux 10.1 system would hang every day. I tried adding the following three parameters to the kernel line in the bootloader (via YaST):
nosmp noapic nolapic
Like magic, the problem went away. And, there does not appear to be any impact on performance that I can see. I do not know why this solution works. Nor have I tested it longer than a week so do not know whether it will hang in the future. I hope not.
The nosmp might be fixing a problem with an improperly compiled SMP kernel that is assuming that a second MPU is present. However, the others have me wondering.
The APIC, or advanced programmable interrupt controller, is a feature of all modern motherboards that extends the original PC-AT interrupt scheme with its 16 IRQ lines. If you have to disable the APIC in the kernel in order to maintain stability, then you have either a defective MPU, a temperature-related issue or possibly bad power (in other words, the symptoms described above in Tony's original article). Linux's handling of the APIC is inherently sound, so if it were me, I'd change out the motherboard and MPU for fresh parts.
APIC problems have been occasionally noted with some Pentium 4 processors. I'm not sure if these are symptoms of chip errata (manufacturing defects) or what. We stay away from P4s around here due to their long history of flaky performance under Linux.
Fri Mar 16 18:55:35 2007: Subject: MarkH
Hi and thanks for the comments. I've used 2 machines (identical) and both exhibit the same freezing problem. I would be surprised if the power supply was bad or the machines were overheating (they are in a very cool room and the vents are clear), as no freezing happens once the kernel changes have been made. I've bought 10.2 and will try it over the weekend without the kernel changes to see if the problem lies with 10.1.
Thanks,
Mark