APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Mysterious lockups


2006/12/04

Of all computer problems, the unresponsive hang is the most annoying and most difficult to trace. There's no crash, no panic: everything just stops dead. The keyboard is useless, telnets just time out - you have no choice but to power cycle the machine.

Well, maybe. If you are running Linux and have Magic SysRq enabled, you might be able to do more. Even SCO has something similar: scodb gives access to a kernel debugger, if one is available.
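Whether Magic SysRq will be there when you need it is worth checking before the machine hangs. Here is a minimal Python sketch, assuming a Linux kernel that exposes the setting at /proc/sys/kernel/sysrq. The function names are mine, and the bit meanings are those documented for recent kernels; older kernels may only distinguish zero from non-zero.

```python
from pathlib import Path

SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")  # standard location on Linux

# Bitmask meanings on recent kernels (assumption: older kernels may
# treat any non-zero value simply as "enabled").
SYSRQ_BITS = {
    2: "console log level",
    4: "keyboard control (SAK, unraw)",
    8: "process debugging dumps",
    16: "sync",
    32: "remount read-only",
    64: "signal processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nice all real-time tasks",
}


def describe_sysrq(value: int) -> str:
    """Turn the numeric sysrq setting into a readable summary."""
    if value == 0:
        return "disabled"
    if value == 1:
        return "all functions enabled"
    names = [name for bit, name in SYSRQ_BITS.items() if value & bit]
    return "partially enabled: " + ", ".join(names)


def read_sysrq(path: Path = SYSRQ_PATH) -> int:
    """Read the current setting; raises OSError if the file is missing."""
    return int(path.read_text().strip())
```

On a live system, print(describe_sysrq(read_sysrq())) shows which key combinations the kernel will honor.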

But let's say none of that is helpful. In that case, the first thing you want to know is "how dead is it?". Is the keyboard totally dead? If it has a Caps Lock light, does the light toggle as you press the key? If not, you may have a motherboard or keyboard problem. Can you switch screens with Ctrl-Alt-F3 (Linux and SCO)? If so, the OS is still at least partially alive. Can you telnet or ssh to the box? Can you ping it? Do Samba, NFS and similar services still work? The answers give you clues as to the state of the networking stack.
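The network half of that checklist can be run from another machine. Here is a rough Python sketch, assuming the usual port numbers for each service (adjust the hypothetical DEFAULT_PORTS table to whatever the box normally offers). Note that a successful TCP connect only shows that the kernel's networking stack is answering; the daemon behind the port may still be wedged.

```python
import socket

# Hypothetical service list using the conventional port numbers.
DEFAULT_PORTS = {"ssh": 22, "telnet": 23, "smb": 445, "nfs": 2049}


def port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: all count as "no answer".
        return False


def probe(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Check each named service and return {name: answered?}."""
    return {name: port_answers(host, port, timeout)
            for name, port in ports.items()}
```

Calling probe() with the hung machine's address returns a small dictionary; if every entry is False but the box still answers ping, the kernel is probably alive while user space is not.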

Ok, you've given up. There's nothing that can be done but a power cycle. Here's another chance to possibly learn something: does a reset exhibit different behaviour than a complete power off? If it takes a power off to get the machine responsive again, how long does it have to be off? Short rest periods might indicate capacitor or register problems: giving the machine a little more time to "bleed off" cures the problem. A need for a longer period off might mean heat problems - are fans malfunctioning or are the insides coated with insulating dust?

No? OK, then maybe something in software is doing this. A "tail" of the system logs may give a clue as to what was happening just before the hang, as may tools like sar. (If you aren't sure that syslog itself stays running, enable its MARK option so that it writes periodic timestamps; the last mark tells you roughly when logging stopped.) A build-up of unusual system activity prior to the hang might point to its cause. If the hangs repeat, leaving a "ps" running to log activity can help you zero in on the culprit the next time it happens.
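A "ps" logger of that sort can be very small. The following Python sketch is only one way to do it: the function names, the "ps -ef" command and the 60-second interval are my assumptions, not anything prescribed. Each snapshot is appended with a timestamp and forced to disk, so that the last entry survives the power cycle.

```python
import os
import subprocess
import time
from datetime import datetime


def log_snapshot(logfile, command=("ps", "-ef")):
    """Append one timestamped snapshot of the command's output to logfile."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    with open(logfile, "a") as fh:
        fh.write(f"=== {datetime.now().isoformat()} ===\n")
        fh.write(result.stdout)
        fh.flush()
        # Push the data to disk now; after a hang there is no second chance.
        os.fsync(fh.fileno())


def run_forever(logfile, interval=60):
    """Take a snapshot every `interval` seconds until the machine stops."""
    while True:
        log_snapshot(logfile)
        time.sleep(interval)
```

Start run_forever() with a log path on a local disk and leave it running; after the next hang, the last block in the file shows what was running shortly before everything stopped.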

After all that though, these things are almost (*almost*) always hardware, and more often than not it is power related: bad power supplies are the most common cause I've seen. After that come disk controllers and then motherboards, but nowadays I don't feel it's worth spending a lot of time chasing this sort of thing: move the system to new hardware as quickly as possible. If you then want to spend time investigating possibilities on the old hardware, at least you won't be interfering with normal business. However, given the cost of hardware vs. the cost of labor, even that may not make sense: accept that the whole thing was mysterious, do whatever you need to do to protect any confidential data, and move on. Maybe some parts can be recycled or maybe the machine can move down to less important use, but the cost of messing around with it in its original role just doesn't make sense.






3 comments




© Anthony Lawrence







Fri Mar 16 03:54:49 2007: 2916   MarkH


Hello AP Lawrence,

My SuSE Linux 10.1 system would hang every day. I tried adding the following three parameters to the kernel line in the bootloader (via YaST):

   nosmp    noapic    nolapic 


Like magic, the problem went away. And, there does not appear to be any impact on performance that I can see. I do not know why this solution works. Nor have I tested it longer than a week so do not know whether it will hang in the future. I hope not.

Hope this helps,

Mark
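To confirm that boot parameters like the three listed above actually reached the running kernel, you can look at /proc/cmdline after rebooting. A minimal Python sketch, assuming a Linux system (the function names are illustrative only):

```python
def boot_flags_present(cmdline, wanted=("nosmp", "noapic", "nolapic")):
    """Report which of the wanted flags appear on the kernel command line."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in wanted}


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read()
```

Running boot_flags_present(read_cmdline()) on the affected machine shows at a glance whether any of the three was dropped by the bootloader configuration.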



Fri Mar 16 14:37:58 2007: 2917   BigDumbDinosaur


My SuSE Linux 10.1 system would hang every day. I tried adding the following three parameters to the kernel line in the bootloader (via YaST):

nosmp noapic nolapic

Like magic, the problem went away. And, there does not appear to be any impact on performance that I can see. I do not know why this solution works. Nor have I tested it longer than a week so do not know whether it will hang in the future. I hope not.


The nosmp might be fixing a problem with an improperly compiled SMP kernel that is assuming that a second MPU is present. However, the others have me wondering.

The APIC, or advanced programmable interrupt controller, is a feature of all modern motherboards that extends the original PC-AT scheme of 16 interrupt lines. If you have to disable the APIC in the kernel in order to maintain stability, then you either have a defective MPU, a temperature-related issue or possibly bad power (in other words, the symptoms described above in Tony's original article). Linux's handling of the APIC is inherently sound, so if it were me, I'd change out the motherboard and MPU for fresh parts.

APIC problems have occasionally been noted with some Pentium 4 processors. I'm not sure if these are symptoms of chip errata (documented design defects) or what. We stay away from P4s around here due to their long history of flaky performance under Linux.



Fri Mar 16 18:55:35 2007: 2918   MarkH




Hi and thanks for the comments. I've used 2 machines (identical) and both exhibit the same freezing problem. I would be surprised if the power supply was bad or the machines were overheating (they are in a very cool room and the vents are clear), as no freezing happens after the kernel changes have been made. I've bought 10.2 and will try it over the weekend without the kernel changes to see if the problem lies with 10.1.

Thanks,

Mark
