Mysterious panics and freezes


2013/07/26

I was reminded of this subject recently when my iMac crashed and rebooted while I was trying to get a snapshot of a firewall screen to send to a customer. The firewall was running inside VMware Fusion, but I'm not sure what actually caused the crash. Someone from VMware looked at the crash log and said Citrix might have been at fault. We'll find out eventually, but my money is on Fusion because the crashes began right after an upgrade.

Applications can cause crashes, but they usually don't, because they usually aren't intimately involved with the operating system. Whether it's Linux, Windows or even ancient SCO Unix, a panic is not usually caused by a malfunctioning application - applications that even try to touch things they should not are immediately killed off and that's the end of that.

However, virtualization products like Fusion do need deeper hooks, so I'm not arguing against it being at fault. Besides, it always crashes when I use Fusion and never otherwise. It's at different times and I'll be doing different things, but Fusion is involved somehow.

If only every panic crash were so easy to pin down.

Unfortunately, these abrupt disruptions may not have any obvious cause. The crash may just happen out of the blue and not repeat for days or even months. There may be a message on your screen that gives you a starting point - like a Trap E on a SCO Unix system or a Linux Oops - but even that may leave you scratching your head in puzzlement.

Here's where you need to stop and ask yourself a question. Do you really want to go through the trouble of saving, extracting, and analyzing a dump? Do you even have the tools and knowledge to do so?

Don't misunderstand: maybe you do have the skills and knowledge. For most of us, though, a dump is nearly useless. The trap and CPU information is enough to tell us the type of problem and whether it is repeating (which could indicate a specific driver), but dumps can't give us much more. I can read assembly language fairly well, and I know a little bit about hardware and vm and so on, but even if I had source code to help me, I wouldn't bother with it, at least not initially. You need intimate knowledge and experience to get very far. The trap/CPU registers give enough for my teeny brain to work with, thank you.
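As a rough illustration of the "is it repeating?" test, here is a minimal Python sketch that tallies panic notes by trap type and instruction pointer. The record layout and every value in observed_panics are invented for the example; in practice you would copy the real numbers off each panic screen by hand.

```python
from collections import Counter

# Hypothetical notes copied by hand from panic screens: (trap type, instruction pointer).
# These values are made up for illustration and do not come from a real system.
observed_panics = [
    ("0x0E", "0xF0123456"),
    ("0x0E", "0xF0123456"),
    ("0x0E", "0xF00ABCDE"),
    ("0x0D", "0xF0456789"),
]

def repeated_signatures(panics, minimum=2):
    """Return ((trap, address), count) pairs seen at least `minimum` times."""
    counts = Counter(panics)
    return [(sig, n) for sig, n in counts.most_common() if n >= minimum]

repeats = repeated_signatures(observed_panics)
for (trap, address), n in repeats:
    print(f"trap {trap} at {address}: seen {n} times")
```

The same trap at the same address over and over hints at one piece of code, possibly a driver. Panics scattered all over the place look more like hardware.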

Of course if you can get free help as I am getting with that VMware crash, that's different. But VMware is a product I bought. If the cause and effect weren't obvious, what could I do? Well, with Apple I could take it to a Genius Bar and they might be able to help me, but is anyone going to directly help you with some old SCO Unix or Linux system? Probably not - at least not for free.

Nothing wrong with trying to get free help, of course.

Unless you are wasting your time, and that's what the rest of this article is about: when NOT to waste your time.

I have included links to helpful kernel debugging articles later.

Has anything changed recently?

In my VMware crash, something definitely did change. VMware was upgraded. Let's take a not so hypothetical case where a system has been running for years, doing the same old boring stuff with the same old hardware and the same old operating system and suddenly it starts crashing once or twice a week.

That's almost certainly hardware. It may be aggravated by a badly written driver, but the cause is not likely to be software.

What hardware? Well, it could be almost anything, but without even having anything to go on and not having a bit of information from the panic, I'll bet RAM and I'll win that bet more often than I'll lose it.

The nice thing about that is that RAM is an easy thing to replace. It may cost you some money and some down time, but you don't have to reinstall any programs.

If it's easy to do, why not do it? If you have a spare of the same model NIC as is in the machine, changing it is one screw, right? Don't change that at the same time as you change RAM, though.

In fact, don't change anything yet. You've opened the machine; is it filled with dust? Dust insulates, heat builds, chips can malfunction. Maybe cleaning it out will help. Check for loose cables, too.

Do you have a spare box that is similar enough that you could transplant the hard drive(s)? If that stops the panic, great - you haven't identified the cause, but at least it was a quick fix. If it still panics or you don't have that luxury, we will keep looking.

Stuff you can swap

A power supply is usually an easy and inexpensive swap and a definite suspect for crashes and freezes. So are disk cables.

Swapping the motherboard itself is more time consuming. Before trying that, see if it has jumpers for voltage. Maybe moving up 0.2 volts fixes the problem? If so, a new motherboard is definitely in your future.

Also, heat dissipation was not always the best on older equipment. A bigger fan might help - heat causes problems! I've temporarily "cured" systems by opening the side panel and setting a table fan to blow right at it all day and night. Obviously temporary, but it got us by.

SCSI Devices

SCSI hard drives and tapes can cause strange and intermittent problems. The cause is sometimes a matter of poor SCSI termination: get the system doing too much and it gets flaky.

I've heard of LUN probing causing panics. Adaptec had that as a BIOS setting; turning it off would stop panics.

Unused devices

Does the machine have an old serial card that you no longer use? A spare hard drive?

Get rid of them. Maybe all they are doing is sucking a little extra power, but maybe that's all it takes to give you grief.

Corrupt file?

Reading or writing a certain file causes a panic.

It's really a corrupt file system and a full fsck (fsck -ofull on old SCO Unix) should find it and fix it. Check bad disk blocks, of course. Removing the file will likely just cause the problem to turn up elsewhere.

Panic messages

We already know that a trap E indicates memory or a driver and that other traps can point us in different directions. What about more specific messages like "cannot exec /etc/init"?

That could be exactly what it says, but it could also be broken libraries. Remember that a panicking kernel is not in a state to go digging around very much - it doesn't trust itself to do much in the way of forensics.

Deliberately panicking

Why on earth would you want to do that?

Usually it's to test dump recovery or train new operators. How you do that varies. For Linux, see Linux Crash HOWTO. SCO had various methods; for example you could uncomment the "mm panic" line in /etc/conf/node.d/mm (5.0.4 +). Relinking a kernel caused /dev/panic to be created and reading or writing that would cause an immediate panic.

Another way on SCO is "scodb". To learn about scodb, you will want to read "Gathering information when SCO OpenServer 5.0.x system panics but does not produce a valid dump or the system hangs".

Edit /etc/conf/sdevice.d/scodb to change the N to Y. Then cd to /etc/conf/cf.d and run ./link_unix and reboot to enable scodb.

After enabling scodb in the kernel, relinking and rebooting, you enter scodb (on a character mode screen only - NOT in a GUI xterm) with either ctrl-x (pre 5.0.6) or ctrl-alt-d.

Then, your two commands are:

sysdump()
and panic()

For Mac OS X, see Kernel Core Dumps - Testing Your Configuration.

SCO Panic dumps

When SCO Unix panicked, it would write dots across the screen as it saved the dump file. There would be exactly 80 dots, so each dot represented 1/80th of your image. Therefore, the bigger your kernel, the longer that would take.
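If you are sitting there watching the dots, the arithmetic is simple enough to sketch. This assumes the dots arrive at a steady rate, which a misbehaving disk will not give you; the function name and the sample numbers are mine, purely for illustration.

```python
TOTAL_DOTS = 80  # from the text above: a complete dump is exactly 80 dots

def dump_progress(dots_seen, elapsed_seconds):
    """Return (fraction written, estimated seconds remaining), assuming a steady rate."""
    if not 0 < dots_seen <= TOTAL_DOTS:
        raise ValueError("dots_seen must be between 1 and 80")
    fraction = dots_seen / TOTAL_DOTS
    seconds_per_dot = elapsed_seconds / dots_seen
    remaining = (TOTAL_DOTS - dots_seen) * seconds_per_dot
    return fraction, remaining

# Illustrative numbers only: 20 dots have appeared after one minute.
fraction, remaining = dump_progress(dots_seen=20, elapsed_seconds=60)
print(f"{fraction:.0%} written, roughly {remaining:.0f} seconds to go")
```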

But remember that the kernel dearly wants to write that image. Suppose the disks aren't working right? According to this "Panic save very sloooooow" post by a SCO engineer, the system would still try:

The dump could be slow due to the nature of the panic: if it was
caused by a problem in the disk drivers in the kernel, they might
be in an insane state. Or there may be a problem with interrupt
delivery. I believe the kernel dump code uses some tricks to
continue writing even if interrupts are not being received from
the host adapter to which the dump is being written; this would
probably be slow (I'm not sure, have never seen it in action).
 

Don't neglect patches and upgrades

Got updates?

This is a possible fix, but probably doesn't apply to a system that has suddenly developed a problem. I remember a case where it did, though. An old SCO system, running fine for some time but then they added an lpd printer. After that, seemingly random crashes. The fix was simply a patch - oss497C in that case.

Debugging SCO kernels

You really want to do this? OK, here are some helpful newsgroup posts:

double panic smp osr5 crash dump interpretation

OSR5 double panic smp hyperthreading p4

crash debugging, video interrupts, high resolution timers

Some notes on dmesg

Various application memory problems

Meaning of CheckSum: 0x02 * * Invalid * * in hw output

pfdfreeq ( re: vhand )

Panics in_freemsg & tcp_reass functions

Less swap than RAM

Debugging Mac OS X

Mac dumps a helpful file upon a crash. Here's the one I got with that VMware crash:

Sun Jul 21 06:43:18 2013
panic(cpu 2 caller 0xffffff802998f196): "m_copyback0: read-only"@/SourceCache/xnu/xnu-2050.24.15/bsd/kern/uipc_mbuf.c:5309
Backtrace (CPU 2), Frame : Return Address
0xffffff8189202ca0 : 0xffffff802961d626
0xffffff8189202d10 : 0xffffff802998f196
0xffffff8189202d90 : 0xffffff802998ebe6
0xffffff8189202db0 : 0xffffff80297d0ef8
0xffffff8189202e10 : 0xffffff8029970093
0xffffff8189202e30 : 0xffffff7fa9c99ffa
0xffffff8189202eb0 : 0xffffff802973a2de
0xffffff8189202fd0 : 0xffffff80297d0388
0xffffff81892032c0 : 0xffffff80297cda64
0xffffff81892032e0 : 0xffffff80297dd36c
0xffffff81892033c0 : 0xffffff80297d8138
0xffffff8189203a70 : 0xffffff80297cb11f
0xffffff8189203ac0 : 0xffffff80297cb3a0
0xffffff8189203cd0 : 0xffffff80297caf7d
0xffffff8189203cf0 : 0xffffff802975f0bc
0xffffff8189203d20 : 0xffffff802974146b
0xffffff8189203db0 : 0xffffff802973f611
0xffffff8189203de0 : 0xffffff802973983c
0xffffff8189203e90 : 0xffffff802973fa3e
0xffffff8189203fb0 : 0xffffff80296b3137
      Kernel Extensions in backtrace:
         com.deterministicnetworks.driver.dne(1.0.12)[CAF2B9C0-4B59-A819-822B-CC8B3B6BE138]@0xffffff7fa9c96000->0xffffff7fa9cd1fff
            dependency: com.deterministicnetworks.driver.dniregistry(1.0.4)[40D91812-180F-4AE9-10A9-CC1A0D7A9EDB]@0xffffff7fa9c81000

BSD process name corresponding to current thread: kernel_task
Mac OS version:
12E55
 

I found that in /Library/Logs/DiagnosticReports/.

What to do with it?

Debugging Mac Kernel links

But I'm not up to that, thank you.
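Short of real kernel debugging, you can at least pull the headline facts out of a report like that one. The following Python sketch is only a guess at a workable approach: the regular expressions are written against the layout of the report shown above and may not match reports from other OS X versions, and the summarize function is my own name, not part of any Apple tool.

```python
import re

# A shortened excerpt of the report shown above; only the lines this sketch reads.
SAMPLE_REPORT = '''\
panic(cpu 2 caller 0xffffff802998f196): "m_copyback0: read-only"@/SourceCache/xnu/xnu-2050.24.15/bsd/kern/uipc_mbuf.c:5309
      Kernel Extensions in backtrace:
         com.deterministicnetworks.driver.dne(1.0.12)[CAF2B9C0-4B59-A819-822B-CC8B3B6BE138]@0xffffff7fa9c96000->0xffffff7fa9cd1fff
            dependency: com.deterministicnetworks.driver.dniregistry(1.0.4)[40D91812-180F-4AE9-10A9-CC1A0D7A9EDB]@0xffffff7fa9c81000
'''

# The quoted message that follows "panic(...):" on the first line.
PANIC_RE = re.compile(r'^panic\(.*?\):\s*"([^"]*)"', re.MULTILINE)
# Indented "bundle.id(version)[UUID]" lines, with or without a "dependency:" prefix.
KEXT_RE = re.compile(
    r'^[ \t]+(?:dependency:[ \t]+)?([\w.]+)\(([^)]+)\)\[', re.MULTILINE
)

def summarize(report):
    """Return (panic message or None, list of (extension name, version))."""
    match = PANIC_RE.search(report)
    message = match.group(1) if match else None
    return message, KEXT_RE.findall(report)

message, kexts = summarize(SAMPLE_REPORT)
print("panic message:", message)
for name, version in kexts:
    print(f"  extension: {name} {version}")
```

For this report, the only extensions it names are com.deterministicnetworks.driver.dne and its dependency. That doesn't prove anything by itself, but it tells you where to start asking questions.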

Linux debugging

A quick overview of RedHat Linux kernel crash dump analysis

Analyzing Linux kernel crash dumps with crash - The one tutorial that has it all

Debugging application core files

When a process crashes, it dumps a core file. If you have source and a good understanding of programming, you might be able to use a debugger to find the problem. If not, well, good luck to you. It's not that hard to find out where it went wrong, but figuring out why can be hellish. See "trace" in Unix and Linux Troubleshooting Tips.

Also HowTo: Debug Crashed Linux Application Core Files Like A Pro.
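One caveat before you go hunting for a core file that isn't there: on many systems the core size limit defaults to zero, and then nothing gets written at all. Here is a small Python sketch that checks the limit for the current process. It uses the standard resource module, which exists only on Unix-like systems; the helper names are mine.

```python
import resource

def core_dump_limits():
    """Return the (soft, hard) core file size limits for this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)

def describe(limit):
    """Render a limit value as text, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

soft, hard = core_dump_limits()
print(f"core size limit: soft={describe(soft)}, hard={describe(hard)}")
if soft == 0:
    print("core files are disabled for this process and its children")
```

From a shell, "ulimit -c" reports the same soft limit.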




© Anthony Lawrence


