I was reminded of this subject recently when my iMac crashed and rebooted while I was trying to get a snapshot of a firewall screen to send to a customer. The firewall was running inside VMware Fusion, but I'm not sure what actually caused the crash. Someone from VMware looked at the crash log and said it might have been Citrix that was at fault. We'll find out eventually, but my money is on it being Fusion because the crashes began right after an upgrade.
Applications can cause crashes. They usually do not because they usually aren't intimately involved with the operating system. Whether it's Linux, Windows or even ancient SCO Unix, a panic crash is not usually because of a malfunctioning application - applications that even try to touch things they should not are immediately killed off and that's the end of that.
However, virtualization products like Fusion do need deeper hooks, so I'm not arguing against it being at fault. Besides, it always crashes when I use Fusion and never otherwise. It's at different times and I'll be doing different things, but Fusion is involved somehow.
If only every panic crash were so easy to pin down.
Unfortunately, these abrupt disruptions may not have any obvious cause. The crash may just happen out of the blue and not repeat for days or even months. There may be a message on your screen that gives you a starting point - like a Trap E on a SCO Unix system or a Linux Oops but even that may leave you scratching your head in puzzlement.
Here's where you need to stop and ask yourself a question. Do you really want to go through the trouble of saving, extracting, and analyzing a dump? Do you even have the tools and knowledge to do so?
Don't misunderstand: maybe you do have the skills and knowledge. For most of us, though, a dump is nearly useless. The trap and CPU information is enough to tell us the type of problem and whether it is repeating (which could indicate a specific driver), but dumps can't give us much more. I can read assembly language fairly well, and I know a little bit about hardware and vm and so on, but even if I had source code to help me, I wouldn't bother with it, at least not initially. You need intimate knowledge and experience to get very far. The trap/CPU registers give enough for my teeny brain to work with, thank you.
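As a rough illustration of how far the trap number alone can take you: on Intel x86 hardware the trap number in a panic message generally corresponds to the CPU exception vector, so a "Trap E" is vector 0x0E, a page fault. The sketch below is just a lookup of a handful of common vectors. The table is deliberately incomplete, and the name `describe_trap` is made up for this example; it is not something any operating system provides.

```python
# A few x86 CPU exception vectors as they commonly show up in panic messages.
# This is a partial table for illustration, not a complete list.
X86_TRAPS = {
    0x00: "Divide error",
    0x06: "Invalid opcode",
    0x08: "Double fault",
    0x0D: "General protection fault",
    0x0E: "Page fault",
}

def describe_trap(trap):
    """Describe a trap number given as an int or as a hex string like 'E'."""
    number = int(trap, 16) if isinstance(trap, str) else trap
    name = X86_TRAPS.get(number, "not in this short table")
    return f"trap 0x{number:02X}: {name}"

print(describe_trap("E"))
print(describe_trap(0x0D))
```

A page fault inside the kernel means it touched memory it had no valid mapping for, which is why a Trap E usually sends you looking at RAM or at a driver.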
Of course if you can get free help as I am getting with that VMware crash, that's different. But VMware is a product I bought. If the cause and effect weren't obvious, what could I do? Well, with Apple I could take it to a Genius Bar and they might be able to help me, but is anyone going to directly help you with some old SCO Unix or Linux system? Probably not - at least not for free.
Nothing wrong with trying to get free help, of course.
Unless you are wasting your time, and that's what the rest of this article is about: when NOT to waste your time.
I have included links to helpful kernel debugging articles later.
In my VMware crash, something definitely did change. VMware was upgraded. Let's take a not so hypothetical case where a system has been running for years, doing the same old boring stuff with the same old hardware and the same old operating system and suddenly it starts crashing once or twice a week.
That's almost certainly hardware. It may be aggravated by a badly written driver, but the cause is not likely to be software.
What hardware? Well, it could be almost anything, but without even having anything to go on and not having a bit of information from the panic, I'll bet RAM and I'll win that bet more often than I'll lose it.
The nice thing about that is that RAM is an easy thing to replace. It may cost you some money and some down time, but you don't have to reinstall any programs.
If it's easy to do, why not do it? If you have a spare of the same model NIC as is in the machine, changing it is one screw, right? Don't change that at the same time as you change RAM, though.
In fact, don't change anything yet. You've opened the machine; is it filled with dust? Dust insulates, heat builds, chips can malfunction. Maybe cleaning it out will help. Check for loose cables, too.
Do you have a spare box that is similar enough that you could transplant the hard drive(s)? If that stops the panic, great - you haven't identified the cause, but at least it was a quick fix. If it still panics or you don't have that luxury, we will keep looking.
A power supply is usually an easy and inexpensive swap and a definite suspect for crashes and freezes. So are disk cables.
Swapping the motherboard itself is more time consuming. Before trying that, see if it has jumpers for voltage. Maybe moving up 0.2 volts fixes the problem? If so, a new motherboard is definitely in your future.
Also, heat dissipation was not always the best on older equipment. A bigger fan might help - heat causes problems! I've temporarily "cured" systems by opening the side panel and setting a table fan to blow right at it all day and night. Obviously temporary, but it got us by.
SCSI hard drives and tapes can cause strange and intermittent problems. The cause is sometimes a matter of poor SCSI termination: get the system doing too much and it gets flaky.
I've heard of LUN probing causing panics. Adaptec had that as a BIOS setting; turning it off would stop panics.
Does the machine have an old serial card that you no longer use? A spare hard drive?
Get rid of them. Maybe all they are doing is sucking a little extra power, but maybe that's all it takes to give you grief.
Reading or writing a certain file causes a panic.
It's really a corrupt file system and a full fsck (fsck -ofull on old SCO Unix) should find it and fix it. Check bad disk blocks, of course. Removing the file will likely just cause the problem to turn up elsewhere.
We already know that a trap E indicates memory or a driver and that other traps can point us in different directions. What about more specific messages like "cannot exec /etc/init"?
That could be exactly what it says, but it could also be broken libraries. Remember that a panicking kernel is not in a state to go digging around very much - it doesn't trust itself to do much in the way of forensics.
Why on earth would you want to do that?
Usually it's to test dump recovery or train new operators. How you do that varies. For Linux, see Linux Crash HOWTO. SCO had various methods; for example you could uncomment the "mm panic" line in /etc/conf/node.d/mm (5.0.4 +). Relinking a kernel caused /dev/panic to be created and reading or writing that would cause an immediate panic.
Another way on SCO is "scodb". To learn about scodb, you will want to read "Gathering information when SCO OpenServer 5.0.x system panics but does not produce a valid dump or the system hangs".
Edit /etc/conf/sdevice.d/scodb to change the N to Y. Then cd to /etc/conf/cf.d and run ./link_unix and reboot to enable scodb.
After enabling scodb in the kernel, relinking and rebooting, you enter scodb (on a character mode screen only - NOT in a GUI xterm) with either ctrl-x (pre 5.0.6) or ctrl-alt-d.
Then, your two commands are:
For Mac OS X, see "Kernel Core Dumps - Testing Your Configuration".
When SCO Unix panicked, it would write dots across the screen as it saved the dump file. There would be exactly 80 dots, so each dot represented 1/80th of your image. Therefore, the bigger your kernel image, the longer that would take.
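To put numbers on that, here is a small worked sketch. The 256 MB image size and the 40 seconds of elapsed time are made-up figures for illustration; only the 80-dot rule comes from the behavior described above, and the time estimate assumes the dots keep arriving at a steady rate, which a sick disk subsystem may not manage.

```python
DOTS_TOTAL = 80  # SCO wrote exactly 80 dots per dump

def dump_progress(dots_seen, image_mb):
    """Return (megabytes written so far, fraction complete) after dots_seen dots."""
    fraction = dots_seen / DOTS_TOTAL
    return image_mb * fraction, fraction

def estimate_remaining_seconds(dots_seen, elapsed_seconds):
    """Extrapolate the time left, assuming dots keep arriving at the same rate."""
    if dots_seen == 0:
        raise ValueError("need at least one dot to estimate")
    per_dot = elapsed_seconds / dots_seen
    return per_dot * (DOTS_TOTAL - dots_seen)

# Hypothetical case: 20 dots on screen after 40 seconds, 256 MB image.
written, fraction = dump_progress(20, 256)
print(f"{written:.1f} MB written, {fraction:.0%} done")
print(f"about {estimate_remaining_seconds(20, 40):.0f} seconds left")
```

If the dots slow to a crawl partway through, that in itself is a hint that the disk or its controller is part of the problem.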
But remember that the kernel dearly wants to write that image. Suppose the disks aren't working right? According to a post titled "Panic save very sloooooow" by a SCO engineer, the system would still try:
The dump could be slow due to the nature of the panic: if it was caused by a problem in the disk drivers in the kernel, they might be in an insane state. Or there may be a problem with interrupt delivery. I believe the kernel dump code uses some tricks to continue writing even if interrupts are not being received from the host adapter to which the dump is being written; this would probably be slow (I'm not sure, have never seen it in action).
This is a possible fix, but probably doesn't apply to a system that has suddenly developed a problem. I remember a case where it did, though. An old SCO system, running fine for some time but then they added an lpd printer. After that, seemingly random crashes. The fix was simply a patch - oss497C in that case.
You really want to do this? OK, here are some helpful newsgroup posts:
Mac dumps a helpful file upon a crash. Here's the one I got with that VMware crash:
Sun Jul 21 06:43:18 2013
panic(cpu 2 caller 0xffffff802998f196): "m_copyback0: read-only"@/SourceCache/xnu/xnu-2050.24.15/bsd/kern/uipc_mbuf.c:5309
Backtrace (CPU 2), Frame : Return Address
0xffffff8189202ca0 : 0xffffff802961d626
0xffffff8189202d10 : 0xffffff802998f196
0xffffff8189202d90 : 0xffffff802998ebe6
0xffffff8189202db0 : 0xffffff80297d0ef8
0xffffff8189202e10 : 0xffffff8029970093
0xffffff8189202e30 : 0xffffff7fa9c99ffa
0xffffff8189202eb0 : 0xffffff802973a2de
0xffffff8189202fd0 : 0xffffff80297d0388
0xffffff81892032c0 : 0xffffff80297cda64
0xffffff81892032e0 : 0xffffff80297dd36c
0xffffff81892033c0 : 0xffffff80297d8138
0xffffff8189203a70 : 0xffffff80297cb11f
0xffffff8189203ac0 : 0xffffff80297cb3a0
0xffffff8189203cd0 : 0xffffff80297caf7d
0xffffff8189203cf0 : 0xffffff802975f0bc
0xffffff8189203d20 : 0xffffff802974146b
0xffffff8189203db0 : 0xffffff802973f611
0xffffff8189203de0 : 0xffffff802973983c
0xffffff8189203e90 : 0xffffff802973fa3e
0xffffff8189203fb0 : 0xffffff80296b3137
   Kernel Extensions in backtrace:
      com.deterministicnetworks.driver.dne(1.0.12)[CAF2B9C0-4B59-A819-822B-CC8B3B6BE138]@0xffffff7fa9c96000->0xffffff7fa9cd1fff
         dependency: com.deterministicnetworks.driver.dniregistry(1.0.4)[40D91812-180F-4AE9-10A9-CC1A0D7A9EDB]@0xffffff7fa9c81000
BSD process name corresponding to current thread: kernel_task
Mac OS version: 12E55
I found that in /Library/Logs/DiagnosticReports/.
What to do with it?
But I'm not up to that, thank you.
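Short of real kernel debugging, there is one cheap thing you can pull out of a report like that: which return addresses in the backtrace fall inside the load range of a named kernel extension. The sketch below does that with two regular expressions. The excerpt is copied from the report above (only three of the frames, to keep it short); the function name `kexts_in_backtrace` and the patterns are my own guesses at the layout, based only on this one report, so other OS versions may need adjustments.

```python
import re

# Excerpt of the panic report shown above (three frames plus the kext line).
REPORT = """\
0xffffff8189202e10 : 0xffffff8029970093
0xffffff8189202e30 : 0xffffff7fa9c99ffa
0xffffff8189202eb0 : 0xffffff802973a2de
   Kernel Extensions in backtrace:
      com.deterministicnetworks.driver.dne(1.0.12)[CAF2B9C0-4B59-A819-822B-CC8B3B6BE138]@0xffffff7fa9c96000->0xffffff7fa9cd1fff
"""

# "frame address : return address" pairs from the backtrace.
FRAME_RE = re.compile(r"(0x[0-9a-f]{16}) : (0x[0-9a-f]{16})")
# name(version)[UUID]@start->end for each extension listed with a load range.
KEXT_RE = re.compile(
    r"([\w.]+)\(([\d.]+)\)\[[0-9A-F-]+\]@(0x[0-9a-f]+)->(0x[0-9a-f]+)"
)

def kexts_in_backtrace(report):
    """Map each listed kext to the return addresses inside its load range."""
    returns = [int(ret, 16) for _frame, ret in FRAME_RE.findall(report)]
    hits = {}
    for name, version, start, end in KEXT_RE.findall(report):
        lo, hi = int(start, 16), int(end, 16)
        hits[f"{name} {version}"] = [hex(a) for a in returns if lo <= a <= hi]
    return hits

for kext, addrs in kexts_in_backtrace(REPORT).items():
    print(kext, addrs)
```

All this tells you is whose code was on the stack when the kernel gave up. It doesn't prove that extension is the culprit, but it is a name you can search for or hand to a vendor.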
When a process crashes, it dumps a core file. If you have source and a good understanding of programming, you might be able to use a debugger to find the problem. If not, well, good luck to you. It's not that hard to find out where it went wrong, but figuring out why can be hellish. See "trace" in "Unix and Linux Troubleshooting Tips".
More Articles by Anthony Lawrence © 2013-08-06 Anthony Lawrence