While I was messing around on my computer, I noticed that every once in a while a program would just up and die for seemingly no reason. Then I noticed that when compiling big jobs, GCC was segfaulting an awful lot.
Now, I know that when almost-random stuff goes wrong like that, and you're using what should be a fairly stable OS, the likely culprit is flaky hardware. And of all the flaky hardware, the thing I hate the most is bad RAM modules, so naturally that's what was most likely wrong.
Since this was Debian, I downloaded and installed memtest86 by typing this into the console: apt-get install memtest86
Memtest86 is a very nice memory testing program for x86 machines. If you have a problem with memory hardware then this guy will find it. It'll check the L1 and L2 cache, it will check your memory modules and anything else that ends up as 'RAM' in your system.
How it works is that it boots up your computer, finds all the available memory, and then writes different patterns of bits and copies them from memory address to memory address in different ways. Each of these is a 'test', and it performs several of them on your computer. It takes a while to complete the entire battery of tests, and once it's finished it simply starts over again at test one. It's best to let it run for a few hours because memory problems can be very intermittent.
If it finds any errors then it will tell you what memory range the test failed at.
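To give a rough feel for what one of these pattern tests does, here is a toy sketch in Python. It is not how memtest86 is actually written (the real thing runs on bare metal against physical RAM, and it also moves data between addresses, which this sketch does not do). The FaultyMemory class, the stuck-bit fault, and the four patterns are all made up for illustration.

```python
class FaultyMemory:
    """Simulated RAM where one cell has a bit stuck at 0 (hypothetical fault model)."""

    def __init__(self, size, bad_addr, stuck_bit):
        self.cells = bytearray(size)
        self.bad_addr = bad_addr
        self.stuck_mask = 0xFF & ~(1 << stuck_bit)

    def write(self, addr, value):
        if addr == self.bad_addr:
            value &= self.stuck_mask  # the stuck bit can never be set
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, pattern):
    """Write one pattern to every cell, read it back, return addresses that differ."""
    for addr in range(size):
        mem.write(addr, pattern)
    return [addr for addr in range(size) if mem.read(addr) != pattern]


def run_battery(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Run each pattern in turn and collect every failing address."""
    failures = set()
    for pattern in patterns:
        failures.update(pattern_test(mem, size, pattern))
    return sorted(failures)


if __name__ == "__main__":
    mem = FaultyMemory(size=1024, bad_addr=0x155, stuck_bit=2)
    print([hex(addr) for addr in run_battery(mem, 1024)])
```

Notice that a single pattern is not enough: a bit stuck at 0 is invisible to any pattern that writes 0 to that bit anyway, which is one reason a memory tester runs a whole battery of tests instead of just one.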
Memtest86 is GREAT if you're building a new computer and need to test the RAM. This is especially important with AMD64 machines and their touchy on-die memory controllers. Sometimes reseating the RAM fixes problems, some memory sticks work in some motherboards and not in others, sometimes simply moving the sticks to different slots fixes problems, and other times you need to underclock the machine to get it stable. Often you just have bad RAM and it needs to be replaced.
When apt-get installed memtest86, it copied it to my /boot directory as "/boot/memtest86.bin", then it modified my GRUB configuration file, /boot/grub/menu.lst, and added this entry:
That way I could simply reboot, select the memtest86 entry in my grub boot-time menu and then the program would run.
However, this won't work for all machines. There are several other ways to run memtest86. On Windows machines you can take a floppy image and make a bootable floppy with dd or rawrite. There are also CD-ROM ISO images you can use to make a bootable CD.
So I rebooted the desktop, selected the memtest86 entry and let it go for a couple of hours.
As it turns out, the main node had a clearly bad section of RAM! Now this sucked, because I had a gig of RAM in that machine, and normally the fix would be to toss a 512 MB memory module (any warranty on it is long gone), which is pretty expensive for me to do.
(That'll teach me to use an anti-static grounding bracelet in the future when I assemble machines.)
Normally this would be my only choice, but with Linux there are a couple of tricks you can use to get a perfectly stable machine out of a RAM module that has one clearly bad section and nothing else wrong with it. I wouldn't do it on a production server, but on my little home desktop it doesn't make much of a difference. Plus it was just a small section that was bad, with no other issues as far as I could tell.
A couple of these tricks revolve around kernel patches called BadRAM and BadMEM. Of the two, BadMEM provides a lot more features, but BadRAM seemed simple and 'good enough'. (BadMEM was originally based on BadRAM.)
Basically, how they work is that they take the bad section of RAM and make it part of protected kernel memory space. This makes sure that no program will accidentally access it, so that particular section of the RAM module might as well not exist. It's a surprisingly effective and safe fix, and it only adds a couple dozen lines of code to the kernel.
The downside is that if the bad section is one that is naturally occupied by the kernel at boot time, then you're probably SOL, because it will corrupt the kernel and probably make your system unbootable. Sometimes you can work around that by moving the memory modules around, or by building a very small kernel with lots of modules instead of built-ins.
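Here is a toy model of the idea in Python, just to make it concrete. It is not the BadRAM patch itself. The function names, the frame numbers, and the notion of a fixed range of frames holding the kernel image are all assumptions I made up for the sketch.

```python
def usable_frames(total_frames, bad_frames):
    """Page frames the (toy) allocator may hand out: everything not marked bad."""
    return [frame for frame in range(total_frames) if frame not in bad_frames]


def overlaps_kernel(bad_frames, kernel_first, kernel_last):
    """True if any bad frame lies where the kernel image itself gets loaded.

    In that case, marking the frame as reserved comes too late to help,
    because the kernel is already sitting on top of the bad memory.
    """
    return any(kernel_first <= frame <= kernel_last for frame in bad_frames)


if __name__ == "__main__":
    bad = {5}  # hypothetical bad page frame
    print(usable_frames(8, bad))
    print(overlaps_kernel(bad, kernel_first=0, kernel_last=2))
```

The second function is the downside described above: the trick only works when the bad frames fall outside the memory the kernel has to use before it gets a chance to reserve them.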
So I rebooted back into Linux, downloaded the patch for my specific kernel, built it (it took a couple of tries) and rebooted into memtest86. My particular version (I'm not sure if this is part of all memtest86 versions) can change its error output so that, instead of simply stating the affected memory range, it prints it in a form that I can easily use as kernel parameters for a BadRAM-patched kernel.
After about 5 minutes of running, memtest86 spit out:
and so on and so forth. I let it run for another 45 minutes or so, but it didn't report any other bad sections so I rebooted.
In grub I hit "e" to edit my menu entry, selected the kernel line, hit "e" again, then modified my kernel entry from this:
kernel /vmlinuz-2.4.22-1.2199.nptl-ssi-686-smp devfs=mount hdb=ide-scsi hdc=ide-scsi root=/dev/hda2 ro
to look like this:
kernel /vmlinuz-2.4.22-1.2199.nptl-ssi-686-smp devfs=mount badram=0x13495568,0xfffffffc hdb=ide-scsi hdc=ide-scsi root=/dev/hda2 ro
I hit 'return' and then 'b' to boot. Once it booted up I made the change permanent by editing my boot config at /boot/grub/menu.lst and now I have a perfectly stable machine once again.
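If you're curious what that badram= pair actually covers, here is a small Python sketch. It assumes the address,mask convention as I understand it from the BadRAM documentation: an address counts as bad when it agrees with the fault address on every bit that is set in the mask. The helper names and the 4 KiB page size are my own assumptions, not part of the patch.

```python
PAGE_SHIFT = 12  # assumes 4 KiB pages, the usual size on 32-bit x86


def parse_badram(arg):
    """Split 'badram=ADDR,MASK[,ADDR,MASK...]' into (addr, mask) integer pairs."""
    values = [int(field, 16) for field in arg.split("=", 1)[1].split(",")]
    if len(values) % 2:
        raise ValueError("expected address,mask pairs")
    return list(zip(values[0::2], values[1::2]))


def is_bad(addr, pairs):
    """True if addr matches any fault address on all bits set in its mask."""
    return any((addr & mask) == (fault & mask) for fault, mask in pairs)


def page_frame(addr):
    """Page frame number containing a physical address."""
    return addr >> PAGE_SHIFT


if __name__ == "__main__":
    pairs = parse_badram("badram=0x13495568,0xfffffffc")
    for addr in range(0x13495564, 0x13495570, 4):
        print(hex(addr), is_bad(addr, pairs), hex(page_frame(addr)))
```

With the pair above, the mask leaves only the two lowest bits free, so it matches four consecutive byte addresses, all of them inside a single page frame.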
I figure this would be especially useful for older machines that you use as a firewall, a simple e-mail server, or something like that, and that have become unstable due to memory errors. Or maybe for an Intel Pentium III (or was it 4?) machine that uses RAMBUS-style RAM, which is incompatible with the much more common (and cheap) SDRAM and DDR SDRAM types.