SME Server software RAID failure, GRUB 0x10 error
An SME customer called this morning saying that his system had
apparently stopped working (web pages and mail were unavailable)
and therefore he had rebooted. Unfortunately, the grub boot would
start to load the SME kernel and then fail with a 0x10 message.
This was an "E-Machine", which was a choice I remember being
unhappy about when it was first installed, but this customer is
very price conscious and ignored my advice that better hardware
would be smarter. Oh well.
As I had nothing better to do (yeah, right), I hopped in my car
and drove down to RI to see this first hand. I should have looked
up the error before getting in my car, but it was early and I
hadn't had enough coffee yet. If I HAD looked it up, I would have
quickly found this (from
"Internal error". This code is generated by the sector read
routine of the LILO boot loader whenever an internal
inconsistency is detected. This might be caused by corrupt
files. Reinstall IPCop or recreate the boot media.
"Illegal command". This shouldn't happen, but if it does,
it may indicate an attempt to access a disk which is not
supported by the BIOS.
"Address mark not found". This usually indicates a media
problem. Try again several times.
"Write-protected disk". This should only occur on write
operations.
"Sector not found". This typically indicates bad disk
parameters in the IPCop PC's BIOS. If you are booting from
a large IDE disk, you should check whether the IPCop PC's
BIOS can handle the disk.
"Change line active". This should be a transient error. Try
booting a second time.
"Invalid initialization". The BIOS failed to properly
initialize the disk controller. You should control the BIOS
setup parameters. A warm boot might help, too.
"DMA overrun". This shouldn't happen. Try booting again.
"Invalid media". This shouldn't happen and might be caused
by a media error. Try booting again.
"CRC error". A media error has been detected. Try booting
several times, and if all else fails, replace the media.
"ECC correction successful". A read error occurred but was
corrected. LILO does not recognize this condition and aborts
the load process anyway. A second load attempt should succeed.
"Controller error". This shouldn't happen.
"Seek failure". This might be a media problem. Try booting
again.
"Disk timeout". The disk or the drive isn't ready. Either
the media is bad or the disk isn't spinning. If you're
booting from a floppy, you might not have closed the drive
door. Otherwise, trying to boot again might help.
"BIOS error". This shouldn't happen. Try booting again.
Well, I felt it had to be hardware, so that would have just
confirmed it, and I did feel that it was going to be easier to
track this down on-site than trying to work with the client over
the phone. Providence isn't very far away, so...
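For quick reference, here is a minimal sketch of that list as a lookup table. The numeric codes are an assumption: they are the values the LILO documentation pairs with these messages, they do not appear in the text quoted above, and they should be checked against your own boot loader's documentation. The names `LILO_DISK_ERRORS` and `describe` are made up for this example.

```python
# Assumption: code numbers as given in the LILO documentation for these
# messages. They are not part of the list quoted in this article.
LILO_DISK_ERRORS = {
    0x00: "Internal error",
    0x01: "Illegal command",
    0x02: "Address mark not found",
    0x03: "Write-protected disk",
    0x04: "Sector not found",
    0x06: "Change line active",
    0x07: "Invalid initialization",
    0x08: "DMA overrun",
    0x0C: "Invalid media",
    0x10: "CRC error",
    0x11: "ECC correction successful",
    0x20: "Controller error",
    0x40: "Seek failure",
    0x80: "Disk timeout",
    0xBB: "BIOS error",
}


def describe(code):
    """Return the message for a disk error code, or a fallback string."""
    return LILO_DISK_ERRORS.get(code, "Unknown error code")


if __name__ == "__main__":
    # The code this customer's machine was reporting.
    print(f"0x{0x10:02X}: {describe(0x10)}")
```

Under that assumption, 0x10 lands on the "CRC error" entry, a media error, which fits a failing disk.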
When I arrived on site, I just quickly confirmed that the
symptoms were as told to me. Too many times I have had someone
tell me one thing and found something entirely different when
on-site, but this time the error was accurately reported. Still
lacking sufficient coffee, I sat down at a Windows machine and
tried to call up Google.
Well, duh! The SME is the gateway to the internet! No gateway,
no Internet, no Google. I shook my head in amazement and called
Mitel support. In a very few minutes, I had one of the regular
engineers on the phone. I explained that I would have looked this
up myself if I had turned on my brain before getting in my car, and
he laughed at me and did the search for me. In a few seconds, he
told me it was most likely hardware.
I asked the customer for last night's DVD (we run Microlite Edge
for backup here) but it wouldn't boot. That surprised me at the
time, though later I found out why. I then asked for the boot
recovery floppy we had created when the system was installed. That
wouldn't boot either, which was upsetting. Finally, I asked if he
had a recent Desktop Backup - he said yes, but when we tried to
find it on his Windows machines, there was none.
Oh boy. Just the way I wanted things to work out - no backups,
hardware boot error. Good thing it's only a 40 person office. Yes,
I'm being sarcastic.
Fearing the worst, I inserted the SME install CD and rebooted.
To my surprise, it saw the existing installation and offered to
upgrade it. What the heck - I let it try, and it completed
successfully. But the same 0x10 boot error came up. So, I booted
that CD again, and this time when it got to the point of offering
to upgrade, I did an ALT-F2 and had a shell prompt where I did a
"cd /mnt/sysimage" and took a look around. All data was apparently
intact, which meant that whatever hardware issue we had might be
isolated to the boot files. I also now realized why the Edge DVD
didn't boot: this is a software RAID system, which Edge can't
handle at the present time. We never told Edge to attempt a
bootable backup because it can't.
But knowing that it was RAID gave me hope. Examining
/proc/mdstat showed me:
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda3
262016 blocks [2/1] [_U]
md1 : active raid1 hda2
119684160 blocks [2/1] [_U]
md0 : active raid1 hda1
102208 blocks [2/1] [_U]
unused devices: <none>
The Mitel engineer explained that it should be showing [UU] for
each line, and that the [_U] indicated a RAID problem. At that
point, I decided we should shut down the machine and open it
up.
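The [UU] versus [_U] check is easy to automate. What follows is a minimal sketch, not a finished monitoring tool: it assumes the mdstat layout shown in the listing above (an "mdN :" header line followed by a line ending in a bracketed status string), and the function name `degraded_arrays` is made up for this example.

```python
import re

# Sample text copied from the /proc/mdstat listing in this article.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda3
262016 blocks [2/1] [_U]
md1 : active raid1 hda2
119684160 blocks [2/1] [_U]
md0 : active raid1 hda1
102208 blocks [2/1] [_U]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for arrays whose status contains '_'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # A header such as "md2 : active raid1 hda3" names the array.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends in something like [UU] or [_U];
        # each '_' marks a missing mirror member.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


if __name__ == "__main__":
    for name, status in sorted(degraded_arrays(SAMPLE_MDSTAT).items()):
        print(f"{name} is degraded: [{status}]")
```

On a live system you would pass in the contents of /proc/mdstat instead of the sample string. Run from cron with mail on non-empty output, a check like this would have flagged the dead mirror before the second symptom showed up.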
When we did that, I could immediately feel that the master IDE
drive was much hotter than the slave. The slave was warm, the
master was uncomfortably hot. Touching the top of it with my finger
made me feel I could blister my skin if I left it there long - it
was that hot. I removed it, changed the jumper on the slave to make
it the master, put the cable back, and buttoned the machine up. To
my relief, it rebooted.
That's not a guarantee with RAID. If the hardware problem had
caused data corruption prior to failing completely, the corruption
would have been mirrored to the slave. Fortunately that was not the
case.
So we were back up - short one hard drive, but up and running. I
asked the Mitel engineer if I needed to reinstall blades because of
the "upgrade", but he explained that it wouldn't overwrite newer
versions.
I then took a look at the Edge backup files - the backup had
been failing for the past 10 days. I chastised the customer for not
alerting me to that problem, but I realize that he's a busy guy and
probably had other things on his mind. I left the system doing a
Desktop Backup and advised the customer that they really should
consider better hardware for such a critical system.
© 2012-07-13 Tony Lawrence