APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

SME server software raid failure, grub 0x10 error

An SME customer called this morning saying that his system had apparently stopped working (web pages and mail were unavailable) and therefore he had rebooted. Unfortunately, the grub boot would start to load the SME kernel and then fail with a 0x10 message. This was an "E-Machine", which was a choice I remember being unhappy about when it was first installed, but this customer is very price-conscious and ignored my advice that better hardware would be smarter. Oh well.

As I had nothing better to do (yeah, right), I hopped in my car and drove down to RI to see this first hand. I should have looked up the error before getting in my car, but it was early and I hadn't had enough coffee yet. If I HAD looked it up, I would have quickly found this (from http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2003-08/0074.html):


  0x00
  "Internal error". This code is generated by the sector read
  routine of the LILO boot loader whenever an internal
  inconsistency is detected. This might be caused by corrupt
  files. Reinstall IPCop or recreate the boot media.

  0x01
  "Illegal command". This shouldn't happen, but if it does,
  it may indicate an attempt to access a disk which is not
  supported by the BIOS.

  0x02
  "Address mark not found". This usually indicates a media
  problem. Try again several times.

  0x03
  "Write-protected disk". This should only occur on write
  operations.

  0x04
  "Sector not found". This typically indicates bad disk
  parameters in the IPCop PC's BIOS. If you are booting from
  a large IDE disk, you should check whether the IPCop PC's BIOS
  can handle the disk.

  0x06
  "Change line active". This should be a transient error. Try
  booting a second time.

  0x07
  "Invalid initialization". The BIOS failed to properly
  initialize the disk controller. You should control the BIOS
  setup parameters. A warm boot might help, too.

  0x08
  "DMA overrun". This shouldn't happen. Try booting again.

  0x0C
  "Invalid media". This shouldn't happen and might be caused
  by a media error. Try booting again.

  0x10
  "CRC error". A media error has been detected. Try booting
  several times, and if all else fails, replace the media.

  0x11
  "ECC correction successful". A read error occurred but was
  corrected. LILO does not recognize this condition and aborts
  the load process anyway. A second load attempt should succeed.

  0x20
  "Controller error". This shouldn't happen.

  0x40
  "Seek failure". This might be a media problem. Try booting
  again.

  0x80
  "Disk timeout". The disk or the drive isn't ready. Either
  the media is bad or the disk isn't spinning. If you're
  booting from a floppy, you might not have closed the drive
  door. Otherwise, trying to boot again might help.

  0xBB
  "BIOS error". This shouldn't happen. Try booting again. 
 

Well, I felt it had to be hardware, so that would have just confirmed it, and I did feel that it was going to be easier to track this down on-site than trying to work with the client over the phone. Providence isn't very far away, so..
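Since the whole problem was having no way to look the code up once on site, the list quoted above is small enough to keep offline. What follows is only a minimal sketch in Python: the names `DISK_ERROR_CODES` and `describe` are made up for illustration, and the table contains nothing beyond the codes and short names from the quoted list.

```python
# Short names copied from the error list quoted above.
# DISK_ERROR_CODES and describe() are hypothetical names for this sketch.
DISK_ERROR_CODES = {
    0x00: "Internal error",
    0x01: "Illegal command",
    0x02: "Address mark not found",
    0x03: "Write-protected disk",
    0x04: "Sector not found",
    0x06: "Change line active",
    0x07: "Invalid initialization",
    0x08: "DMA overrun",
    0x0C: "Invalid media",
    0x10: "CRC error",
    0x11: "ECC correction successful",
    0x20: "Controller error",
    0x40: "Seek failure",
    0x80: "Disk timeout",
    0xBB: "BIOS error",
}


def describe(code):
    """Return a one-line description for a boot loader disk error code."""
    name = DISK_ERROR_CODES.get(code)
    if name is None:
        # Codes missing from the quoted list are reported as unknown
        # rather than guessed at.
        return "0x%02X: not in the list quoted above" % code
    return "0x%02X: %s" % (code, name)


if __name__ == "__main__":
    print(describe(0x10))
```

Calling `describe(0x10)` gives the "CRC error" entry, which is the one that mattered here.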

When I arrived on site, I just quickly confirmed that the symptoms were as told to me. Too many times I have had someone tell me one thing and found something entirely different when on-site, but this time the error was accurately reported. Still lacking sufficient coffee, I sat down at a Windows machine and tried to call up Google.

Well, duh! The SME is the gateway to the internet! No gateway, no Internet, no Google. I shook my head in amazement and called Mitel support. In a very few minutes, I had one of the regular engineers on the phone. I explained that I would have looked this up myself if I had turned on my brain before getting in my car, and he laughed at me and did the search for me. In a few seconds, he told me it was most likely hardware.

I asked the customer for last night's DVD (we run Microlite Edge for backup here) but it wouldn't boot. That surprised me at the time, though later I found out why. I then asked for the boot recovery floppy we had created when the system was installed. That wouldn't boot either, which was upsetting. Finally, I asked if he had a recent Desktop Backup - he said yes, but when we tried to find it on his Windows machines, there was none.

Oh boy. Just the way I wanted things to work out - no backups, hardware boot error. Good thing it's only a 40 person office. Yes, I'm being sarcastic.

Fearing the worst, I inserted the SME install CD and rebooted. To my surprise, it saw the existing installation and offered to upgrade it. What the heck - I let it try, and it completed successfully. But the same 0x10 boot error came up. So, I booted that CD again, and this time when it got to the point of offering to upgrade, I did an ALT-F2 and had a shell prompt where I did a "cd /mnt/sysimage" and took a look around. All data was apparently intact, which meant that whatever hardware issue we had might be isolated to the boot files. I also now realized why the Edge DVD didn't boot: this is a software raid system, which Edge can't handle at the present time. We never told Edge to attempt a bootable backup because it can't.

But knowing that it was RAID gave me hope. Examining /proc/mdstat showed me:

Personalities : [raid1] 
read_ahead 1024 sectors
md2 : active raid1 hda3[1]
      262016 blocks [2/1] [_U]
      
md1 : active raid1 hda2[1]
      119684160 blocks [2/1] [_U]
      
md0 : active raid1 hda1[1]
      102208 blocks [2/1] [_U]
      
unused devices: <none>
 

The Mitel engineer explained that it should be showing [UU] for each line, and that the [_U] indicated a raid problem. At that point, I decided we should shut down the machine and open it up.
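That check is easy to automate so a degraded mirror gets noticed before the next reboot. The following is a minimal sketch in Python, assuming the /proc/mdstat layout shown above (an "mdN :" header line followed by a line ending in a status such as [UU] or [_U]). The function name `degraded_arrays` and the embedded sample text are only for illustration; this is not part of SME or any Mitel tool.

```python
import re

# Sample text copied from the /proc/mdstat output shown above.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda3[1]
      262016 blocks [2/1] [_U]

md1 : active raid1 hda2[1]
      119684160 blocks [2/1] [_U]

md0 : active raid1 hda1[1]
      102208 blocks [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for every array with a missing member.

    A status such as "UU" means all members are up; an underscore
    ("_U") marks a member that is missing or failed.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status:
            if "_" in status.group(1):
                degraded[current] = status.group(1)
            current = None
    return degraded


if __name__ == "__main__":
    for name, status in sorted(degraded_arrays(SAMPLE_MDSTAT).items()):
        print("%s is degraded: [%s]" % (name, status))
```

On a live system you would pass in the contents of /proc/mdstat instead of the sample text, for example from a cron job that sends mail whenever the result is not empty.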

When we did that, I could immediately feel that the master IDE drive was much hotter than the slave. The slave was warm, the master was uncomfortably hot. Touching the top of it with my finger made me feel I could blister my skin if I left it there long - it was that hot. I removed it, changed the jumper on the slave to make it the master, put the cable back, and buttoned the machine up. To my relief, it rebooted.

That's not a guarantee with RAID. If the hardware problem had caused data corruption prior to failing completely, the corruption would have been mirrored to the slave. Fortunately that was not the case here.

So we were back up - short one hard drive, but up and running. I asked the Mitel engineer if I needed to reinstall blades because of the "upgrade", but he explained that it wouldn't overwrite newer files.

I then took a look at the Edge backup files - the backup had been failing for the past 10 days. I chastised the customer for not alerting me to that problem but I realize that he's a busy guy and probably had other things on his mind. I left the system doing a Desktop Backup and advised the customer that they really should consider better hardware for such a critical system.




© Tony Lawrence




---August 19, 2004

Another case of someone who would rather put out the fire, instead of preventing it in the first place. We have all learned that critical systems should not have cheap hardware, and for some people, they need to really lose some serious money, before they realize the value of solid hardware. I won't even cheap out too much on hardware for my home anymore, after being down in the past because of a "good deal" on a not so name brand piece of hardware.

The E-Machine may be a fine desktop, but server? I bet this company has one or two desktops that would probably make a better server than their current server, no?

What happened with the boot floppy made when the system was installed? Did it ever boot when the machine was installed? I have had way too many floppies go bad on me over the years, and really do not trust them at all. I usually will take the image from a boot floppy, and place it on a bootable CD-ROM, with menu access to the image. See: http://aplawrence.com/Unixart/bgriprecovery.html

I keep lots of different floppy images on this CD, all ready to boot in case of emergency. This CD gets a lot of use. It is also useful for Windows machines, which really need help from time to time. I have used it on very few Linux systems, and all of those were HW issues too.

- Bruce Garlock

---August 19, 2004

I dunno why the boot floppy wouldn't work, but I did try making a new one and that kept failing. It's not so critical with the SME because the install CD does give you a shell prompt on ALT-F2 long before it does anything destructive.

Customer is buying new hardware today :-)

--TonyLawrence

"0x10 'CRC error'.

As soon as I saw that I stopped reading and had a good chuckle -- both at Tony's expense for not tanking up on plenty of caffeine before shifting into Drive (although Providence is lovely this time of year -- spent many liberty hours there while stationed at Newport during my Navy days), and the client for being too cheap to get quality hardware. As always, you only get what you pay for, and with an E-Machine, that isn't much. Don't these people understand the cost/value relationship?

Also, IDE RAID is snake oil. As was quickly discovered, when a failure was imminent, the IDE RAID didn't really do its job. A quality SCSI RAID-5 setup would have booted even if one of the drives had failed to respond to SCSI inquiry and start-unit commands. In fact, I have demonstrated that to more than one skeptical client by unplugging one of the drives from the bus and flipping on the power.

BTW, it is possible that the floppy failure was due to a corrupted boot block in the BIOS. That area of the EEPROM is supposed to be write-protected and sacrosanct, present in the event a BIOS flash fails for any reason. The theory is that if the boot block stays intact, a minimal boot can be executed from a floppy, with the boot block code making few assumptions about the installed hardware. It would then be possible to reflash the BIOS to fix whatever was wrong.

The cheap-*redacted*, no-name motherboards used by E-Machines may have faulty EEPROM write circuitry or no protected boot block, which could have been overwritten at some time by a faulty BIOS flash -- or a BIOS CRC error could have occurred during the flash and was ignored. Did anyone attempt to upgrade the BIOS in the recent past?

"I then took a look at the Edge backup files - the backup had been failing for the past 10 days. I chastised the customer for not alerting me to that problem but I realize that he's a busy guy and probably had other things on his mind."

He won't be busy for long if his system craps out and he has no viable backups. It's real hard to conduct business when all of your data is as inaccessible as the far side of Uranus.

--BigDumbDinosaur

---August 19, 2004


Not disagreeing at all, but it's not IDE raid - it's Linux software raid. As for expense, well, I'd say this place has to run on a pretty tight purse - but of course that's no excuse for using really low-end stuff. I had an intelligent conversation with the owner - he said "I don't want to spend money I don't need to spend, but I don't want to be stupid about it either". I explained that I'm of the same mind, and would never ask him to spend an extra dollar if I didn't think it necessary.

Oh, and this part of Providence isn't all that beautiful :-)

--
Tony Lawrence

Hmmm...guess I didn't read closely enough the first time. Now I see where you mentioned software RAID. Now I know why my wife keeps hounding me to see the eye doctor.

"Oh, and this part of Providence isn't all that beautiful :-)"

Sounds like where one or two of my clients are located. How about the west side of Chicago? Baghdad isn't any more dangerous.

--BigDumbDinosaur

---August 20, 2004

There's cheap, and then there's stupid cheap:

Ultimately, it wasn't the choice of an Emachine that led to the customer's problem, it was the choice of using software RAID instead of hardware RAID. My personal opinion is that software RAID is worse than no RAID at all. Recovery is always a challenge, and it seems like every system is different. With no RAID, the customer would probably be more diligent in attending to his backups, knowing that his company is no more secure than his latest backup. But what's especially distressing is that IDE RAID solutions from name brands such as Adaptec are *not* expensive. And on high-end Intel desktop motherboards (875P chipset), it adds maybe $20 to the cost for the RAID option. So they can still be cheap without being stupid about it.

--Bob

---August 20, 2004

I have had nothing but good experiences with Linux software raid. As I said, I just had to pull the drive and make the slave the master, and we were back up. Nothing hard about that.

--TonyLawrence


