APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Another RAID failure

© November 2004 Tony Lawrence

Must be something in the air. I've had another RAID failure. This time, it was a hardware RAID, specifically a seven-year-old DPT controller (DPT was subsequently bought by Adaptec).

The "Windows consultant" called me first, saying that he had come in and found the machine beeping, and realized this must be a drive failure. He also said that the backup had failed, and gave the usual apologetic "I'm not a Unix guy" (funny, though - he runs his own website on a Linux box). I understand the concern, but as I pointed out to him, this isn't an OS issue at all - the RAID is OS independent. However, I understood his worry about the backup because that could indicate something more serious like a controller or motherboard problem.

This is too important a system to leave to chance, so I cleared my schedule and drove down to the site. It's not that I don't trust the Windows guy, but I didn't want information filtered through a telephone - either from him or from me to him. Too easy to make an awful, irretrievable mistake.

Upon arrival, I ran "dptmgr" and confirmed that ID 3 did indeed show as failed. I also looked at the Microlite BackupEDGE printout and could see that the failure was in just one file - a Hard Read Error 6. It happened to be a log file, so if that was all it was, I wasn't too concerned. However, there are two places that error could come from - either real read errors from the array, or file system inconsistency (the inode containing pointers to impossible blocks). I explained to the customer that the failed drive wouldn't cause real read errors - the RAID reconstructs the missing data. Therefore, if these really were bad reads, we had a very serious problem.
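To illustrate why a single failed drive shouldn't produce real read errors, here is a minimal sketch of parity reconstruction. It assumes a single-parity (RAID 5 style) layout, which the article itself never states; the block contents and the xor_blocks helper are made up for the example and have nothing to do with how the DPT firmware actually does it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks in one stripe, plus their parity block.
data = [b"ACCT", b"LOGS", b"MAIL"]
parity = xor_blocks(data)

# Simulate losing the drive holding the second block:
# XOR of the surviving data blocks and the parity gives it back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

The same arithmetic is why a second failure is fatal: with two blocks missing from a stripe, one parity block is no longer enough to recover either of them.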

However, nothing in the system logs (messages) showed any disk read errors, so file system damage looked like the more likely cause. This would most likely be related to the RAID failure - the disk might not have failed instantly, and may have caused some corruption as it died. If it truly was confined to that one file, we'd be fortunate indeed. I ran an "fsck -ofull" (SCO system) and sure enough, it identified problems with the same file Microlite BackupEDGE had complained about, and was able to clear everything out and give us back a good filesystem. That was a relief.

Now, of course, we needed to fix the failed drive. We had a bit of low comedy there - the last time I had seen the cabinet the drives were in was seven years ago, and I don't think the Windows guy had ever seen it. We couldn't figure out how to open it to get at the drives! But that wasn't what really bothered me. It was the replacement drives he had that had me worried. When we had originally installed these drives, we had tagged each drive with a paper sticky tag giving its ID. The drive he was proposing to replace the failed one with had such a tag on it, making me suspect that it was a bad drive previously removed from this box. However, we had nothing else - it's hard to find SCSI-3 drives off the shelf nowadays, so after finally figuring out how to get the old drive out, we put in the replacement and started the rebuild process. Based on the percentage counter, I knew it would take close to three hours for a rebuild. There's no reason the system couldn't be used while rebuilding, but the customer and the Windows guy said they'd prefer to just wait. I went along, and we went for a long lunch.
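The rebuild estimate from a percentage counter is simple proportion. A rough sketch, assuming the rebuild proceeds at a steady rate; the function name and the sample figures (5% done after 9 minutes) are hypothetical and not readings from this system:

```python
def estimate_remaining_minutes(percent_done, elapsed_minutes):
    """Estimate minutes left in a rebuild, assuming a constant rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be greater than 0 and at most 100")
    total_minutes = elapsed_minutes * 100.0 / percent_done
    return total_minutes - elapsed_minutes

# Hypothetical reading: 5% complete after 9 minutes.
remaining = estimate_remaining_minutes(5, 9)
print(f"about {remaining / 60:.1f} hours to go")
```

A rebuild on a system that is in use would likely run slower than this kind of estimate suggests, since the controller has to share the drives with normal I/O.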

Shortly after we came back, the rebuild failed. I wasn't overly surprised. By now, we had found new drives which were on their way by FedEx, but there was little more we could do today. I told the customer to let people back on but to warn them that there was a small possibility of losing whatever they posted that day (if we lost another drive, we'd be dead). I left.

The next morning, I called the customer again. He said that the backup had failed again. I asked for specifics, but was told there was no printout. I checked the Edge logs, and it looked to me like it had been interrupted partway through the verify. I asked if the database was "up" this morning (we shut down the database before the backup and restart it when it is done). I was told no, the Windows guy had rebooted the machine this morning because the database wasn't running. I wish people wouldn't reboot machines - it's simple to start the database and I just can't stand the Windows "reboot fixes everything" mentality. Anyway, I could tell from the logs what happened - because the RAID was running degraded, it was much slower backing up. It just hadn't finished its verify by the time the workday started - and to make it worse, some people had come in early because they lost so much time the day before. Since it hadn't finished, it hadn't restarted the database. I couldn't be 100% certain that the verify would have passed, but the backup had no errors, and the verify was OK up to the reboot anyway. I explained that much to the customer, and we reset the backup to start earlier as a temporary fix.
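Picking the earlier start time comes down to working backward from when people arrive. A small sketch of that arithmetic; every figure here (the 8:00 start, the four-hour normal run, the 1.5x slowdown, the half-hour margin) is a hypothetical placeholder, not a value taken from this site's logs:

```python
from datetime import datetime, timedelta

def latest_backup_start(workday_start, normal_duration, slowdown_factor, margin):
    """Latest time a backup plus verify can start and still finish before work."""
    degraded_duration = normal_duration * slowdown_factor
    return workday_start - degraded_duration - margin

start_by = latest_backup_start(
    workday_start=datetime(2004, 11, 11, 8, 0),  # hypothetical date and time
    normal_duration=timedelta(hours=4),          # hypothetical healthy-array run
    slowdown_factor=1.5,                         # hypothetical degraded penalty
    margin=timedelta(minutes=30),                # allowance for early arrivals
)
print(start_by.strftime("%H:%M"))
```

The real slowdown factor would have to be read from the backup logs of the degraded run rather than guessed.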

The new drive should be there tomorrow. Unless something really unfortunate happens, we should get this back in shape then.


---November 10, 2004

"I wish people wouldn't reboot machines - it's simple to start the database and I just can't stand the Windows "reboot fixes everything" mentality"

Isn't it funny how Windows admins always think rebooting a server will cure the problem? One of the first things I tell people who are taking their first steps with Linux is that chances are the problem will be back after a reboot. I try to explain to them that, unlike Windows, unless hardware is the problem, *you* are probably the problem, since you must have configured something wrong, which is why it is not working. I learned this early myself, and have found only a few select times that a *nix system needs to be rebooted.

- Bruce Garlock

---November 10, 2004

Another interesting thing:

A failed raid 5 is more dangerous than a non-raid single drive machine.

For one thing, you have more drives, so automatically more probability of failure. Second, you have more drive activity because of seeking to get parity info, so more heat and stress. Finally, if another drive does fail, and you need to go for reconstruction, it's much more difficult and expensive to have someone do data recovery on a raid.

That's why I like to see it fixed asap.
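The "more drives, more probability of failure" point can be put in numbers. A minimal sketch, assuming each drive fails independently with the same probability over some period; the 5% figure and the drive count are hypothetical, and drives bought together and run together for years probably don't fail independently:

```python
def p_any_failure(p_single, n_drives):
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives

# Hypothetical 5% per-drive failure chance over the period in question.
for n in (1, 4, 5):
    print(n, "drive(s):", round(p_any_failure(0.05, n), 3))
```

On a degraded array, the number that matters is the chance that any one of the surviving drives fails before the rebuild finishes.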


---November 10, 2004

Wouldn't it be a good idea to start replacing the whole thing? Or at least start planning for it...

It seems to me that if you have several hard drives that were probably bought together and have been running together for many, many years, and then one of them goes, it's probably a good indication of the health of the rest of the array.


If this installation was done seven years ago, that means the drives used in the original build would have had, at best, a rated service life of five years. I'm not surprised that a failure occurred. What does surprise me is that the client was not advised by the Windows weenie that the drives should be changed out before they exceeded service life. If this machine is that important to the client, wouldn't they want to avoid this sort of trouble?

BTW, when you say "SCSI-3" drives, what are you referring to? SCSI-3 encompasses ultra-2 (80 MB/sec on LVD), ultra-160 and the current ultra-320 standard. The time of the build suggests U-2's of one sort or another, but you didn't mention the use of a hot-swap drive cage. I assume these would be in a cage, which means you could replace the drives with current ultra-320 SCA units, gaining both a lot of capacity and improved service life.


---November 10, 2004

Yes, the whole thing has been planned to be replaced for two years now - it's now supposed to happen within months. Management was warned some time ago that they were living on borrowed time.

Not a hot-swap cage, unfortunately..


Have them come to me for a new server. <Smile> They'll get exactly what they should have gotten back then, complete with hot-swap cages and extra heavy duty drive cooling.


---November 11, 2004

They are moving to HP-UX :-)

Not my first choice, but the app vendor is most comfortable there.


