This is based on a real incident, though the facts have been simplified a bit to make it easier to follow.
"Type cd space slash e t c", I yelled.
Somewhere in Ohio, a slightly frazzled service tech questioned what he'd heard: "cd slash e c t?"
When doing Unix support with a Windows user, I always try to be very patient and very friendly. Having a lousy phone connection doesn't help this process: it's hard to sound friendly when you have to shout. But at least we were making progress. When the call started, the machine had crashed and was not coming up. Actually, that turned out to be not quite true, but it looked that way to the people on-site, and, given the circumstances, that was a reasonable conclusion.
From their point of view, here's what happened: all serial printers suddenly stopped working. At the same time, a "PANIC: clfree - Free block " message appeared on the console screen. Being unable to do anything else, they powered off, and when the system came back up, it of course needed to run fsck. They did that, but the system just "went dead" after the fsck - no logins.
I was a little confused at first, because they told me this was a SCO 5.0.5 system, but you shouldn't see that clfree panic after 3.2v4.2. It's possible to have a similar problem on 5.0.5, but the message would be "PANIC: HTFS: Free block freed on HTFS" normally. Either way, I had a good idea what part of the problem was. The solution would be simple: get to single user mode, run "fsck -b -s -y /dev/root" or "fsck -ofull", possibly update some patches and we'd be done. But it wasn't going to be that easy.
In fact, the scratchy voice at the other end told me that he'd been trying to get to single user mode, but without any luck. No matter what he did, the system would either panic again, or just "go dead" on him, necessitating a power cycle reboot. Hmm. That didn't sound good. Maybe it was missing some important files, like inittab? But no, as I had him read me what he saw on the screen, it was apparent TCP/IP was starting up. He had no logins on the console, and had no Digiboard-connected terminals anyway, but I asked him to try to telnet in from his laptop, and to his surprise, he got in.
I knew what was wrong now: one of the rc scripts hadn't finished. Inittab is set to "wait" for the rc scripts to finish; if one does not finish, gettys never start on the console or on the Digiboards. If you have TCP/IP, that will have started before these scripts, so you can telnet in.
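If you'd like to see that blocking behavior without touching a real system, here is a rough sketch. It is not how init itself is implemented; it just runs three made-up scripts (S01first, S02hang, S03last) one after another from a temporary directory, the way a "wait" entry runs the rc scripts in sequence. The directory, file names and the 60 second sleep are all assumptions chosen for the demonstration.

```shell
#!/bin/sh
# Sketch only: a stand-in for an rc runner. All names here are invented.
demo=$(mktemp -d)
log="$demo/started.log"

# Three start scripts; the middle one never returns on its own.
printf 'echo S01first >> "%s"\n' "$log" > "$demo/S01first"
printf 'echo S02hang >> "%s"\necho $$ > "%s"\nexec sleep 60\n' \
    "$log" "$demo/hang.pid" > "$demo/S02hang"
printf 'echo S03last >> "%s"\n' "$log" > "$demo/S03last"

# Run them strictly in order, in the background so this demo can carry on.
( for f in "$demo"/S[0-9]*; do sh "$f"; done ) >/dev/null 2>&1 &
runner=$!

# Wait (up to 10 seconds) until the hung script has recorded its PID.
n=0
while [ ! -s "$demo/hang.pid" ] && [ "$n" -lt 10 ]; do
    sleep 1
    n=$((n + 1))
done

started=$(tr '\n' ' ' < "$log")
echo "scripts started so far: $started"

# Clean up: stop the runner first so S03last never runs, then the hung script.
kill "$runner" 2>/dev/null
kill "$(cat "$demo/hang.pid")" 2>/dev/null
wait "$runner" 2>/dev/null || true
```

The third script never gets its turn, which is the same reason the gettys never appeared: everything queued behind the hung script simply waits.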
But which script? If this had been 5.0.5, I would have looked at /etc/rc2, as described at http://support.sco.com/rn_cgi/partneronline.cfg/php/enduser/std_adp.php?p_faqid=110825. However, a "uname -X" told me that this was 3.2v4.2 as I had suspected. So, I had the tech do this:
cd /etc/rc2.d
ls -lut
(See Troubleshooting for the why behind that).
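The short version of the why: "-u" makes ls use the last access time rather than the modification time, and "-t" sorts newest first, so a script that was read during this boot floats to the top and one that was never reached keeps its old date. Here is a small sketch of that idea using a scratch directory. S88USERDEF is the name from this incident; S85alpha and S90omega are made up, and the access times are set by hand with touch rather than by actually running anything.

```shell
#!/bin/sh
# Sketch only: invented files and timestamps in a scratch directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
: > S85alpha
: > S88USERDEF
: > S90omega

# Pretend the first two were read during today's boot, five seconds apart,
# and the third was last read several days earlier.
touch -a -t 202401100900.00 S85alpha
touch -a -t 202401100900.05 S88USERDEF
touch -a -t 202401050900.00 S90omega

# -u: use access time; -t: sort by time, newest first.
last_read=$(ls -tu | head -1)
never_reached=$(ls -tu | tail -1)

echo "most recently read: $last_read"
echo "oldest access time: $never_reached"
```

One caution: this trick depends on the filesystem actually recording access times. On a modern system mounted with an option such as noatime, the dates may not tell you anything.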
I asked him if all the dates he could see were the same. He said most were, but the last was dated several days ago. That told me that this script had never been reached, because some other script was hung. I then had him do
ls -lut | head -2
That told me that the LAST script executed was S88USERDEF. Taking an educated guess, I had him immediately do:
cd /etc/rc.d/8
ls -lut
and asked again if the dates were all the same. He said (as I expected) that there were only three files, "pcu", "digscr", and "userdef", and that "userdef" had an old date on it. I asked if "pcu" was the first file listed, and he said it was. I asked him to look at it with "more", and he said it was "gibberish". That shouldn't be: that is a text script that is part of the Digi initialization. I asked him to edit it and put an "exit 0" as the first line, and then to type "reboot".
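The "exit 0" trick works because the shell reads and runs a script one command at a time: it exits on the first line and never looks at the damage below. On the phone we did this in an editor; the sketch below does the same thing non-interactively on a stand-in file. The name "pcu.demo" and its junk contents are invented for the example, and the copy to ".orig" is my own addition so the damaged file can be examined later.

```shell
#!/bin/sh
# Sketch only: "pcu.demo" is a stand-in, not the real Digi script.
demo=$(mktemp -d)
script="$demo/pcu.demo"

# A "damaged" script: a line the shell cannot parse, then a long sleep.
printf '%s\n' ')(*&^ this is not a shell script ((' 'sleep 600' > "$script"

# Keep the damaged copy, then rebuild the file with "exit 0" on top.
cp "$script" "$script.orig"
{ echo 'exit 0'; cat "$script.orig"; } > "$script"

# The shell exits at line one and never reaches the junk below it.
sh "$script"
status=$?
echo "exit status: $status"
```

Remember that this only gets you past the hang. Whatever the script was supposed to initialize does not get initialized, so expect that piece to be broken until the file is restored.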
This time, the system happily came to a "Control-D" prompt. I had him put in the root password, and run "fsck -b -s -y /dev/root". That had to clear a lot of files, but I could tell from the modes that these were temporary files and named pipes, so I wasn't too concerned. After fsck finished, we went multi-user, and everything appeared to be working, except that none of the Digiboard printers worked. That didn't particularly surprise me, as we had short-circuited an initialization file and may have had a defective board anyway. I asked the tech if he knew about "mpi" to run Digi's diags, and he was already familiar with that, so at this point I left him, suggesting that he should at least try downloading new Digi software, but that a better idea would be to put the printers on a print server and eliminate all need for serial ports. He agreed that was a good idea.
I do not know what caused the problem with the Digiboard file. I did suggest that this anomaly and the file system corruption might be an indication that the hard drives or memory could be failing, and cautioned that he should be religious about backups and consider an upgrade to new hardware as soon as possible. He assured me that was already planned.
The combination of the clfree panic and the /etc/rc.d/8 hanging made this a more difficult problem than it otherwise would have been.
More Articles by Tony Lawrence
We used to get a lot of "PANIC: HTFS: Free block freed on HTFS" errors on our 5.0.5 machine, until I applied (SLS) OSS647A, the Sdsk Supplement for that machine. I also had to apply the proper IBM ipsraid driver, and the machine has been fine since. After a couple of 3AM pages by our night supervisor, and many fscks later, I'm glad that those patches fixed the problem with that machine!
- Bruce Garlock