clfree panic and no logins at console
This is based on a real incident, though the facts have
been simplified a bit to make it easier to follow.
"Type cd space slash e t c", I yelled.
Somewhere in Ohio, a slightly frazzled service tech questioned
what he'd heard: "cd slash e c t?"
When doing Unix support with a Windows user, I always try to be
very patient and very friendly. Having a lousy phone connection
doesn't help this process: it's hard to sound friendly when you
have to shout. But at least we were making progress. When the call
started, the machine had crashed and was not coming up. Actually,
that turned out to be not quite true, but it looked that way to the
people on-site, and, given the circumstances, that was a reasonable
assumption.
From their point of view, here's what happened: all serial
printers suddenly stopped working. At the same time, a "PANIC:
clfree - Free block " message appeared on the console screen. Being
unable to do anything else, they powered off, and when the system
came back up, it of course needed to run fsck. They did that, but
the system just "went dead" after the fsck - no logins.
I was a little confused at first, because they told me this was
a SCO 5.0.5 system, but you shouldn't see that clfree panic after
3.2v4.2. It's possible to have a similar problem on 5.0.5, but the
message would be "PANIC: HTFS: Free block freed on HTFS" normally.
Either way, I had a good idea what part of the problem was. The
solution would be simple: get to single user mode, run "fsck -b -s
-y /dev/root" or "fsck -ofull", possibly update some patches and
we'd be done. But it wasn't going to be that easy.
In fact, the scratchy voice at the other end told me that he'd
been trying to get to single user mode, but without any luck. No
matter what he did, the system would either panic again, or just
"go dead" on him, necessitating a power cycle reboot. Hmm. That
didn't sound good. Maybe missing some important files, like
inittab? But no, as I had him read me what he saw on the screen, it
was apparent TCP/IP was starting up. He had no logins on the
console, and had no Digiboard-connected terminals anyway, but I
asked him to try to telnet in from his laptop, and to his surprise,
he got in.
I knew what was wrong now: one of the rc scripts hadn't
finished. Inittab is set to "wait" for the rc scripts to finish; if
one does not finish, gettys never start on the console or on the
Digiboards. If you have TCP/IP, that will have started before these
scripts, so you can telnet in.
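
To make the "wait" behaviour concrete, here is a toy model in plain shell. It is only a sketch, not SCO's actual rc driver: the script names S01one, S02hang and S03three are made up, and the five second sleep stands in for a script that never returns. The point is simply that a driver which runs scripts strictly in order never reaches the later ones while an earlier one is stuck.

```shell
# Scratch directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Three hypothetical startup scripts; each one records that it started.
printf 'echo one >> log\n' > S01one
printf 'echo two >> log\nsleep 5\n' > S02hang   # stands in for the hung script
printf 'echo three >> log\n' > S03three         # stands in for "start the gettys"

# Run them in name order, waiting for each one before starting the next.
# Output is discarded so the background job holds no terminal or pipe open.
( for f in S*; do sh "./$f"; done ) >/dev/null 2>&1 &
runner=$!

# Look in from "outside" while the second script is still stuck.
sleep 1
started=$(grep -c '' log)
echo "scripts started so far: $started of 3"

# Stop the driver loop; the stuck script's own sleep simply runs out later.
kill "$runner" 2>/dev/null
```

While S02hang is stuck, S03three has not even been started, which is the same reason the console never got a login prompt even though TCP/IP, started earlier, was already up.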
But which script? If this had been 5.0.5, I would have looked at
/etc/rc2, as described at
"OpenServer 5.0.5, system hangs just before the login prompt when booting to multiuser mode".
However, a "uname -X" told me that this was 3.2v4.2 as I had
suspected. So, I had the tech do this:
for the why behind that).
I asked him if all the dates he could see were the same. He said
most were, but the last was dated several days ago. That told me
that this script had never been reached, because some other script
was hung. I then had him do
ls -lut | head -1
That told me that the LAST script executed was S88USERDEF.
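
The access-time trick can be tried safely in a scratch directory. This is a sketch under assumptions: it uses a modern Linux or BSD shell with "mktemp" and "touch -a -t", and every file name except S88USERDEF is a made-up stand-in. It also uses "ls -ut" rather than "ls -lut", because on the systems assumed here the long listing of a directory begins with a "total" line that "head -1" would pick up instead of a file name.

```shell
# Scratch directory with stand-ins for startup scripts.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch S01first S50middle S88USERDEF S99never

# Fake the access times. -a changes only the access time;
# -t takes a timestamp in [[CC]YY]MMDDhhmm form.
touch -a -t 202401100900 S01first     # read during this boot
touch -a -t 202401100901 S50middle    # read during this boot
touch -a -t 202401100902 S88USERDEF   # last one read, then the boot hung
touch -a -t 202401050900 S99never     # not read since an earlier boot

# -u selects access time, -t sorts newest first.
last=$(ls -ut | head -1)
echo "last script read: $last"
```

Reading a script in order to run it updates its access time, so the newest access time points at the last script that was started, and any script with an older date was never reached.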
Taking an educated guess, I had him immediately do:
and asked again if the dates were all the same. He said (as I
expected) that there were only three files, "pcu", "digscr", and
"userdef", and that "userdef" had an old date on it. I asked if
"pcu" was the first file listed, and he said it was. I asked him to
look at it with "more", and he said it was "gibberish". That
shouldn't be: that is a text script that is part of the Digi
initialization. I asked him to edit it and put an "exit 0" as the
first line, and then to type "reboot".
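
For anyone who would rather not talk someone through an editor over a bad phone line, the same change can be sketched with ordinary shell commands. This runs against a stand-in file in a scratch directory, not the real Digi script, and the ".orig" backup name is just an illustrative choice.

```shell
# Scratch directory and a stand-in for the damaged script.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'sleep 1000\n' > pcu

# Keep a copy of the original, then rebuild the file with "exit 0" on top.
cp pcu pcu.orig
{ echo 'exit 0'; cat pcu.orig; } > pcu

# The shell exits at the first line and never gets to the rest.
sh ./pcu
status=$?
echo "pcu exit status: $status"
```

Keeping the untouched copy around means the damaged file is still available to examine or restore from later.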
This time, the system happily came to a "Control-D" prompt. I
had him put in the root password, and run "fsck -b -s -y
/dev/root". That had to clear a lot of files, but I could tell from
the modes that these were temporary files and named pipes, so I
wasn't too concerned. After fsck finished, we went multi-user, and
everything appeared to be working, except that none of the
Digiboard printers worked. That didn't particularly surprise me, as
we had short-circuited an initialization file and may have had a
defective board anyway. I asked the tech if he knew about "mpi" to
run Digi's diags, and he was already familiar with that, so at this
point I left him, suggesting that he at least should try
downloading new Digi software, but that a better idea would be to
put the printers on a print server and eliminate all need for
serial ports. He agreed that was a good idea.
I do not know what caused the problem with the Digiboard file. I
did suggest that this anomaly and the file system corruption might
be an indication that the hard drives or memory could be failing,
and cautioned that he should be religious about backups and
consider an upgrade to new hardware as soon as possible. He assured
me that was already planned.
The combination of the clfree panic and the /etc/rc.d/8 hanging
made this a more difficult problem than it otherwise would have
been.
© 2012-08-02 Tony Lawrence