Jim Mohr's SCO Companion
Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of
Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/
Using "Problem Solving" for this chapter was a very
conscious decision. I intentionally avoided calling it
"Troubleshooting" for several reasons. First,
troubleshooting has always seemed to me to be the process by which we
look for the causes of problems. Although that seems like a noble
task, so often finding the cause of the problem doesn't necessarily
mean finding the means to correct it or understanding the problem.
The next reason is that so often I find books where the
troubleshooting section is just a list of problems and canned
solutions. I find this comparable to the sign "In case of fire,
break glass." When you break the glass an alarm goes off and the
fire department comes and puts out your fire. What the real cause is
may never be known to you.
The troubleshooting sections that I find most annoying list out 100
problems and 100 solutions, but I usually have problem 101. Often I
can find something that is similar to my situation and with enough
digging through the manuals and poking and prodding the system I
eventually come up with the answer. Even if the answer is spelled
out, it's usually a list of steps to follow to correct the problem.
There are no details about what caused the problem in the first place
or what the listed steps are actually doing.
In this chapter, I am not going to give you a list of known problems
and their solutions. The SCO documentation does that for you. I am
not going to try to give you details of the system that you need to
find the solution yourself. Hopefully, I did that in the first part
of the book. What I am going to do here is to talk about the
techniques and tricks that I've learned over the years to track down
the cause of problems. Also, we'll talk about what you can do to find
out where the answer is, if you don't have the answer yourself.
Problem solving starts before you have even installed your system.
Since a detailed knowledge of your system is important to figuring
out what's causing problems, you need to keep track of your system
from the very beginning. One of the most
effective problem solving tools costs about $2 and can be found in
grocery stores, gas stations and office supply stores. Interestingly
enough, I can't remember ever seeing it in a store that specializes
in either computer hardware or software. What I am talking about is a
notebook. Although a bound one will do the job, I find a loose leaf
one more effective since you can add pages more easily as your system
Included in the notebook is all the configuration information from
your system, the make and model of all your hardware and every change
that you make to your system. This is a running record of your
system, so the information should include the date and time as well
as the person making the entry. Every change you make, from adding
new software to changing kernel parameters, should be recorded in
your log.
Don't be terse with comments like, "Changed kernel parameter and
relinked." This should be detailed like, "Changed DTHASHQS
from 100 to 200. Relinked successful." Although it seems like
busy work, I also believe things like adding users and making backups
should be logged. If messages appear on your system, these too should
be recorded with details of the circumstance. The installation guide
contains an "installation checklist." I recommend
completing this before you install and keeping a copy of it in the log
Something else that's very important to include in the notebook is
problems that you have encountered and what steps were necessary to
correct that problem. One support engineer at SCO told me he calls
this his "solutions notebook."
While you are assembling your system, write down everything you can
about the hardware components. If you have access to the invoice, a
copy of this can be useful for keeping track of the components. If
you have any control over it, get your reseller to include details
about the make and model of all the components. I have seen enough
cases where the invoice or delivery slip contains generic terms like
486 CPU, cartridge tape drive and 500MB hard disk. Often this doesn't
even tell you if the hard disk is SCSI, IDE or what.
Next, write down all the settings of all the cards and other hardware
in your machine. The jumpers or switches on hardware are almost
universally labeled. This may be something as simple as J3, but as
detailed as IRQ. SCO installs at the defaults on a wide range of
cards and generally there are few conflicts unless you have multiple
cards of the same type. However, the world is not perfect and you may
have a combination of hardware that neither I nor SCO has ever seen.
Therefore, knowing what all the settings are can become an
One suggestion is to write this information on gummed labels or cards
that you can attach to the machine. This way you have the information
right in front of you every time you are working on the machine.
Many companies have a "fax back" service, where you can
call a number and have them fax you documentation for their products.
For most hardware, this is rarely more than a page or two. However,
for something like the settings on a hard disk, this is enough. This
has a couple of benefits. First, you have the phone number for the
manufacturer of each of your hardware components. The time to go
hunting for it is not when your system has crashed. Next, you have
(fairly) complete documentation for your hardware. Lastly, by
collecting the information on your hardware you know what you have. I
can't count the number of times I have talked with customers who
don't even know what kind of hard disk they have, let alone the settings.
Another great place to get technical information is the World Wide
Web. I recently bought a SCSI hard disk that did not have any
documentation. A couple of years ago that might have bothered me.
However, when I got home I quickly connected to the Web site of the
drive manufacturer and got the full drive specs as well as a diagram
of where the jumpers are. If you are not sure of the company name,
take a guess like I did. I tried www.conner.com and it worked the
When it gets time to install the operating system, the first step is
to read the release notes and installation guide. I am not suggesting
reading them cover to cover, but look through the table of contents
completely to ensure there is no mention of potential conflicts with
your host adapter or the particular way your video card needs to be
configured. The extra hour you spend doing that will save you several
hours later, when you can't figure out why your system doesn't reboot
when you finish the install. Oh, did I mention that you should read
the release notes and installation guide? You should. They're very
As you are actually doing the installation, the process of
documenting your system continues. Depending on what type of
installation you choose, you may or may not have the opportunity to
see many of the programs in action. If you choose an automatic
installation, then many of the programs are run without your
interaction, so you never have a chance to see and therefore document
The information you need to document is the same kinds of things we
talked about in the section on finding out how your system was
configured. This includes the hard disk geometry (dkinit), divisions
and filesystems (divvy), the hardware settings (hwconfig) and the
kernel parameters (mtune and stune). The output of all of these
commands can be sent to a file, which can be printed out and stuck in
the notebook.
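As a rough sketch of what "sent to a file" can look like, the following shell function runs whatever command it is given and saves the output under a dated name. The snapshot function, the NOTES_DIR variable and the file naming are illustrative assumptions, not part of any SCO tool, and the last line uses uname only because it exists on practically every UNIX system.

```sh
# Sketch only: snapshot() and NOTES_DIR are hypothetical names.
NOTES_DIR=${NOTES_DIR:-/tmp/sysnotes}

snapshot() {
    # $1 = short label used in the file name; the rest = command to run
    label=$1
    shift
    mkdir -p "$NOTES_DIR" || return 1
    out="$NOTES_DIR/$label.$(date +%Y%m%d)"
    {
        echo "# command: $*"
        echo "# run on:  $(date)"
        "$@" 2>&1
    } > "$out"
    echo "$out"    # print where the output went
}

# Portable stand-in; on an SCO system one might instead run something
# like:  snapshot hwconfig hwconfig
snapshot uname uname -a
```

Keeping the command line and the date at the top of each file means the printout still makes sense months later, when it is pulled out of the notebook.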
I don't know how many times I have said it and how many articles it
has appeared in (both mine and from others), but some people just don't
want to listen. They often treat their computer system like a new toy
at Christmas. They first want to get everything installed that is
visible to the outside world such as terminals and printers. In this
age of net-in-a-box, often that extends to getting their system on
the Internet as soon as possible.
Although being able to download the synopsis of the next Deep Space
Nine episode is an honorable goal for some, Chief O'Brien is not
going to come to your rescue when your system crashes. (I think even
he would have trouble with the antiquated computer systems of today.)
Once you have finished installing the operating system, the
very first device you need to get installed and configured correctly
is your tape drive. If you don't have a tape drive, buy one! Stop
reading right now and go out and buy one. It has been estimated that
a down computer system costs a company, on the average, $5000.00 an
hour. You can certainly convince your boss that a tape drive costing
one-tenth as much is a good investment.
One of the first crash calls I got while I was in SCO Support was
from the system administrator at a major airline. After about 20
minutes, it became clear that the situation was hopeless. I had
discussed the issue with one of the more senior engineers who
determined that the best course of action was to reinstall the OS and
restore the data from backups.
"What backups?", I can still remember their system
administrator say. "There are no backups."
"Why not?" I asked.
"We don't have a tape drive."
"My boss said it was too expensive."
At that point the only solution was a data recovery service.
"You don't understand," he said, "there is over
$1,000,000 worth of flight information on that machine."
"Not any more."
What is that lost data worth to you? Even before I started writing
this book, I bought a tape drive for my home machine. For me it's not
really a question of data, but rather time. I don't have that much
data on my system. Most of it can fit on a half-dozen floppies. This
includes all the configuration files that I have changed since my
system was installed. However, if my system were to crash, the time I
save restoring everything from tape, as compared to reinstalling
from floppies, is worth the money I spent.
The first thing to do once the tape drive is installed is to test it.
The fact that it appears at boot says nothing about its
functionality. It has happened enough that it appears to work fine,
all the commands behave correctly and it even looks as if it is
writing to the tape. However, it is not until the system goes down
and the data is needed that you realize you cannot read the tape.
I suggest first trying the tape drive out by backing up a small
sub-directory such as
/etc/default. There are enough files to give the tape drive a
quick workout, but you don't have to wait for hours for it to finish.
Once you have verified that the basic utilities work (like tar
or cpio), then try backing
up the entire system. If you don't have some third party backup
software, I recommend that you use cpio. Although tar can back up
most of your system, it cannot back up device nodes. Don't use
something like dump, as this simply makes an image of your
filesystem and getting back individual files is next to impossible.
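A quick workout along these lines could be scripted as follows. This is a minimal sketch, assuming a tar that accepts the cf and xf options and a system that has mktemp. TAPE and tape_smoke_test are hypothetical names, and TAPE points at a scratch file here so the sketch can be tried without a drive. What it illustrates is that writing without errors proves little; only reading the data back and comparing it does.

```sh
# Sketch only: TAPE would be your tape device on a real system.
TAPE=${TAPE:-/tmp/tapetest.$$}

tape_smoke_test() {
    # $1 = small directory to back up and read back (absolute path)
    src=$1
    scratch=$(mktemp -d) || return 1
    # Write the directory to the "tape" ...
    (cd "$src" && tar cf "$TAPE" .) || { rm -rf "$scratch"; return 1; }
    # ... then restore it somewhere harmless and compare with the original.
    (cd "$scratch" && tar xf "$TAPE") || { rm -rf "$scratch"; return 1; }
    diff -r "$src" "$scratch" > /dev/null
    status=$?
    rm -rf "$scratch"
    return $status
}
```

Calling tape_smoke_test /etc/default would then succeed only if every file that went onto the tape comes back identical.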
I personally use a "super-tar" product on my system. These
are from third party vendors that not only provide a very usable
interface, they often do bit level verification and provide complex
inclusion and exclusion mechanisms. The discussion of which one of
these super-tar products to use is almost religious. Like religion,
it's a matter of personal preference. I use Cactus' Lone-Tar because
I have a good relationship with the company president, Jeff Hyman,
who pops up regularly in the SCO Forum on CompuServe. Lone-Tar makes
backups easy to make and easy to restore. A working demo of Lone-Tar
(as well as other super-tar products) is available from the SCO
libraries on CompuServe.
After you are sure that the tape drive works correctly, then you
should create a boot/root floppy. A boot/root floppy is a pair of
floppies that you use to boot your system. The first floppy contains
the necessary files to boot and the root floppy contains the root
Creating a boot/root floppy can be done by either using mkdev
fd or sysadmsh
on ODT or the Floppy Filesystem Manager on OpenServer. Here
you want to create both a boot and a root floppy. The primary reason
for making this just after you install the tape drive and before
anything else is simply a matter of space. If you have installed too
much, the kernel will not be able to fit on the floppy. Once you've
made the boot/root floppy set, test it. Also test the tape drive from
the floppy. Although not that common, I have seen cases where the
major and minor number on the floppy for the tape drive was
Now that you are sure that your tape drive and your boot/root floppy
set work, you can begin installing the rest of your software and
hardware. My preference is to completely install the rest of the
software first. This includes other SCO products. There is less to go
wrong with the software (at least little that keeps the system from
booting) and you can, therefore, install several products in
succession. When installing hardware, you should install and test
each component before you go on to the next one.
I think it is a good idea to make a copy of your link kit before you
make any changes to your hardware configuration. That way you can
quickly restore the entire directory and don't have to worry about
restoring from tape. There is a good example of how to copy an entire
directory tree on the cpio(C)
man-page. Use that example to copy the entire /etc/conf
sub-directory. I suggest using a name that is clearer than
/etc/conf.bak. Six months
after you create it, you have no idea how old it is or whether the
contents are still valid. If you name it something like
/etc/conf.06AUG95, then it
is obvious when it was created.
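The naming idea can be sketched in a few lines of shell. Note the assumptions: dated_copy is a hypothetical helper, and it uses cp -Rp as a stand-in for the cpio example from the man-page, which is not reproduced here.

```sh
# Sketch only: dated_copy is a hypothetical helper, and cp -Rp stands in
# for the cpio example from the man page.
dated_copy() {
    # $1 = directory to copy (absolute path, no trailing slash)
    src=$1
    # Day, month and year in the style of 06AUG95
    stamp=$(LC_ALL=C date +%d%b%y | tr '[:lower:]' '[:upper:]')
    dest="$src.$stamp"
    if [ -e "$dest" ]; then
        echo "$dest already exists" >&2
        return 1
    fi
    cp -Rp "$src" "$dest" || return 1
    echo "$dest"    # report the name of the new copy
}
```

Refusing to overwrite an existing copy is deliberate: a second attempt on the same day should not silently replace the known-good copy made that morning.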
Now, make the changes and test the new kernel. After you are sure
that the new kernel works correctly, then make a new copy of the link
kit and make more changes. Although this is a slow process, it does
limit the potential for problems, plus if you do run into problems,
you can easily back out of it by restoring the backup of the link
As you are making the changes, remember to record all the hardware
and software settings for anything you install. Although you can
quickly restore the previous copy of the link kit if something goes
wrong, writing down the changes can be helpful if you need to call to
Once the system is configured the way you want, make a backup of the
entire, installed system on a different tape than just the base
operating system. I like to have the base operating system on a
separate tape in case I want to do some major revisions to my
software and hardware configuration. That way, if something major
goes wrong, I don't have to pull out pieces, hoping that I didn't
forget something. I have a known starting point that I can build
At this point you should come up with a backup schedule. The System
Administrator's Guide provides you with some guidelines on this. Keep
in mind that these are "guidelines." The information
provided there is not a list of unchangeable rules. You should back up
as often as necessary. If you can only afford to lose one day's worth
of work, then backing up every night is fine. Some people back up
once during lunch and once at the end of the day. More often than
twice a day may be too great a load on the system. If you feel like
you have to do it more often, you might want to consider disk
mirroring or some other level of RAID. See the section on
filesystems for a discussion of the SCO implementation.
The type of backup you do is dependent on several factors. If it
takes ten tapes to do a backup, then doing a full backup of the
system (that is, backing up everything) every night is
difficult to swallow. (You might consider getting a larger tape
drive.) Where a full backup every night is not possible, there are
two alternatives.
First, there are incremental backups. These start with a master,
which is a backup of the entire system. Then the next backup only
records the things that have changed since the last incremental. This
can be expanded to several levels. Each level backs up everything that
has changed since the last backup of that level or a lower one.
For example, a level 2 backs up everything since the last level 2,
level 1 or level 0 (whichever is most recent). You might do a level 0
once a month (which is a full backup of everything), then a
level 1 every Wednesday and Friday and a level 2 every other day of the
week. Therefore, on Monday, the level 2 will back up everything that
has changed since the level 1 on Friday. The level 2 on Tuesday will
back up everything since the level 2 on Monday. Then on Wednesday, the
level 1 backs up everything since the level 1 on the previous
At the end of the month, you do a level 0, which backs up everything.
Let's assume this is on a Tuesday. This would normally be a level 2.
The level 1 on Wednesday backs up everything since the level 0 (the
day before) and not since the level 1 on the previous Friday.
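The level rule can be written down as a small shell function. This is only a sketch: it reads the rule as "the most recent backup of the same or a lower level," which is what the Monday and Tuesday examples imply. The reference_backup name and the "label level" history format are hypothetical.

```sh
# Sketch only: reference_backup and the "label level" history format are
# hypothetical. History is read from standard input, oldest entry first.
reference_backup() {
    # $1 = level of the backup about to be run
    if [ "$1" -eq 0 ]; then
        echo "(everything)"          # a level 0 never depends on another tape
        return 0
    fi
    awk -v want="$1" '
        $2 <= want { ref = $1 }      # remember the latest same-or-lower level
        END { print (ref == "" ? "(everything)" : ref) }
    '
}

# Friday was a level 1 and Monday a level 2; what does Tuesday's level 2 cover?
printf 'Fri 1\nMon 2\n' | reference_backup 2
```

Feeding it the month-end history (Friday level 1, Monday level 2, Tuesday level 0) and asking about a level 1 gives Tuesday rather than Friday, matching the case described above.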
A somewhat simpler scheme uses differential backups. Here, there is
also a master. However, subsequent backups will record everything
that has changed (is different) from the master. If you do a master
once a week and differentials once a day, then something that gets
changed on the day after the master is recorded on every subsequent
A modified version of the differential backup does a complete, level 0
backup on Friday. Then on each of the other days, a level 1 is done.
Therefore, the backup Monday-Thursday will back up everything since
the day before. This is easier to maintain, but you may have to go
through 5 tapes.
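One common way to find "everything that has changed since the master" is to leave a timestamp file behind when the master is written and ask find for anything newer. The sketch below assumes that convention; the function names and the stamp file are hypothetical, and only find -newer is standard.

```sh
# Sketch only: the function names and the stamp-file convention are
# hypothetical; find -newer itself is standard.
mark_master() {
    # $1 = stamp file; touch it at the moment the master is written
    touch "$1"
}

differential_list() {
    # $1 = directory to scan, $2 = stamp file left by mark_master
    find "$1" -type f -newer "$2" -print
}
```

The resulting list could then be handed to a backup utility that reads file names from standard input, as cpio -o does. Because the stamp is only touched when a new master is made, each day's list keeps growing until then, which is exactly the differential behavior.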
The third type is the simplest: you do a master
backup every day and forget about increments and differences. This is
the method I prefer since you save time when you have to restore your
system. With either of the other methods, you will probably need to
go through at least two tapes to recover your data, unless the crash
occurs on the day after the last master. If you do a full backup
every night, then there is only one backup to load. If the backup
fits on a single tape (or at most 2), then I highly recommend doing a
full backup every night. Remember that the key issue is getting your
people back to work as soon as possible. The average $5000 per hour
you stand to lose is much more than the cost of a large (8Gb) tape
This brings up another issue and that is rotating tapes. If you are
making either incremental or differential backups, then you must
have multiple tapes. It is illogical to make a master, then make
an incremental on the same tape. There is no way to get the
information from the master.
If you make a master backup on the same tape every night, you
can run into serious problems, as well. What if the system crashes in
the middle of the backup and trashes the tape? Your system is gone
and so is the data. Also if you discover after a couple of days that
the information in a particular file is garbage and the master is
only one day old, then it is worthless for getting the data back.
Therefore, if you do full backups every night, use at least five
tapes, one for each day of the week. (If you run seven days a week,
then seven tapes is a good idea.)
Although most people get this far in thinking about tapes, many
forget about the physical safety of the tapes. If your computer room
catches fire and the tapes melt, then the most efficient backup
scheme is worthless. Some companies have fireproof safes that they
keep the tapes in. In smaller operations, the system administrator
can bring the tape home from the night before. This is normally only
effective when you do masters every night. If you have a lot of
tapes, you might consider companies that provide off-site storage
Checking the sanity of your system
Have you ever tried to do something and it didn't behave the way you
expected it to? You read the manual and typed in the example
character for character only to find it doesn't work right. Your
first assumption is that the manual is wrong, but rather than calling
in a bug to SCO Support, you try the command on another machine and
to your amazement, it behaves exactly as you expect. The only logical
reason is that your machine has gone insane.
Well, at least that's the attitude I have had on numerous occasions.
Although this personification of the system helps relieve stress
sometimes, it does little to get to the root of the problem. If you
want, you could check every single file on your system (or at least
the ones related to your problem) and ensure that permissions are
correct, the size is right and all the support files are there.
Although this works in many cases, often what programs and files are
involved are not easy to figure out.
Fortunately help is on the way. SCO provides several useful tools to
not only check the sanity of your system, but to return it to normal.
The first set of tools we already talked about. These are the
monitoring tools such as ps,
these programs cannot correct your problems, they can indicate where
If the problem is the result of a corrupt file (either the contents
are corrupt or the permissions are wrong), the system monitoring
tools cannot help much. However, there are several tools that
specifically address different aspects of your system.
For starters, let's take the issue of incorrect permissions.
Under both ODT and OpenServer, there are two options: fixperm
and fixmog. The
advantage that fixperm
has is that it can not only tell you about permissions
problems, but it can also tell you the install status of the
different packages as well as create missing device nodes and
directories. On the other hand, fixmog
is fast since it is designed to fix security related problems.
Therefore, discrepancies in files such as vi
Fixperm uses what are
referred to as permissions lists. These are represented by the
files in /etc/perms. In
general, each file represents a single product. If you have TCP/IP,
for example, the files contained in the run-time TCP/IP product would
be found in /etc/perms/tcprt.
If you had the development system for TCP/IP, this would be
represented by the file /etc/perms/tcpds.
The exception is the operating system itself.
As I mentioned in an earlier chapter, each product can be broken down
into packages. These packages appear in different sections of the
permissions lists. Each file within that package is shown on a
separate line, like this:
RTS f644 root/sys 1
The fields are: package (here RTS for run-time system), type of file
and mode (here f for
regular file, and permission 644), user/group (root/sys),
links (1), path
(B02). Note that the
volume was originally designed to tell you what floppy disks this
file was on. However, if you have a tape or CD-ROM installation,
then everything is on one volume.
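To make the field layout concrete, here is a small sketch that splits one entry into its parts. It assumes the six fields appear in the order listed above, with the volume last. The parse_perms_entry name is hypothetical, and the path in the sample entry is made up purely for illustration.

```sh
# Sketch only: parse_perms_entry is a hypothetical helper, and the path in
# the sample entry is made up for illustration.
parse_perms_entry() {
    # $1 = one entry: package type+mode owner/group links path volume
    set -- $1
    pkg=$1 typemode=$2 owngrp=$3 links=$4 path=$5 volume=$6
    mode=${typemode#?}               # everything after the first character
    ftype=${typemode%"$mode"}        # the first character, e.g. f
    echo "package=$pkg type=$ftype mode=$mode owner=${owngrp%/*}" \
         "group=${owngrp#*/} links=$links path=$path volume=$volume"
}

parse_perms_entry "RTS f644 root/sys 1 ./example/file B02"
```

Splitting the type letter from the mode and the owner from the group is all it takes to compare an entry against what ls -l reports for the same file.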
If you are a new administrator, you may not know what kind of media
you installed on. To find this out, look in /etc/perms/bundle/odtps
and find the "mediatype" entry.
If we have a file with more than one link (for example
than individual entries for each link, there is a single entry like
the one we have and then the name of each additional link is listed.
So, the entry for /usr/bin/mail
would look like this:
As it's making its checks, fixperm
sees that /usr/bin/mail
has two more links and can immediately check them. Because they
are links, they are the same file and have the same permissions,
owner, group, etc. Thus fixperm only needs to ensure that they all exist
and check the permissions on one of them. If it corrects the
permissions on one of them, it fixes them for all. For more details
on the different perms files, options, etc., see the fixperm(ADM),
and perms(F) man-pages.
Like fixperm, fixmog
has its own database of information. This is the file
which represents the File Control Database. (See the section on
security for more information.) Here we have the same basic
information as in the permissions list. However, we are not concerned
with packages, volumes or even links. All we are concerned with is
access permissions, owner, group and what type of file it is. In
addition to fixmog, you
can also use cps. This
works on a single file and not on the entire File Control Database as
One major problem that both fixperm
have is that they only check for the existence of the file, as
well as file attributes such as permissions and owner, but neither the
size nor the checksum of the file. This is a major issue when files
become corrupt. The permissions may be correct and even the size
might match, however if the file is corrupt, then the checksum is
most likely going to be wrong.
SCO provides a utility to compute a checksum on a file, called sum.
It provides three ways of determining the sum. The first is with no
options at all, which reports a 16-bit sum. The next way uses the -r
option, which again provides a 16-bit checksum, but uses an older
method to compute the sum. In my opinion, this method is more
reliable since the byte order is important as well. Without the -r,
a file containing the word "housecat" would have the same
checksum if you changed that single word to "cathouse."
Although both words have the exact same bytes, they are in a
different order and give a (slightly) different meaning.
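The difference is easy to see by working the two sums out by hand. The sketch below assumes that the default method is a plain sum of the bytes folded to 16 bits and that -r is the rotating checksum, as in other System V derived versions of sum. The function names are hypothetical, and they work on a string argument rather than on a file.

```sh
# Sketch only: plain_sum and rotating_sum are hypothetical names that work
# on a string argument, not on files as the real sum does.
byte_values() {
    # Print the bytes of $1 as decimal numbers
    printf '%s' "$1" | od -An -v -tu1
}

plain_sum() {
    total=0
    for b in $(byte_values "$1"); do
        total=$((total + b))         # addition ignores byte order
    done
    # Fold the total down to 16 bits
    r=$(( (total & 65535) + (total >> 16) ))
    echo $(( (r & 65535) + (r >> 16) ))
}

rotating_sum() {
    c=0
    for b in $(byte_values "$1"); do
        # Rotate the 16-bit sum right by one bit before adding each byte,
        # so the position of a byte changes the result
        c=$(( (c >> 1) + ((c & 1) << 15) ))
        c=$(( (c + b) & 65535 ))
    done
    echo "$c"
}

for word in housecat cathouse; do
    echo "$word plain=$(plain_sum "$word") rotating=$(rotating_sum "$word")"
done
```

The two plain sums come out equal, because adding the same bytes in a different order gives the same total, while the rotating sums differ.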
Because of the importance of the file's checksum, I created a shell
script while I was in SCO support that was run on a freshly installed
system. As it ran, it would store in a database all the information
provided in the permissions lists, plus the size of the file (from an
ls -l listing), the type
of file (using the file command)
and the checksum (using sum -r).
If I was on the phone with a customer and things didn't appear right,
I could do a quick grep of that file name and get the necessary
information. If they didn't match, then I knew something was out of
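A much reduced version of such a script might look like the sketch below. It is not the script described above: it records only the checksum and size of each file, and it uses the POSIX cksum utility in place of sum -r. The function names and the database layout are hypothetical.

```sh
# Sketch only: function names and database layout are hypothetical, and
# cksum is used here instead of sum -r.
record_baseline() {
    # $1 = directory to scan, $2 = database file to create (outside $1)
    find "$1" -type f -print | sort | while IFS= read -r f; do
        # cksum prints "<checksum> <byte count>" when reading standard input
        printf '%s %s\n' "$(cksum < "$f")" "$f"
    done > "$2"
}

changed_files() {
    # $1 = database written by record_baseline
    # Prints every recorded path that is missing or no longer matches
    while read -r crc size path; do
        if [ ! -f "$path" ] || [ "$(cksum < "$path")" != "$crc $size" ]; then
            echo "$path"
        fi
    done < "$1"
}
```

Run record_baseline once on a freshly installed system and keep the database somewhere safe. After that, a grep for a file name gives the expected values, and changed_files lists anything that has drifted, even when the size is unchanged.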
Unfortunately for the customer, much of the information that my
script and database provided was something that they didn't have
access to. Now, each system administrator could write a similar
script and call up that information. However, most administrators do
not consider this issue until it's too late. OpenServer corrected
much of that problem with the introduction of the Software Manager.
Not only can the Software Manager check for the existence of files,
but can also verify the checksum of the files. Many things can be
corrected automatically, but some require that you explicitly
request the Software Manager to make the corrections. For example,
let's assume that a fat fingered system administrator removed
software manager would tell you that the file was missing, but would
not automatically restore the link until you told it to fix the
If the file that we, as users, access is missing or corrupt (such as
/usr/bin/mail), then this
method works fine. However, if the file contained within the SSO is
missing or corrupt, then the situation is more serious. This is the
same situation you would have in ODT without the Software Manager.
You need the installation media. Although it's nice to have the
Software Manager tell you it can't fix it and the user must do it by
hand, as of this writing there is nothing in the help files directly
accessed from the verify portion of the Software Manager that tells
you what to do when something is missing.
ODT was nice in that the same program that told you what file
(or files) was missing also allowed you to reinstall that missing
file. Some might be saying that the Software Manager tells you what's
missing and also lets you reinstall, so it does the same thing,
right? Unfortunately not. Although substantially more powerful
in many regards than custom
in ODT, OpenServer's Software Manager has broken some of the
basic functionality. The primary example is reinstallation of a
single file. This also extends to groups of files, even those that
are completely unrelated, which were both possible in ODT.
A solution would be to be able to reinstall the public portions of a
particular SSO/product. That way you replace the binaries without
touching the data files. Unfortunately, that avenue has been blocked.
You can no longer reinstall a package, like you could in ODT.
Further, you cannot select from the list of packages on the system
(i.e., within an SSO) to say you want to install that package new
(after you have removed it). Instead you must first read the
installation media to get a list of software to install. If you have
another machine in the network with OpenServer, you can do a network
install of that package, which is a fair bit faster.
However, this doesn't really mean that the only way to get back
single files is reinstalling the package. Fortunately, the SCO
engineers are not that vicious. They provided for us the
whose purpose is to extract files from an "SSO
distribution source." What this means is that it can pull off
individual files from tape, CD or whatever you used to install. The
basic, and most commonly used, syntax is:
Where <device> is
the device where the media resides and <SSO_path_name>
is the path name within the SSO, not the path we are
used to seeing. For example, to restore /usr/bin/vi
from a CD, the command might look like this:
customextract -m /dev/rcdt0
Note that the files are restored based on your current working
directory. Therefore, you might want to consider first changing
directories to / . If you
want to extract the files first and copy them into their proper
location by hand, you can change directories into /tmp.
In addition, you can specify a list of files to extract using the -f
option followed by the name of the file containing the list.
You can also use custom
in OpenServer to verify your software as well as correct
certain problems. These are the same kinds of problems that can be
corrected using the Software Manager. To be able to use this
functionality, you have to be familiar with the way SSOs are put
together. If you are still having trouble, we talked about SSOs in
the section on SCO basics.
An example of using custom
on OpenServer to verify and correct the Run-Time System (RTS) might
look like this:
-v SCO:Unix:RTS -x
This says I want to verify (-v)
the RTS package of the
UNIX product from the
manufacturer SCO. The
package name can be found in the SSO in the directory
in the files with the .fl
ending. For example, the file that the above command is
reading is /var/opt/K/SCO/Unix/5.0.0Cd/.softmgmt/RTS.fl.
These have a slightly different format than the old perms list, but
are fairly straightforward to read. See the custom(ADM)
man-page for more details.
Also new to OpenServer is the customquery
command. This is a very useful tool for not only finding out
what is installed, but also what versions. The basic syntax is:
where the functions include ListComponents,
and ListDescriptions. See
man-page for more details.
As we have just seen, there is a way to correct problems when
commands, utilities, and other applications are corrupt. If a data
file is corrupt, that's a different story entirely. In most cases, it
is impossible for the operating system to know what is valid data and
what is not. Therefore, it cannot be expected to be able to correct
such data corruption.
If you do have corruption in an application's data file, you need to
turn to the application vendor for possible means of correcting the
problem. Well, what if that vendor is SCO? If there was some
corruption in a data file used by an SCO program, they would be the
ones who would know how to correct it. In many cases, this is
impossible. There is no way the system could correct problems, let's
say, in the nameserver data files. There is a certain format the
files need to follow, but the name server relies on human
intervention to ensure that these files are created correctly. Since
there is nothing to compare these files to, (no reality check) there
is little that can be done to correct the problem.
Fortunately, in the case of the TCB, there is such a reality check.
This is the authck
utility. Not only does it understand the formats of the
different files and identify problems, it can also correct many
of them. So important is the consistency of the TCB that the system
runs authck every time you
boot. This is done by the shell script /etc/authckrc,
which is started by init. The authck
utility needs to be run either by root or by some other user
with the auth subsystem privilege. Note that the chown kernel
privilege is also necessary if you want to correct the problems that
authck discovers. There are options to check each of the primary TCB
databases. The -p option
checks the Protected Password Database, the -s option
checks the Protected Subsystem Database, and the
-t option checks the Terminal Control Database.
If you plan on checking more than one database at a time, I recommend
either using the applicable options together (e.g., -ps)
or using the -a option
to check all databases. This is a lot quicker than checking
each database individually, as the databases do not need to be
reloaded every time. If you want, you can have authck
automatically correct the problems it finds by using the -y
option, or by using the -n
option you can tell it to correct nothing. Also useful is the
-v option. This verbosely
outputs all the problems that authck
finds, whether you tell it to correct them or not.
We now get to the "sanity checker" that perhaps most people
are familiar with: fsck,
the filesystem checker. Anyone who has lived through a system crash
or had the system shut down improperly has seen fsck.
One unfamiliar aspect of fsck
is the fact that it is actually several programs, one for each of the
different filesystems. This is done because of the complexities of
analyzing and correcting problems on each filesystem. As a result of
these complexities, very little of the code can be shared. What can
be shared is found within /etc/fsck.
When it runs, fsck
determines what type of filesystem you want to check and runs
the appropriate command. The "real" fsck
command, as well as many other commands, is found in
/etc/fscmd.d, where each
of the sub-directories is named for the filesystem type that the
commands are used on. Here you find a whole set of commands that are
used to access and manipulate a filesystem.
If you look, you will see that there are sub-directories for, among
others, ISO-9660 and ROCKRIDGE filesystems. Some of you may know
that these are filesystems found on CD-ROMs. CD-ROMs are read-only
filesystems. What's the point of running fsck
on a filesystem that is read-only? Even if it were corrupt,
there is no way to correct things. What purpose does it serve to
have fsck even look at it?
Well, there is no point. That is the point. So much so that there is
no fsck program for these filesystems. If you looked in the
sub-directory for one of them, you would not see an fsck
program. In fact, there are only two programs there: fstyp
and mount. These
are actually links to the same files for the other CD-ROM filesystem.
Regardless of what kind of filesystem you are checking, fsck
is a very complex program because the structure it is cleaning
and trying to correct is very complex. Depending on the filesystem
type you are checking, fsck
goes through up to eight different phases. For HTFS, EAFS, AFS, and
S51K filesystems the phases are the same, but for DTFS filesystems
they are different. Therefore, we are going to first talk about the
phases for HTFS, EAFS, AFS, and S51K.
Phase 0 is new to OpenServer. It is during this
phase that the intent log is replayed if intent logging has been
enabled. Unless you specify a full check, outstanding transactions
are completed and the filesystem is marked as clean. Since the
filesystem is clean at this point, there is no need to check further
and fsck exits.
In phase 1 of a full check, fsck
checks the inode table. Part of what is done is comparing the size of
the file to the number of blocks allocated on the hard disk. In
addition, fsck checks to
ensure that the number of links to this file is not zero. This could
possibly mean that someone had removed the file and the system did
not have time to update the inode table before the system went down.
Here, too, fsck checks the
sanity of the disk blocks. If one of the data blocks pointed to in
the inode is outside of the boundaries of the filesystem, then fsck
knows that something is wrong. The result is that fsck
removes the incorrect information, which means that the file is
removed, as well.
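To make the block sanity check a little more concrete, here is a minimal
sketch in Python. It is not how fsck is implemented. The inode table is
modeled as a plain dictionary, and the names find_out_of_range,
first_data_block and total_blocks are made up for the illustration.

```python
# Hypothetical model: inode number -> list of block numbers the inode claims.
# A real filesystem keeps this information in the on-disk inode table.

def find_out_of_range(inodes, first_data_block, total_blocks):
    """Return the inode numbers that point to a block outside the filesystem."""
    bad = []
    for ino, blocks in sorted(inodes.items()):
        # A block is insane if it lies before the data area or past the end.
        if any(b < first_data_block or b >= total_blocks for b in blocks):
            bad.append(ino)
    return bad


inodes = {2: [10, 11], 3: [12, 9000], 4: [13]}
print(find_out_of_range(inodes, first_data_block=10, total_blocks=100))
```

In this made-up example only inode 3 is flagged, because block 9000 lies
beyond the end of a 100-block filesystem.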
Another key aspect of phase 1 is searching for duplicate blocks. This
is where more than one inode points to the same data block on the hard
disk. Don't think that this is what a link is. A link is a file name
that points to the same inode. Here we have multiple inodes that
point to the same data. This is not supposed to happen. Once the
situation is corrected, fsck checks again to see if there are more
duplicate blocks. This is phase 1b.
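The difference between a link and a duplicate block is easier to see in
a small sketch. The Python below is only an illustration under the same
simplifying assumption as before (the inode table is a dictionary, and
find_duplicate_blocks is a made-up name). It reports every block that is
claimed by more than one inode.

```python
from collections import defaultdict

# Hypothetical model: inode number -> list of block numbers the inode claims.

def find_duplicate_blocks(inodes):
    """Return {block: [inode, ...]} for blocks claimed by more than one inode."""
    owners = defaultdict(list)
    for ino, blocks in sorted(inodes.items()):
        # set() so that one inode listing a block twice is not counted as two owners
        for block in set(blocks):
            owners[block].append(ino)
    return {block: inos for block, inos in owners.items() if len(inos) > 1}


inodes = {5: [100, 101], 6: [101, 102], 7: [103]}
print(find_duplicate_blocks(inodes))
```

Here block 101 is claimed by both inode 5 and inode 6, which is exactly
the situation that is not supposed to happen.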
Phase 2 is used to clean up what was found in phase 1. For example,
if we find there are files with duplicate blocks, fsck
has no choice but to remove both files. This is accomplished in phase
2. Here, too, fsck cleans
up directory entries that point either to inodes that are empty or
to ones that are non-existent. A non-existent inode is one whose
number is larger than the maximum possible on the filesystem or is
negative. You have only as many inodes as were allocated when the
filesystem was created.
In phase 3, file connectivity is checked. For example, let's say an
inode exists and points to valid data but there is no directory entry
on the system for it. This is referred to as an unreferenced
file. Something needs to be done with it, so it is placed in the
/lost+found directory. This is normally in the root directory
of the filesystem. It is then given a new name, which is simply its
inode number.
This phase brings up a few interesting issues. First, if it is
a directory that has no entry in some other (parent) directory, it
too will be placed in lost+found.
Now there is a sub-directory of /lost+found
with its name being the inode number of that sub-directory. However, the
contents of the directory "file" are intact. Therefore, the
file name-inode pairs in this directory are intact. You can cd
into that directory and see all the file names as if nothing had
ever happened. I have seen it where such a directory was
/bin (or was it
/usr/bin?). All that had to be done was to rename it /bin
(or /usr/bin) and
things were back to normal.
Next, there is a limit to the number of unreferenced
files that can be recovered on your system. Remember that each directory is simply a
file in a specific format. All that is done during this phase of
fsck is that the directory
file is being filled with the names of the "lost" files. As
I mentioned in the section on filesystems, the size of the
lost+found directory does not change during fsck.
Therefore, you run into problems if a lot of unreferenced files are
encountered. See the section on filesystems for more details.
If the unreferenced file is empty, there is no point in
placing it in lost+found.
Therefore, when fsck
encounters an unreferenced, zero-length file it will prompt
you to clear the inode (unless you used the -y
option, in which case it is cleared automatically). This can be
very upsetting to some administrators because often this phase
reports a large number of unreferenced, zero-length files. No
worries. These are usually only unnamed pipes that have been created
on the filesystem. (The HPPS does this all in memory and avoids this.)
In Phase 4, fsck checks
the link count of files. For example, if there are only two file names
that reference a particular inode, but the link count in that inode
is 3, then fsck needs to correct the count.
During Phase 5, fsck
examines the free-block list to resolve any missing or unallocated
blocks. Once all inconsistencies have been corrected, fsck
rebuilds the free list, in phase 6.
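The link-count comparison from Phase 4 can be sketched in a few lines of
Python. Again, this is a simplified illustration and not fsck's actual
code. Directory entries are modeled as (name, inode) pairs, and
link_count_mismatches is a made-up name.

```python
from collections import Counter

def link_count_mismatches(directory_entries, stored_counts):
    """Compare the link count stored in each inode with the number of
    directory entries that actually reference it.

    directory_entries: iterable of (file name, inode number) pairs
    stored_counts:     {inode number: link count recorded in the inode}
    Returns {inode: (stored, actual)} for every inode where the two differ.
    """
    actual = Counter(ino for _name, ino in directory_entries)
    return {
        ino: (stored, actual[ino])
        for ino, stored in stored_counts.items()
        if stored != actual[ino]
    }


entries = [("report", 7), ("report.bak", 7), ("notes", 8)]
print(link_count_mismatches(entries, {7: 3, 8: 1}))
```

Inode 7 has two names pointing at it but claims a link count of 3, so it
is the one reported.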
As I mentioned before, fsck
checks and repairs a DTFS on its own. Because of the nature of the
filesystem, there are only four phases. In phase 1, fsck
reads the inode bitmap and initializes the block bitmap. In phase 2,
the inodes are "validated." Part of this process is to
ensure that the B-tree structure of each file is maintained, so that
the tree remains balanced.
In phase 3, fsck rebuilds
the directory structure. Remember from our discussion on filesystem I
mentioned that the inode of each file contains a pointer to the
parent directory. Using this, fsck
can easily rebuild the directory structure and there is no need for a
lost+found directory since
files are not lost. If the inode is trashed, then the disk blocks
exist without an inode. In that case, there is no way to rebuild them
into the original file and the blocks can simply be returned to the
free list and the block bitmap is updated. In phase 4, the superblock
You may find yourself cleaning a large filesystem, where all the
necessary tables cannot fit into memory. As a result, fsck
requires a scratch file to store the tables. This can
be a real file on some other filesystem, or you can define a separate
filesystem just for scratch. If you know that the filesystem is too
large, then you can specify the scratch file on the command line with
the -t option. If you don't, fsck
will recognize the need for a scratch file and prompt you. If
you don't have a special scratch device, you can simply use a file on
some other filesystem.
System problems fall into several categories. The first is difficult
to describe and even more difficult to track down. For lack of a
better word, I am going to use the word "glitch." These are
problems that occur infrequently and in circumstances that are not
easily repeated. These can be caused by anything from users with fat
fingers to power fluctuations that change the contents of memory.
Next are special circumstances in software that are detected by the
CPU while in the process of executing a command. We discussed these
briefly in the section on kernel internals. These are traps, faults,
and exceptions. Many of these events are normal parts of system
operation and are therefore expected. This includes such things as
page faults. Other events, like following an invalid pointer, are
unexpected and will usually cause the process to terminate.
What if it is the kernel that causes either a trap, fault, or
exception? As I mentioned in the section on kernel internals, there
are only a few cases when the kernel is allowed to do this. If this
is not one of those cases, the situation is deemed so serious that
the kernel must stop the system immediately to prevent any further
damage. This is a panic.
When the system panics, it uses its last dying breath to run a
special routine that prints the contents of the internal registers
onto the console and dumps the contents of RAM onto the swap device.
At the end, it will call the kernel function haltsys(),
which stops the system.
Despite the way it sounds, if your system is going to go down, this
is the best way to do it. The rationale behind that statement is that
when the system panics in this manner, there is a record of what
happened. First, there is the dump image on the swap device. Second,
there is the register dump on the console screen. Both of these are
essential pieces of information when your system goes down.
If the power goes out on the system, then it is not really a system
problem in the sense that it was caused by an outside influence.
The same goes for someone pulling the plug or flipping the circuit breaker
(which my father-in-law did to me once). Although this kind of
problem can be remedied with a UPS, the first time the system goes
down before the UPS is installed can make you question the stability
of your system. There is no record of what happened and unless you
know the cause was a power outage, it could have been anything.
Another annoying situation is when the system just "hangs."
That is, it stops completely and does not react to any input. This
could be the result of a bad hard disk controller, bad RAM, or an
improperly written or corrupt device driver. Since there is no record
of what was happening, trying to figure out what went wrong is extremely
difficult, especially if the hangs are very sporadic.
Since a system panic is really the only time we can easily track down
the problem, I am going to start there. The first thing to think about is
the fact that as the system is going down it does two things: it writes
the registers to the console screen and writes a memory image to the dump
device. The fact that it does so as it's dying makes me think that
this is something important. Which it is.
The first thing to look at is the instruction pointer. This is
actually composed of two registers: the CS (code segment) and EIP
(instruction pointer) registers. This is the instruction that the
kernel was executing at the time of the panic. By comparing the EIP
of several different panics, you can make some assumptions about the
problem. For example, if the EIP is consistent across several
different panics, this indicates that there is a software problem.
The assumption is made because the system was executing the same
piece of code every time it panics. This usually indicates a
On the other hand, if the EIP changes from panic to panic, then this
indicates that probably no one piece of code is the problem and it is
therefore a hardware problem. This could be bad RAM or something
else. Keep in mind, however, that a hardware problem could cause
repeated EIP values, so this is not a hard-and-fast rule.
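If you write down the EIP from each panic, the comparison can be
sketched like this. The Python below is only a rough illustration of the
rule of thumb. The function name classify_panics, the sample addresses,
and the 50% threshold are all assumptions made for the example, and the
result is a hint, not a diagnosis.

```python
from collections import Counter

def classify_panics(eips, threshold=0.5):
    """Give a rough hint based on how often the most common EIP recurs.

    eips:      list of EIP values noted from the console after each panic
    threshold: assumed fraction of panics that must share one EIP before
               we lean toward a software problem (not an official figure)
    """
    if len(eips) < 2:
        return "unknown"  # one panic is not enough to compare
    _eip, count = Counter(eips).most_common(1)[0]
    return "software" if count / len(eips) > threshold else "hardware"


# Made-up addresses, purely for illustration.
print(classify_panics([0xF0012A40, 0xF0012A40, 0xF0012A40]))
print(classify_panics([0xF0012A40, 0xF00457B0, 0xF0098C14]))
```

As noted above, a hardware problem can also produce repeated EIP values,
so a "software" answer here only tells you where to look first.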
If your system was able to successfully write the dump image to the
swap device, then things may be a little simpler. If so, you can
examine the dump image and find out the name of the function the
system was in when it panicked. This is a lot more useful than just
an EIP, because the function name can point to something more
specific. Say, for example, the system panicked inside the
Sdskintr() routine. I know
that this has to do with the Sdsk (SCSI hard disk) driver. Therefore,
I might consider a hardware problem with the SCSI hard disk.
When the system boots after a panic, it recognizes that there
is a dump image on the dump device, provided the dump device is
/dev/swap. At that time
you have the option of saving the image, removing it or simply
leaving it alone. If you leave it alone, it remains on the dump
device until you decide to remove it. Note that the first time you
swap, the image gets trashed.
When you do boot, you need to go into single user mode (to ensure you
don't start swapping). Once there, run crash
-d /dev/swap (assuming
/dev/swap is the dump
device). When you reach the prompt (>),
simply type in panic. This
will give you a stack trace of the last few system calls executed
before the crash. The top one will be the one the system was running
when it crashed. If it is not obvious from the name of the function,
you can look through the Driver.o
files in /etc/conf/cf.d
using either nm or
strings to find out which
driver the function is in.
Why do I specifically tell you to look in the device driver files?
Well, since 99% of the time the cause of the panic is a
corrupt or poorly written device driver, this is a good place to
start.
One system panic does not necessarily tell you what the problem was.
If you run crash and the panic
command says that there is something wrong with the Srom
driver, you cannot assume there is something wrong with your CD-ROM drive
(assuming you knew what the Srom driver was). It could just as easily
be a bad spot of RAM that was not detected as having a parity error.
If you have multiple panics that seem to point to the same thing, but
cannot figure out exactly what the cause is, you should run crash
-d <dumpdevice> -o <file> to save the output and
run the following functions inside of crash:
panic - prints the panic information
- prints a kernel stack trace
- prints uarea of the active process when the system panicked
- prints the process table when the system panicked.
Whoever is providing you with your support should be able to make some sense
of the output.
The problem with this approach is that the kernel is generally loaded
in the same way all the time. That is, unless you change something,
it will occupy the same area of memory. Therefore, it's possible that
bad RAM makes it look like there is a bad driver. The way to verify
this is to change where the kernel is loaded in physical memory. You can do
this by either rearranging the order of your memory chips or using
the mem= option to boot
to limit what memory is accessed.
Keep in mind that this technique may not tell you which SIMM
is bad, only indicate that you may have a bad one. The only sure-fire
test is to swap out the memory. If the problem goes away with new RAM
and returns with the old RAM, you have a bad SIMM.
If you are unfortunate enough to have the system hang or even reboot itself,
then there is no dump image to look at and no EIPs to compare. The
first place to look is the system log file, /usr/adm/messages.
Even if the system did panic, there may be some information there to
indicate what went wrong. This is often in the form of a kernel
message.
Kernel messages fall into five categories and usually have the
format:
category: name: routine message
The category can be one of the following, in increasing order of
severity: CONFIG, NOTICE, WARNING, FATAL, and PANIC.
Although not always present, the name
represents the device driver or sub-system name having
problems. If it is a device driver, you will probably see the major
and minor numbers of the offending devices. This makes tracking down
the problem a lot easier. The routine
portion is also not always present and usually is not as obvious as
the major and minor numbers. However, you can still attempt to track
down the device by looking through the Driver.o
files.
A CONFIG message normally indicates that the value of one of the
kernel tunable parameters has been exceeded. This will be followed by
the kernel parameter in question and the value that was exceeded. The
remedy is to either increase the parameter or limit access to that
resource. For example, if you are running ODT or OpenServer with a
maximum value for the size of the process table, you might get a
message that says you have exceeded that limit. To correct this,
either increase the size of the table (NPROC on ODT or MAX_PROC on
OpenServer) or limit the number of users so you don't run out of
processes. On the other hand, if the problem is caused by some
run-away process, rebooting the system might correct the problem. See
the section on monitoring system activity for more details.
A NOTICE is somewhat more urgent than a CONFIG message. This indicates that a
situation has occurred that should be monitored. For example, running
out of space or inodes on a filesystem would generate a NOTICE.
Normally, this is not associated with any kernel parameter so a
relink and reboot is not necessary. An example of this would be:
Srom: Not ready on SCSI CD-ROM 0 dev 51/0 (ha=0 id=5 lun=0)
For some reason, my CD-ROM is not ready. This might indicate a
hardware problem. However, in my case this came from the fact that I
automatically mount a CD-ROM during boot up and this time there was
no CD in the drive. But, these can also indicate a more serious
problem. For example,
Sdsk: Unrecoverable error reading SCSI disk 0 dev 1/0 (ha=0 id=0
When I ran scsibadblk
(used to check bad blocks on a SCSI device) this block came
up as bad. I was personally upset at treating a bad block as
something comparable to not having a CD in the drive. In my opinion
this error requires a higher level message, like a warning.
WARNINGs may require immediate attention. The keyword is "may."
It's possible that the warning could be something harmless like this:
floppy: Disk is write protected in fd0 dev 2/60
On the other hand, it could be something as dramatic as:
floppy: Read error on dev 2/60, block=20 cmd=0x03 status=0x01
This indicates a problem that might make me lose data. To me, a
write-protected floppy is at the same level as not having a CD in the
drive. Being able to determine the severity of the problem from these
messages is not always simple.
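If you have a lot of these messages to wade through, a small script can
at least pull out the pieces described above. The Python below is a
minimal sketch and not an SCO tool. It assumes the
"category: name: routine message" layout, treats both the category and
the name as optional, and looks for the "dev major/minor" pattern seen
in the examples. The function name parse_kernel_message is made up, and
the "WARNING:" prefix on the first sample line is my assumption about
how the category would appear.

```python
import re

CATEGORIES = ("CONFIG", "NOTICE", "WARNING", "FATAL", "PANIC")

# Optional "CATEGORY: " prefix, optional "name: " prefix, then the rest.
LINE_RE = re.compile(r"^(?:(%s):\s*)?(?:(\w+):\s*)?(.*)$" % "|".join(CATEGORIES))
# Major/minor numbers as they appear in the sample messages ("dev 51/0").
DEV_RE = re.compile(r"\bdev (\d+)/(\d+)")

def parse_kernel_message(line):
    """Split one kernel message into category, name and device numbers."""
    category, name, text = LINE_RE.match(line.strip()).groups()
    dev = DEV_RE.search(text)
    return {
        "category": category,   # None if the line has no category prefix
        "name": name,           # driver or sub-system name, if present
        "major": int(dev.group(1)) if dev else None,
        "minor": int(dev.group(2)) if dev else None,
        "text": text,
    }


for line in (
    "WARNING: floppy: Read error on dev 2/60, block=20 cmd=0x03 status=0x01",
    "Srom: Not ready on SCSI CD-ROM 0 dev 51/0 (ha=0 id=5 lun=0)",
):
    print(parse_kernel_message(line))
```

Sorting or counting the results by name and major/minor number is one
way to see which driver keeps complaining. It will not tell you how
serious a given message is. That judgment is still up to you.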
FATAL errors are not happy things. These can be the result of
hardware problems such as:
Parity error in the motherboard memory
This means you need to replace some of your RAM. However, this kind of
message can also be the result of fat fingers. For example:
Bad bootstring syntax - kernel.auito
Look carefully. There is an extra 'i' in "auito." This was
the result of me wanting to go right into multi-user mode by typing
unix.auto at the Boot:
prompt. Because /boot
didn't know what to do with my typo, I got this message.
FATAL messages usually appear before the system panics. Therefore it
is a good idea to keep track of these messages. This is also a case
where the FATAL message is immediately followed by:
Illegal bootstring, cannot continue
Some things are just so picky.
Lastly is our old friend PANIC. When it gets to this stage, things
are too severe to continue. It is rare that software is the cause,
but do not discount it completely. Corrupt software can cause panics
as well as drivers that were designed for a different release of the
operating system. If you have just installed some new hardware, which
requires a relink and reboot, and then your system panics, this is a good
sign that there is a hardware problem. Check the release of the
driver to make sure it is supported. If it is, swap out the hardware.
If the kernel is in the process of panicking and something else
occurs that would normally cause a panic, then a double panic occurs.
Although this sounds a bit more serious, they may have the same
cause. Therefore, treat a double panic as you would a single panic.
Getting to the heart of it
Okay, so we know what types of problems can occur. How do we correct
them? If you have a contract with a consultant, this might be part of
that contract. Take a look at it and read it. Sometimes the
consultants are not even aware of what is in their own contracts. I
have talked to customers who have had consultants charge them for
maintenance or repair of hardware, insisting that it was an extra
service. However, the customer would whip out the contract and show
them that this was included.
If you are not fortunate enough to have such an expensive contract, then you
will obviously have to do the detective work yourself. If the printer
catches fire, then it is pretty obvious where the problem is.
However, if the printer just stops working, figuring out what is wrong
is often difficult. Well, I like to think of problem solving the way
Sherlock Holmes described it in "The Seven Percent Solution"
(and maybe other places):
"Eliminate the impossible and whatever is left over, no matter
how improbable, must be the truth."
Although this sounds like a basic enough statement, it is often
difficult to know where to begin eliminating things. In simple cases,
we can begin by eliminating almost everything. For example, suppose
we were having system hangs every time we used the tape drive. It
would be safe at this point to eliminate everything but the tape
drive. So, the next big question is whether it is a hardware problem or
a software problem.
Potentially that portion of the kernel containing the tape driver is
corrupt. In this case, simply relinking the kernel may be enough to
correct the problem, because when you relink, you link in a new
copy of the driver. If that is not sufficient, then restoring the
driver from the distribution media is the next step. However, based
on your situation, checking the hardware might be easier, depending
on your access to the media.
If this tape drive requires its own controller and you have
access to another controller or tape drive, you can swap components
to see if the behavior changes. However, just like you don't want to
install multiple pieces of hardware at the same time, you don't want
to swap multiple pieces. If you do and the problem goes away, was it
the controller or the tape drive? If you swap out the tape drive and
the problem goes away that would indicate that the problem was in the
tape drive. However, does the first controller work with a different
tape drive? You may have two problems at once.
If you don't have access to other equipment that you can swap, then
there is little that you can do other than verifying that it is not a
software problem. I had at least one case while in SCO Support
where a customer called in insisting that our driver was broken
because he couldn't access the tape drive. Since the tape drive
worked under DOS and the tape drive was listed as supported, either
the doc was wrong or something else was. Relinking the kernel and
replacing the driver had no effect. We checked the hardware settings
to make sure there were no conflicts, but everything looked fine.
Well, we had been testing it using tar
the whole time, since tar
is quick and easy when you are trying to do tests. When we ran
a quick test using cpio,
the tape drive worked like a champ. When we tried outputting tar
to a file, it failed as well. Once we replaced the tar
binary, everything worked correctly.
If the software behaves correctly, then there is the potential for
conflicts. This only occurs when adding something to the system. If
you have been running for some time and suddenly the tape drive stops
working, then it is unlikely that there are conflicts. Unless, of
course, you just added some other piece of hardware. If problems
arise after adding hardware, remove it from the kernel and see if the
problems go away. If they don't go away, remove the hardware
physically from the system.
Another issue that people often forget is cabling. I have done
it myself where I had a new piece of hardware and after relink and
reboot, something else doesn't work. After removing it again, the
other piece still doesn't work. What happened? When I added the
hardware, I loosened the cable on the other piece. Needless to say,
pushing the cable back in fixed my problems.
I have also seen cases where the cable itself is bad. One support
engineer reported a case to me where just pin 8 was bad. Depending on
what was being done, the cable might work. Needless to say, this
problem was not easy to track down.
Potentially the connector on the cable is bad. If you have something
like SCSI, where you can change the order on the SCSI cable without
much hassle, this is a good test. If you switch hardware and the
problem moves from one device to the other, this could indicate one
of two things. Either the termination is messed up or the connector is
bad.
If you do have a hardware problem, oftentimes it is the result of a
conflict. If your system has been running for a while and you just
added something, then it is fairly obvious what is conflicting. If
you have trouble installing, then it is not always clear. In such
cases, the best thing is to remove everything from your system that
is not needed for the install. In other words, strip your machine to
the "bare bones" and see how far you get. Then add one
piece at a time. Once the problem recurs, you know you have found the
culprit.
When trying to track down a problem yourself, remain calm. Keep in
mind that if the hardware or software is as buggy as you now think it
is, the company would be out of business. It's probably one small
point in the doc that you skipped over (if you even read the doc) or
there is something else in the system conflicting with it. Getting
upset does nothing for you. In fact (speaking from experience),
getting upset can cause you to miss some of the details that you're
looking for.
As you are trying to track down the problem yourself, examine the
problem carefully. Can you tell if there is a pattern to when/where
the problem occurs? Is the problem related to a particular piece of
hardware? Is it related to a particular software package? Is it
related to the load that is on the system? Is it related to the
length of time the system has been up? Even if you can't tell what
the pattern means, the support rep has one or more pieces of
information to help in tracking down the problem. Did you just add a
new piece of hardware or SW? Does removing it correct the problem?
Did you check to see if there are any HW conflicts such as base
address, interrupt vectors and DMA channels?
I have talked to customers who were having trouble with one
particular command. They insisted that it did not work correctly and
that therefore there was a bug in either the software or the doc. Since they
were reporting a bug, we allowed them to speak with a support
engineer even though they did not have a valid support contract. They
kept saying that the doc sucked because the SW did not work the way it
was described in the manual. After pulling some teeth, I discovered
that the doc they were using was for a product that was several years old. In
fact, there had been three releases since then. They were using the
latest software, but the doc was from the older release. No wonder
the doc didn't match the software.
It may happen that the system crash you just experienced no longer
allows you to boot your system. What then? The easiest solution (at
least easiest in terms of figuring out what to do) is reinstalling.
If you have a recent backup and your tape drive is fairly fast, this
is a valid alternative, provided there is no hardware problem causing
the crash.
In an article I wrote for SCO's DiSCOver magazine, I compared a
system crash to an earthquake. The people that did well after the
1989 earthquake in Santa Cruz were the ones that were most prepared.
The people that do well after a system crash are also the ones that
are best prepared. As with an earthquake, the first few minutes after a
system crash are crucial. The steps you take can make the difference
between a quick, easy recovery and a forced re-install.
In previous sections we talked about the different kinds of problems
that can happen on your system, so there is no need to go over them
again here. Instead we will concentrate on the steps to take after
you reboot your system and find that something is wrong. It's
possible, that when you reboot, all is well and it will be another
six months before that exact same set of circumstances occurs. On the
other hand, your screen may be full of messages as it tries to bring
itself up again.
Because of the urgent nature of system crashes and the potential loss
of income, I decided that this was one troubleshooting topic I would
hold your hand on. There is a set of common problems that occur after
a system crashes that need to be addressed. Although the cause of
the crash can be any of a wide range of different events, the range of
results is small by comparison. With this in mind, and given the importance
of getting your system running again, this is one place where I am
going to forget what I said about giving you cookbook answers to
problems.
Let's first talk about those cases where we can no longer boot at
all. Here, you need to think back to our discussion of starting
and stopping the system and consider the steps the system goes
through when booting. I talked about them in detail before, so I
will only review them here as necessary to describe the problems.
As I mentioned, when you power on a computer, the first thing is the
Power-On Self-Test, or POST. If something is amiss during the POST,
you will usually hear a series of beeps. Hopefully, there will be
some indication on your monitor of what the problem is. It can be
anything from incorrect entries in your CMOS to bad RAM. If not,
maybe the hardware documentation has something about what the beeps
mean.
When finished with the POST, the computer executes code that looks
for a device from which it can boot. On an ODT or OpenServer system,
this boot device will more than likely be the hard disk. The built-in
code finds the active partition on the hard disk and begins to
execute the code at the beginning of the disk. What happens if the
computer cannot find a drive to boot from is dependent on your
hardware. Often there will be a message indicating that there is no
bootable floppy in drive A. It is also possible that the system
If you have a hard disk installed and it should contain valid
data, then potentially your masterboot block is corrupt. If you
created the boot/root floppy set like I told you, then you can use
fdisk from it to recreate
the partition table using the values from your notebook. Load the
system from your boot/root floppy set and run fdisk.
Booting from the floppy works just like booting from the hard disk.
With the floppy in the drive, you boot your system. When you get to
the Boot: prompt, you simply press Enter. After loading the kernel, it
prompts you to insert the root filesystem floppy. You do that and
press Enter. A short time later, you are brought to the # prompt, from
where you can begin to issue commands.
When you run fdisk, what
you will probably see is an empty table. Because you made a copy of
your partition table in your notebook like I told you to do, you
simply fill in the values exactly the way they were before. Be sure
to mark the same partition active that was active before. Otherwise,
you won't be able to boot, or you might boot but corrupt
your filesystem. When you exit fdisk,
it will write out a copy of the master boot block to the
beginning of the disk. When you reboot, things will be back to normal.
(I've talked to at least one customer who literally laughed at me when
I told him to do this. He insisted that it wouldn't work and that I
didn't know what I was talking about. Fortunately for me, each time I
suggested it, it did work. However, I have worked on machines of my
own where it didn't work. With a success rate well over
50%, it's obviously worth a try.)
However, if you did not follow my friendly advice and write down the
fdisk parameters, all is
not lost. When you installed a copy of your SCO system, it made a copy
of the masterboot block and stored it as /etc/masterboot.
When you create a boot/root floppy set, this file is copied onto the
root floppy. By using dd,
which is also on your root floppy, you can rewrite the
masterboot block. The command would be:
/bin/dd if=/etc/masterboot of=/dev/rhd00 bs=376 count=1
This means that dd
will copy one 376-byte block from /etc/masterboot
to /dev/rhd00, which is
your boot hard disk. Be careful when you type this. If you mistype
the size and make it 736, or type count=10,
you'll need to get out your installation media and start over.
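Because a slip in that command is so costly, it can be worth rehearsing it on scratch files first. What follows is only a sketch under stated assumptions: disk.img and masterboot are stand-in files invented for the illustration, it is meant for a current POSIX shell rather than the boot/root floppy, and conv=notrunc is there only because the target is a regular file, which dd would otherwise truncate. It shows that bs=376 count=1 writes exactly 376 bytes and leaves the rest of the target alone.

```shell
# Scratch-file rehearsal: nothing here touches a real disk.
work=$(mktemp -d)

# Stand-in for the disk: 2048 zero bytes.
dd if=/dev/zero of="$work/disk.img" bs=512 count=4 2>/dev/null

# Stand-in for /etc/masterboot: 512 bytes of 'M', more than we want written.
head -c 512 /dev/zero | tr '\0' 'M' > "$work/masterboot"

# Same bs and count as in the text. conv=notrunc is an addition for the
# rehearsal only, so the regular-file target keeps its full length.
dd if="$work/masterboot" of="$work/disk.img" bs=376 count=1 conv=notrunc 2>/dev/null

# Every non-zero byte in the image is one that dd wrote.
changed=$(tr -d '\0' < "$work/disk.img" | wc -c | tr -d ' ')
size=$(wc -c < "$work/disk.img" | tr -d ' ')
echo "bytes written: $changed, image size: $size"
```

Running it prints "bytes written: 376, image size: 2048". Changing bs to 736 or count to 10 in the rehearsal is a harmless way to see how much more would have been overwritten.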
One thing that I would like to point out is that the /etc/masterboot
file is not updated when the partition table changes. If you still
have root on your hard disk
and later add partitions, you could swap the if
and of entries in
the above command, which will update the /etc/masterboot
file to reflect your current masterboot block. If you have written
down the configuration, then this is not a problem. You can have a
corrupt or outdated /etc/masterboot
and it won't matter. You can always use fdisk
and input the values by hand.
Some of you might be thinking that once the system is
installed, the partition table isn't going to change. Well, if you
have used the entire disk for UNIX, this is probably true.
However, if you are like me and have multiple operating systems, the
issue is not so simple. Once, I had two DOS partitions and a single
ODT 3.0 partition on my first drive. When OpenServer arrived, I did
not want to remove ODT. Instead, I reconfigured my DOS partitions so
that the first was smaller and I moved everything in the second
partition onto a new drive. I then installed OpenServer in the
leftover space. From OpenServer's perspective, /etc/masterboot
is still valid. However, from ODT's perspective it is not.
If the hardware sees that you have a hard disk, but cannot find a
valid, active partition, you may see the message:
This message is the result of either a corrupt masterboot block or a
corrupt boot0.
If caused by a corrupt masterboot block, you can boot from the
boot/root floppy and recreate it as I described above. Also, if the
drive parameters are wrong, you end up looking at a different part of
the disk than you should and get this message. If you have a
SCSI hard disk, then it is unlikely that this message is the cause of
The first thing to check is the hard disk parameters (assuming you
don't have a SCSI hard disk). To see what the current parameters are,
which indicates the first drive on the first controller. You then
have a menu from which you select option 1 to display the current
disk parameters. If these parameters match your hard disk, then all
is well. Leave them alone. If they don't match you can modify them
with the correct values. How do you know what the correct values are?
This is one of those things that I told you to write down in your notebook.
You may find that the partition table was valid. If so, the problem
was more likely the result of a corrupt boot0 or boot1.
Unfortunately, there is no magic command you can run to replace them
as there is for the masterboot block. However, there are copies of them on
the system, so you can use dd to
copy them onto the hard disk, like this:
dd if=/etc/hdboot0 of=/dev/hd0a
dd if=/etc/hdboot1 of=/dev/hd0a bs=1k seek=1
If you think back to the section where we discussed the hard disk
layout, you'll remember that boot0 is a relatively small section on
the hard disk. You can see that the size of /etc/hdboot0
is less than 1K. When we dd it
out to the hard disk, dd
only writes as many bytes as are in the file. Next, we have boot1.
This is a little larger, but starts 1K from the beginning of the
partition. That's why we need to seek in 1 block. Here we set the
block size to 1K (bs=1k). Note that if you made a mistake and put in a
block size of 2K, then dd would
seek in 2K before starting to write and you would end up overwriting
whatever is stored there.
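The interplay of bs and seek is easy to check on a scratch file before trying it on a disk. This is a minimal sketch, assuming a current POSIX shell: partition.img, boot0 and boot1 are made-up stand-ins (five-byte marker files, not real boot code), and conv=notrunc is added only because the target is a regular file.

```shell
work=$(mktemp -d)
img="$work/partition.img"

# Stand-in for the start of the partition: 8K of zeros.
dd if=/dev/zero of="$img" bs=1k count=8 2>/dev/null
printf 'BOOT0' > "$work/boot0"
printf 'BOOT1' > "$work/boot1"

# First stage: written at the very start, only as many bytes as the file holds.
dd if="$work/boot0" of="$img" conv=notrunc 2>/dev/null
# Second stage: bs=1k seek=1 skips one 1K output block before writing.
dd if="$work/boot1" of="$img" bs=1k seek=1 conv=notrunc 2>/dev/null

# Read five bytes back from each offset to see what landed where.
at_0=$(dd if="$img" bs=1 count=5 2>/dev/null)
at_1k=$(dd if="$img" bs=1 skip=1024 count=5 2>/dev/null)

# The mistake from the text: a 2K block size moves the same seek=1 to 2048.
cp "$img" "$work/mistake.img"
dd if="$work/boot1" of="$work/mistake.img" bs=2k seek=1 conv=notrunc 2>/dev/null
at_2k=$(dd if="$work/mistake.img" bs=1 skip=2048 count=5 2>/dev/null)

echo "offset 0: $at_0, offset 1024: $at_1k, mistaken copy at 2048: $at_2k"
```

The point of the last step is that seek counts blocks, not bytes, so the block size decides where the write begins.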
At this point, the system is trying to load and run the first real
If you are running ODT, then this is in the root directory of
your root filesystem and on OpenServer, this is in the root directory
of the /dev/boot
filesystem. For simplicity's sake, let's just call these both
the "boot" filesystem. Since the two operating systems use
the same files to boot and only the name of the filesystem differs, calling each
the boot filesystem makes life easier.
If boot1 runs into trouble loading /boot,
there are several things that could cause this. The easiest to
correct is if the division table has become corrupt. If you run divvy
from your boot/root floppy set and see an empty table, as with the partition
table, you can simply input the values from your notebook.
(Are you beginning to understand why this notebook is so important?)
Unfortunately, a copy of the division table is not something that is
kept. If your division table is messed up and you didn't write down
the values, you're hosed.
One thing is very important when inputting the values into the
division table: do not create the filesystem. Only
put in the values for the starting and ending block. You don't even
have to name them. (Remember the name comes from the device node?) If
you create them, a new filesystem will be created. This means
that all the data will be lost.
Well, not entirely. You see, all that is done is that the inode table
gets re-created. It's not like the disk is formatted. This is similar
to a quick format under DOS where the FAT is overwritten. The data is
still there, but there are no pointers to the data.
I accidentally did this on my own system with no backups. Since this
was the second hard disk and there was only one filesystem, things
were easier. The data was only text files and the disk was relatively
unfragmented, so I used a series of dd
commands to write
1 MB files onto my other partition. I could then look through these
files and decide if there was anything of value in them. It took
several hours, but I estimate I had a recovery rate of at least 95%.
Not bad, but I wouldn't recommend trying it yourself.
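For the curious, the general shape of such a series of dd commands might look like the following. This is a sketch of the idea and not the exact commands used back then: it runs against a scratch image (lost.img, a made-up stand-in for the raw device of the damaged division), assumes a current POSIX shell, and plants one line of text so there is something to find.

```shell
work=$(mktemp -d)
src="$work/lost.img"

# Build a 2.5 MB stand-in with a recognizable line of text in the second megabyte.
dd if=/dev/zero of="$src" bs=1024 count=2560 2>/dev/null
printf 'Dear customer, your invoice is attached.\n' |
    dd of="$src" bs=1024 seek=1500 conv=notrunc 2>/dev/null

chunk_bytes=1048576                      # 1 MB per output file
size=$(wc -c < "$src" | tr -d ' ')
total=$(( (size + chunk_bytes - 1) / chunk_bytes ))

n=0
while [ "$n" -lt "$total" ]; do
    # skip=N jumps over N whole input blocks, so each pass reads the next megabyte.
    dd if="$src" of="$work/chunk.$n" bs="$chunk_bytes" skip="$n" count=1 2>/dev/null
    n=$((n + 1))
done

# Report which chunk holds readable text worth a closer look.
hit=$(grep -la 'invoice' "$work"/chunk.*)
echo "chunks written: $total, text found in: $(basename "$hit")"
```

On a real disk the chunks have to land on a different filesystem from the one being read, which is why the text above writes them to the other partition.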
If /boot is not there or
otherwise can't be loaded you end up with a message similar to:
boot not found
1 boot failure: error loading hd(40)/boot
Since you also need /boot to boot off the
floppy, this is a logical place to get it from. The first thing
you need to do is to boot from the boot/root floppies, of course.
Since you are actually copying files and not dd'ing
the contents onto the hard disk, you have to first mount the boot
filesystem. If the boot filesystem went down dirty, then the mount
will fail, since you need to clean it first. More than likely, the
boot filesystem on OpenServer is clean, since /dev/boot
is normally mounted read-only. Even if it is clean, there is
no harm in cleaning it again. Therefore, before I mount the boot
filesystem, I always run fsck.
If there were a lot of pipes open when the system went down, there
will be a lot of unreferenced files. Therefore, you might find
yourself pressing y to the
prompt to clear all these unreferenced, zero length files. Instead,
you could start fsck
with the -y option
to have it assume "yes" to each prompt. The question is:
"Do you feel lucky?"
Actually, it isn't that bad. If the file contains data, it ends up in
/lost+found; if not, it
gets trashed. Why would you want a lot of zero-length files taking up
vital directory entries in lost+found?
My suggestion is that you mount the boot filesystem like this if you are
running ODT:
/etc/mount /dev/root /mnt
and like this, if you are running OpenServer:
/etc/mount /dev/boot /mnt
Now you can copy /boot
like any other program:
Now is a good time to check and see if there is a kernel on the boot
hard disk. It doesn't have to be the most recent one (probably called
unix); anything in the root directory of the boot filesystem will do
(e.g., unix.old, unix.orig,
unix.N1). If you find one, you can copy it to /mnt/unix.
Once you get past /boot
and can load your kernel, you are on the hard disk and have a
lot more options in terms of how to proceed.
If you discover that your root filesystem is too corrupt to mount
and fsck fails, you can
try to mount the filesystems as read-only and dump the filesystem to
tape. If this works, you at least have access to your files.
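The archiving half of that rescue can be tried out ahead of time without a tape drive. The sketch below is an illustration only: it assumes a current POSIX shell with tar, uses a scratch directory in place of the read-only mount point, and writes to a file named tape.tar instead of a tape device. All of those names are made up for the example.

```shell
work=$(mktemp -d)

# Stand-in for the filesystem mounted read-only on /mnt.
mkdir -p "$work/mnt/etc"
echo "important data" > "$work/mnt/etc/notes"

# Archive relative to the mount point so the files can be restored anywhere.
tar cf "$work/tape.tar" -C "$work/mnt" .

# List the archive before trusting it.
listing=$(tar tf "$work/tape.tar")
printf '%s\n' "$listing"
```

Archiving with relative paths matters here, because the restore will almost certainly go somewhere other than the damaged filesystem.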
One thing I haven't mentioned yet is that we don't have to copy
many of these things from the floppy to the hard disk. We can take
advantage of some of the boot magic I talked about in the section on
starting and stopping your system. Assume that just the masterboot
block is corrupt, so we can't boot from the hard disk. If we can get
to the Boot: prompt from
the floppy, then we can access the kernel and the root filesystem on
the hard disk. For example, if we had OpenServer, at the Boot:
prompt, we could type in:
In each case, it uses the hard disk (hd)
driver, first getting the kernel off of minor 40, then using
minor 41 for the dump and swap devices, and minor 42 for the
root filesystem. Notice that this takes into account the different
filesystems that /unix
is on versus the root filesystem. For ODT the same command
would look like this:
If, for some reason, there was no kernel in the boot filesystem, you
could change the location from which Boot:
gets the kernel. Hopefully, the one on the floppy works, so
you could use that instead. Here the command might look like this for
This takes the kernel from the floppy device (fd)
with a minor number of 64, which is /dev/fd0.
If we wanted to extend this, we could take advantage of boot aliasing
and modify /etc/default/boot on
the floppy. We could create an alias that took the kernel and root
filesystem from the hard disk. It might look like this:
Therefore, when we boot off the boot floppy and get to the Boot:
prompt, we just type in hdunixroot
and it executes that bootstring.
If you have a system crash, the "safest" thing is to
reinstall. However, "safe" doesn't always mean best. If the
crash was caused by a unique set of hardware and software
circumstances that won't occur for another six months, then
re-installing probably won't fix the problem. Even if you have the
CD-ROM distribution, re-installing means several hours of down time.
If you have the company president or a horde of angry users breathing
down your neck, this is not a realistic option. You need to get the
computer up and running as soon as possible. You need to determine
the problem, find a solution and get everyone else back to work.
If you try to boot from the hard disk and get either of these
stage 1 boot failure not a
srmountfun- Error 6 mounting rootdev (1/40)
PANIC with srmountfun,
then the best thing is probably to restore from backups. These are
essentially saying that the system does not recognize the root
filesystem. If that's the case, you cannot continue. Out of the
dozens of crashes I've had to work users through, only once was this
corrected by going through all of the steps I described above.
If you have good, current, reliable backups of your programs and
data, then the most reliable method of crash recovery is to restore
from backup. If not, there are many professional data recovery
services, which stand a good to excellent chance of recovering your
data. They are relatively expensive and the turn-around time may be
a week or longer. If it is imperative that you recover your data,
this is your best chance.
A common problem is that the system appears to hang when it reaches
the prompt to press CTRL-D to
continue or enter root password for maintenance mode. Neither CTRL-D
nor the root password seem to work. Pressing ENTER
would normally just repeat that message. However, that too appears
not to work. This problem often occurs after the system crashes, as
well as when it is shut down improperly.
The cause of this problem is a corrupt (not missing)
/etc/ioctl.syscon file. This controls I/O for the device /dev/syscon,
which is what is accepting input at this moment. To correct the
problem, you remove the file. When the system reboots, it sees the
file is missing and recreates it. Fortunately, you don't have to boot
from the floppy to remove it. The issue here is that the file is
corrupt and not behaving correctly. Instead of pressing the ENTER
key, you press CTRL-J.
Everything else should work correctly. So, if you wanted to go into
multi-user mode, pressing CTRL-D
then CTRL-J starts you on
your way to multi-user mode. If you enter the root password followed
by the CTRL-J, this will
bring you into maintenance mode.
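The text does not say why CTRL-J works where ENTER does not. One plausible explanation, offered here as an assumption: ENTER sends a carriage return, which the terminal driver normally translates into a newline, and with corrupt terminal settings that translation may not happen. CTRL-J sends the newline character directly. The short sketch below, for a current POSIX shell, only shows the character codes involved.

```shell
# Byte sent by the Enter key: carriage return.
cr=$(printf '\r' | od -An -tu1 | tr -d ' \n')
# Byte that ends a line of input: line feed (newline).
lf=$(printf '\n' | od -An -tu1 | tr -d ' \n')
# A control character is the letter's ASCII code minus 64.
j=$(printf '%d' "'J")
ctrl_j=$((j - 64))
echo "Enter sends $cr, newline is $lf, CTRL-J sends $ctrl_j"
```

This prints "Enter sends 13, newline is 10, CTRL-J sends 10". On a healthy terminal, stty -a usually lists icrnl, the setting that maps the one onto the other.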
Both the SCO OpenServer Handbook and the ODT 3.0 System Administrator's
Guide contain steps for recovery from additional boot problems. They
also include many of the problems we discussed here, but I felt that
the importance of the issue as well as the frequency of these
problems warranted repeating the information.
Odds and Ends
Experience has taught me that sooner or later you will get to a
problem that you cannot solve. No matter how many changes you make to
the configuration files and no matter how many times you reboot, the
problem just won't go away. If the problem is the result of a bug,
then SCO Support will help you, even without a support contract.
There are often patches available that fix many known problems. These
are called Support Level Supplements, or SLSs. SLSs are available for
download via UUCP, ftp and SCO's Web Server.
In addition to SLSs, there are Enhanced Features Supplements, or
EFSs. These are not patches or bug fixes, but rather enhancements to
the system. As a result, many are not available free of charge.
Depending on where the EFS came from and what features it includes,
you may have to pay not just the media costs, but also
royalties or "development" costs.
SCO also provides free access to their Information Tools (IT) scripts
via the SCO Web Server. This is a set of thousands of articles
covering everything from a description of virtual memory to details on
how to get multiple SCO operating systems on the same partition to
instructions on how to overcome bugs and other problems.
Available on CD is the SCO Support Services Library. This contains
the IT scripts as well as character-based and X-based search and
viewing programs. Along with the IT scripts are many of the SLSs and
EFSs. This is available through a yearly subscription from SCO.
Although the price might seem a bit much at first, it is well worth
the money considering the time saved by having the information and
supplements immediately available.
For more tips on getting help, I suggest you read the next chapter.
/bin/pstat - Report system information.
/bin/who - Lists who is on the system.
/bin/whodo - Determines what process each user is running.
/etc/badtrk - Checks for bad spots on your hard disk.
/etc/crash - Examine the running kernel.
/etc/custom - Displays information about installed packages. (Also used to install and remove software.)
/etc/dfspace - Calculate available disk space on all mounted filesystems (front end to df).
/etc/divvy - Create and administer divisions.
/etc/fdisk - Create and administer disk partitions.
/etc/fixperm - Correct and report on file permissions and ownership.
/etc/fixmog - Correct and report on file permissions and ownership.
/etc/fsck - Check and clean filesystems.
/etc/fuser - Indicates which users are using particular files and
/etc/hwconfig - Displays hardware configuration information.
/etc/ifconfig - Configure network interface parameters.
/etc/ps - Reports information on all processes.
/tcb/bin/authck - Examine and correct system security files.
/tcb/bin/integrity - Examine and correct system files against
/usr/adm/hwconfig - Hardware configuration log file.
/usr/adm/lastlog - Log file for each user's last login.
/usr/bin/cpio - Create archives of files.
/usr/bin/last - Indicate last logins of users and teletypes.
/usr/bin/llistat - Administer network interfaces.
/usr/bin/lpstat - Print information about status of print service.
/usr/bin/netstat - Administer network interfaces.
/usr/bin/sar - System activity report.
/usr/bin/swconfig - Report on software changes to the system.
/usr/bin/tar - Create archives of files.
/usr/bin/w - Reports who is on the system and what they are doing.
/usr/lib/acct/lastlogin - Keep record of date user last logged-in.
/usr/spool/lp/pstatus - Printer status information
See also the list of configuration files in chapter 16.
Table 0.1 Files Used In Problem Solving