Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.
Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/
Using "Problem Solving" for this chapter was a very conscious decision. I intentionally avoided calling it "Troubleshooting" for several reasons. First, troubleshooting has always seemed to me to be the process by which we look for the causes of problems. Although that seems like a noble task, so often finding the cause of the problem doesn't necessarily mean finding the means to correct it or understanding the problem.
The next reason is that so often I find books where the troubleshooting section is just a list of problems and canned solutions. I find this comparable to the sign "In case of fire, break glass." When you break the glass an alarm goes off and the fire department comes and puts out your fire. What the real cause is may never be known to you.
The troubleshooting sections that I find most annoying list out 100 problems and 100 solutions, but I usually have problem 101. Often I can find something that is similar to my situation and with enough digging through the manuals and poking and prodding the system I eventually come up with the answer. Even if the answer is spelled out, it's usually a list of steps to follow to correct the problem. There are no details about what caused the problem in the first place or what the listed steps are actually doing.
In this chapter, I am not going to give you a list of known problems and their solutions. The SCO documentation does that for you. I am not going to try to give you details of the system that you need to find the solution yourself. Hopefully, I did that in the first part of the book. What I am going to do here is to talk about the techniques and tricks that I've learned over the years to track down the cause of problems. Also, we'll talk about what you can do to find out where the answer is, if you don't have the answer yourself.
Problem solving starts before you have even installed your system. Since a detailed knowledge of your system is important to figuring out what's causing problems, you need to keep track of your system from the very beginning. One of the most effective problem-solving tools costs about $2 and can be found in grocery stores, gas stations and office supply stores. Interestingly enough, I can't remember ever seeing it in a store that specializes in either computer hardware or software. What I am talking about is a notebook. Although a bound one will do the job, I find a loose leaf one more effective since you can add pages more easily as your system develops.
The notebook should include all the configuration information from your system, the make and model of all your hardware and every change that you make to your system. This is a running record of your system, so each entry should include the date and time as well as the person making the entry. Every change you make, from adding new software to changing kernel parameters, should be recorded in your log.
Don't be terse with comments like, "Changed kernel parameter and relinked." The entry should be detailed, like "Changed DTHASHQS from 100 to 200. Relink successful." Although it seems like busy work, I also believe things like adding users and making backups should be logged. If messages appear on your system, these too should be recorded with details of the circumstance. The installation guide contains an "installation checklist." I recommend completing this before you install and keeping a copy of it in the log book.
Something else that's very important to include in the notebook is a record of the problems you have encountered and the steps that were necessary to correct them. One support engineer at SCO told me he calls this his "solutions notebook."
While you are assembling your system, write down everything you can about the hardware components. If you have access to the invoice, a copy of this can be useful for keeping track of the components. If you have any control over it, get your reseller to include details about the make and model of all the components. I have seen enough cases where the invoice or delivery slip contains generic terms like 486 CPU, cartridge tape drive and 500MB hard disk. Often this doesn't even tell you if the hard disk is SCSI, IDE or what.
Next, write down all the settings of all the cards and other hardware in your machine. The jumpers or switches on hardware are almost universally labeled. This may be something as simple as J3, but as detailed as IRQ. SCO installs at the defaults on a wide range of cards and generally there are few conflicts unless you have multiple cards of the same type. However, the world is not perfect and you may have a combination of hardware that neither I nor SCO has ever seen. Therefore, knowing what all the settings are can become an important issue.
One suggestion is to write this information on gummed labels or cards that you can attach to the machine. This way you have the information right in front of you every time you are working on the machine.
Many companies have a "fax back" service, where you can call a number and have them fax you documentation for their products. For most hardware, this is rarely more than a page or two. However, for something like the settings on a hard disk, this is enough. This has a couple of benefits. First, you have the phone number for the manufacturer of each of your hardware components. The time to go hunting for it is not when your system has crashed. Next, you have (fairly) complete documentation for your hardware. Lastly, by collecting the information on your hardware you know what you have. I can't count the number of times I have talked with customers who don't even know what kind of hard disk they have, let alone the settings.
Another great place to get technical information is the World Wide Web. I recently bought a SCSI hard disk that did not have any documentation. A couple of years ago that might have bothered me. However, when I got home I quickly connected to the Web site of the drive manufacturer and got the full drive specs as well as a diagram of where the jumpers are. If you are not sure of the company name, take a guess like I did. I tried www.conner.com and it worked the first time.
When it gets time to install the operating system, the first step is to read the release notes and installation guide. I am not suggesting reading them cover to cover, but look through the table of contents completely to ensure there is no mention of potential conflicts with your host adapter or the particular way your video card needs to be configured. The extra hour you spend doing that will save you several hours later, when you can't figure out why your system doesn't reboot when you finish the install. Oh, did I mention that you should read the release notes and installation guide? You should. They're very important.
As you are actually doing the installation, the process of documenting your system continues. Depending on what type of installation you choose, you may or may not have the opportunity to see many of the programs in action. If you choose an automatic installation, then many of the programs are run without your interaction, so you never have a chance to see and therefore document the information.
The information you need to document is the same kind of thing we talked about in the section on finding out how your system was configured. This includes the hard disk geometry (dkinit), partitions (fdisk), divisions and filesystems (divvy), the hardware settings (hwconfig) and the kernel parameters (mtune and stune). The output of all of these commands can be sent to a file, which can be printed out and stuck in the notebook.
I don't know how many times I have said it and how many articles it has appeared in (both mine and from others), but some people just don't want to listen. They often treat their computer system like a new toy at Christmas. They first want to get everything installed that is visible to the outside world such as terminals and printers. In this age of net-in-a-box, often that extends to getting their system on the Internet as soon as possible.
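As a rough sketch of sending that output to files, the helper below runs any command and stores what it prints under a dated header. The function name capture and the NOTES_DIR directory are my own illustrative choices, not anything SCO provides.

```shell
# Directory where captured output is kept (hypothetical default).
NOTES_DIR=${NOTES_DIR:-./system-notes}

# capture LABEL COMMAND [ARGS...]
# Run COMMAND and save its output in $NOTES_DIR/LABEL.txt with a dated header.
capture() {
    label=$1; shift
    mkdir -p "$NOTES_DIR" || return 1
    out="$NOTES_DIR/$label.txt"
    {
        echo "# $label recorded $(date) by $(id -un)"
        echo "# command: $*"
        "$@" 2>&1
    } > "$out"
    # Print the name of the file so it can be sent to the printer.
    echo "$out"
}
```

Something like capture hwconfig hwconfig would then leave a file ready to print for the notebook. Which arguments the other tools need in order to report rather than prompt is worth checking in their man-pages first.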
Although being able to download the synopsis of the next Deep Space Nine episode is an honorable goal for some, Chief O'Brien is not going to come to your rescue when your system crashes. (I think even he would have trouble with the antiquated computer systems of today)
Once you have finished installing the operating system, the very first device you need to get installed and configured correctly is your tape drive. If you don't have a tape drive, buy one! Stop reading right now and go out and buy one. It has been estimated that a down computer system costs a company, on the average, $5000.00 an hour. You can certainly convince your boss that a tape drive costing one-tenth as much is a good investment.
One of the first crash calls I got while I was in SCO Support was from the system administrator at a major airline. After about 20 minutes, it became clear that the situation was hopeless. I had discussed the issue with one of the more senior engineers who determined that the best course of action was to reinstall the OS and restore the data from backups.
"What backups?", I can still remember their system administrator say. "There are no backups."
"Why not?" I asked.
"We don't have a tape drive."
"My boss said it was too expensive."
At that point the only solution was a data recovery service.
"You don't understand," he said, "there is over $1,000,000 worth of flight information on that machine."
"Not any more."
What is that lost data worth to you? Even before I started writing this book, I bought a tape drive for my home machine. For me it's not really a question of data, but rather time. I don't have that much data on my system. Most of it can fit on a half-dozen floppies. This includes all the configuration files that I have changed since my system was installed. However, if my system were to crash, the time I would save restoring everything from tape, as compared to reinstalling from floppies, is worth the money I spent.
The first thing to do once the tape drive is installed is to test it. The fact that it appears at boot says nothing about its functionality. It has happened enough that it appears to work fine, all the commands behave correctly and it even looks as if it is writing to the tape. However, it is not until the system goes down and the data is needed that you realize you cannot read the tape.
I suggest first trying the tape drive out by backing up a small sub-directory such as /etc/default. There are enough files to give the tape drive a quick workout, but you don't have to wait for hours for it to finish. Once you have verified that the basic utilities work (like tar or cpio), then try backing up the entire system. If you don't have some third party backup software, I recommend that you use cpio. Although tar can back up most of your system, it cannot back up device nodes. Don't use something like dump, as this simply makes an image of your filesystem and getting back individual files is next to impossible.
I personally use a "super-tar" product on my system. These come from third party vendors and not only provide a very usable interface, they often do bit level verification and provide complex inclusion and exclusion mechanisms. The discussion of which one of these super-tar products to use is almost religious. Like religion, it's a matter of personal preference. I use Cactus' Lone-Tar because I have a good relationship with the company president, Jeff Hyman, who pops up regularly in the SCO Forum on CompuServe. Lone-Tar makes backups easy to make and easy to restore. A working demo of Lone-Tar (as well as other super-tar products) is available from the SCO libraries on CompuServe.
After you are sure that the tape drive works correctly, then you should create a boot/root floppy. A boot/root floppy is a pair of floppies that you use to boot your system. The boot floppy contains the files necessary to boot and the root floppy contains the root filesystem.
Creating a boot/root floppy can be done by either using mkdev fd or sysadmsh on ODT or the Floppy Filesystem Manager on OpenServer. Here you want to create both a boot and a root floppy. The primary reason for making this just after you install the tape drive and before anything else is simply a matter of space. If you have installed too much, the kernel will not be able to fit on the floppy. Once you've made the boot/root floppy set, test it. Also test the tape drive from the floppy. Although not that common, I have seen cases where the major and minor number on the floppy for the tape drive was incorrect.
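One way to make that first test more than a visual check is to read the backup back and compare it with the original. The sketch below does this with tar. The function name test_backup is hypothetical, and the second argument stands in for whatever tape device your drive uses (here any file name will do).

```shell
# test_backup SOURCE_DIR ARCHIVE
# Write SOURCE_DIR to ARCHIVE, read it back into a scratch directory and
# compare the result with the original. Succeeds only if they match.
test_backup() {
    src=$1 archive=$2
    scratch=$(mktemp -d) || return 1
    # Write the backup with relative paths so it can be restored anywhere.
    (cd "$src" && tar cf - .) > "$archive" || return 1
    # Read it back, which is the step that proves the media is usable.
    (cd "$scratch" && tar xf -) < "$archive" || return 1
    diff -r "$src" "$scratch" > /dev/null
    status=$?
    rm -rf "$scratch"
    return $status
}
```

Run as root, test_backup /etc/default followed by the name of your tape device gives a quick pass or fail before you commit to backing up the whole system.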
Now that you are sure that your tape drive and your boot/root floppy set work, you can begin installing the rest of your software and hardware. My preference is to completely install the rest of the software first. This includes other SCO products. There is less to go wrong with the software (at least little that keeps the system from booting) and you can, therefore, install several products in succession. When installing hardware, you should install and test each component before you go on to the next one.
I think it is a good idea to make a copy of your link kit before you make any changes to your hardware configuration. That way you can quickly restore the entire directory and don't have to worry about restoring from tape. There is a good example of how to copy an entire directory tree on the cpio(C) man-page. Use that example to copy the entire /etc/conf sub-directory. I suggest using a name that is clearer than /etc/conf.bak. Six months after you create it, you have no idea how old it is or whether the contents are still valid. If you name it something like /etc/conf.06AUG95, then it is obvious when it was created.
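As a minimal sketch of the dated copy, the function below builds the suffix from today's date and copies the tree next to the original. It uses a tar pipe rather than the cpio example from the man-page, and the name dated_copy is my own.

```shell
# dated_copy DIRECTORY
# Copy DIRECTORY to DIRECTORY.DDMONYY (for example conf.06AUG95) and print
# the name of the copy. Refuses to overwrite an existing copy.
dated_copy() {
    src=${1%/}
    stamp=$(LC_ALL=C date +%d%b%y | tr '[:lower:]' '[:upper:]')
    dest="$src.$stamp"
    if [ -e "$dest" ]; then
        echo "$dest already exists" >&2
        return 1
    fi
    mkdir "$dest" || return 1
    # The p flag asks tar to keep the original permissions on extraction.
    (cd "$src" && tar cf - .) | (cd "$dest" && tar xpf -) || return 1
    echo "$dest"
}
```

Calling dated_copy /etc/conf before each round of changes leaves a trail of copies whose age is obvious from the name.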
Now, make the changes and test the new kernel. After you are sure that the new kernel works correctly, then make a new copy of the link kit and make more changes. Although this is a slow process, it does limit the potential for problems, plus if you do run into problems, you can easily back out of it by restoring the backup of the link kit.
As you are making the changes, remember to record all the hardware and software settings for anything you install. Although you can quickly restore the previous copy of the link kit if something goes wrong, writing down the changes can be helpful if you need to call tech support.
Once the system is configured the way you want, make a backup of the entire, installed system on a different tape than just the base operating system. I like to have the base operating system on a separate tape in case I want to do some major revisions to my software and hardware configuration. That way, if something major goes wrong, I don't have to pull out pieces, hoping that I didn't forget something. I have a known starting point that I can build from.
At this point you should come up with a backup schedule. The System Administrator's Guide provides you with some guidelines on this. Keep in mind that these are "guidelines." The information provided there is not a list of unchangeable rules. You should back up as often as necessary. If you can only afford to lose one day's worth of work, then backing up every night is fine. Some people back up once during lunch and once at the end of the day. More often than twice a day may be too great a load on the system. If you feel like you have to do it more often, you might want to consider disk mirroring or some other level of RAID. See the section on filesystems for a discussion of the SCO implementation.
The type of backup you do is dependent on several factors. If it takes ten tapes to do a backup, then doing a full backup of the system (that is, backing up everything) every night is difficult to swallow. (You might consider getting a larger tape drive.) In a case where a full backup every night is not possible, there are two alternatives.
First, there are incremental backups. These start with a master, which is a backup of the entire system. Then the next backup only records the things that have changed since the last backup. This can be expanded to several levels. Each level backs up everything that has changed since the last backup of that or the next lower level.
For example, level 2 backs up everything since the last level 2, the last level 1 or the last level 0 (whichever is most recent). You might do a level 0 once a month (which is a full backup of everything), then a level 1 every Wednesday and Friday and a level 2 every other day of the week. Therefore, on Monday, the level 2 will back up everything that has changed since the level 1 on Friday. The level 2 on Tuesday will back up everything since the level 2 on Monday. Then on Wednesday, the level 1 backs up everything since the level 1 on the previous Friday.
At the end of the month, you do a level 0, which backs up everything. Let's assume this is on a Tuesday. This would normally be a level 2. The level 1 on Wednesday backs up everything since the level 0 (the day before) and not since the level 1 on the previous Friday.
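The rule for which earlier backup a new one is measured against can be written down in a few lines. This is only a sketch of the rule as described here (the most recent backup of the same or a lower level), and the function name is hypothetical.

```shell
# reference_backup NEW_LEVEL PREVIOUS_LEVELS...
# PREVIOUS_LEVELS lists the levels of the earlier backups, oldest first.
# Prints the position (counting from 1) of the backup that the new one is
# taken relative to, or 0 if there is none and a full backup is needed.
reference_backup() {
    new_level=$1; shift
    pos=0
    ref=0
    for level in "$@"; do
        pos=$((pos + 1))
        # A backup of the same or a lower level becomes the new reference.
        if [ "$level" -le "$new_level" ]; then
            ref=$pos
        fi
    done
    echo "$ref"
}
```

With Friday's level 1 followed by level 2 backups on Monday and Tuesday, reference_backup 1 1 2 2 points back to the first entry, Friday. If Tuesday was the monthly level 0 instead, reference_backup 1 1 2 0 points to the third entry, the level 0 from the day before.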
A somewhat simpler scheme uses differential backups. Here, there is also a master. However, subsequent backups will record everything that has changed (is different) from the master. If you do a master once a week and differentials once a day, then something that gets changed on the day after the master is recorded on every subsequent backup.
A modified version of the differential backup does a complete, level 0 backup on Friday. Then on each of the other days, a level 1 is done. Therefore, the backups Monday through Thursday will each back up everything since the day before. This is easier to maintain, but you may have to go through 5 tapes to restore.
The third type is the simplest: you do a master backup every day and forget about increments and differences. This is the method I prefer since you save time when you have to restore your system. With either of the other methods, you will probably need to go through at least two tapes to recover your data, unless the crash occurs on the day after the last master. If you do a full backup every night, then there is only one backup to load. If the backup fits on a single tape (or at most 2), then I highly recommend doing a full backup every night. Remember that the key issue is getting your people back to work as soon as possible. The average $5000 per hour you stand to lose is much more than the cost of a large (8GB) tape drive.
This brings up another issue and that is rotating tapes. If you are making either incremental or differential backups, then you must have multiple tapes. It is illogical to make a master, then make an incremental on the same tape. Once the master is overwritten, there is no way to get the information from it.
If you make a master backup on the same tape every night, you can run into serious problems, as well. What if the system crashes in the middle of the backup and trashes the tape? Your system is gone and so is the data. Also if you discover after a couple of days that the information in a particular file is garbage and the master is only one day old, then it is worthless for getting the data back. Therefore, if you do full backups every night, use at least five tapes, one for each day of the week. (If you run seven days a week, then seven tapes is a good idea.)
Although most people get this far in thinking about tapes, many forget about the physical safety of the tapes. If your computer room catches fire and the tapes melt, then the most efficient backup scheme is worthless. Some companies have fireproof safes that they keep the tapes in. In smaller operations, the system administrator can bring the tape home from the night before. This is normally only effective when you do masters every night. If you have a lot of tapes, you might consider companies that provide off-site storage facilities.
Have you ever tried to do something and it didn't behave the way you expected it to? You read the manual and typed in the example character for character only to find it doesn't work right. Your first assumption is that the manual is wrong, but rather than calling in a bug to SCO Support, you try the command on another machine and to your amazement, it behaves exactly as you expect. The only logical reason is that your machine has gone insane.
Well, at least that's the attitude I have had on numerous occasions. Although this personification of the system helps relieve stress sometimes, it does little to get to the root of the problem. If you want, you could check every single file on your system (or at least the ones related to your problem) and ensure that permissions are correct, the size is right and that all the support files are there. Although this works in many cases, often what programs and files are involved are not easy to figure out.
Fortunately help is on the way. SCO provides several useful tools to not only check the sanity of your system, but to return it to normal. The first set of tools we already talked about. These are the monitoring tools such as ps, crash, sar, and vmstat. Although these programs cannot correct your problems, they can indicate where problems lie.
If the problem is the result of a corrupt file (either the contents are corrupt or the permissions are wrong), the system monitoring tools cannot help much. However, there are several tools that specifically address different aspects of your system.
For starters, let's take the issue of incorrect permissions. Under both ODT and OpenServer, there are two options: fixperm and fixmog. The advantage that fixperm has is that it can not only tell you about permissions problems, but it can also tell you the install status of the different packages as well as create missing device nodes and directories. On the other hand, fixmog is fast since it is designed to fix security related problems. Therefore, discrepancies in files such as vi are ignored.
Fixperm uses what are referred to as permissions lists. These are represented by the files in /etc/perms. In general, each file represents a single product. If you have TCP/IP, for example, the files contained in the run-time TCP/IP product would be found in /etc/perms/tcprt. If you had the development system for TCP/IP, this would be represented by the file /etc/perms/tcpds. The exception is the operating system itself.
As I mentioned in an early chapter, each product can be broken down into packages. These packages appear in different sections of the permissions lists. Each file within that package is shown on a separate line, like this:
RTS f644 root/sys 1 ./etc/gettydefs B02
The fields are: package (here RTS for run-time system), type of file and mode (here f for regular file, and permission 644), user/group (root/sys), links (1), path (./etc/gettydefs), volume (B02). Note that the volume was originally designed to tell you which floppy disk this file was on. However, if you have a tape or CD-ROM installation, then everything is on one volume.
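To make the field layout concrete, here is a small sketch that splits one such line into the fields just listed. It assumes the simple six-field form of the example line and nothing more. The function name is hypothetical.

```shell
# parse_perms_line "LINE"
# Split a six-field permissions list entry into labeled values.
parse_perms_line() {
    # Word splitting on whitespace gives the six fields.
    set -- $1
    package=$1 typemode=$2 owner_group=$3 links=$4 path=$5 volume=$6
    # The second field is the file type letter followed by the mode.
    ftype=$(printf '%s' "$typemode" | cut -c1)
    mode=$(printf '%s' "$typemode" | cut -c2-)
    # The third field is owner and group separated by a slash.
    owner=${owner_group%%/*}
    group=${owner_group#*/}
    echo "package=$package type=$ftype mode=$mode owner=$owner group=$group links=$links path=$path volume=$volume"
}
```

Given the gettydefs entry shown earlier, this prints the package as RTS, the type as f, the mode as 644, the owner as root and the group as sys.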
If you are a new administrator, you may not know what kind of media you installed on. To find this out, look in /etc/perms/bundle/odtps and find the "mediatype" entry.
If we have a file with more than one link (for example /usr/bin/mail), rather than individual entries for each link, there is a single entry like the one we have and then the name of each additional link is listed. So, the entry for /usr/bin/mail would look like this:
As it's making its checks, fixperm sees that /usr/bin/mail has two more links and can immediately check them. Since they are links, they are the same file and have the same permissions, owner, group, etc., so fixperm only needs to ensure that they all exist and check the permissions on one of them. If it corrects the permissions on one of them, it fixes them for all. For more details on the different perms files, options, etc., see the fixperm(ADM), custom(ADM) and perms(F) man-pages.
Like fixperm, fixmog has its own database of information. This is the file /etc/auth/system/files, which represents the File Control Database. (See the section on security for more information.) Here we have the same basic information as in the permissions list. However, we are not concerned with packages, volumes or even links. All we are concerned with is access permissions, owner, group and what type of file it is. In addition to fixmog, you can also use cps. This works on a single file and not on the entire File Control Database as fixmog does.
One major problem that both fixperm and fixmog have is that they only check for the existence of the file, as well as file attributes such as permissions and owner, but not the size or checksum of the file. This is a major issue when files become corrupt. The permissions may be correct and even the size might match, however if the file is corrupt, then the checksum is most likely going to be wrong.
SCO provides a utility to compute a checksum on a file, called sum. It provides three ways of determining the sum. The first is with no options at all, which reports a 16-bit sum. The next way uses the -r option, which again provides a 16-bit checksum, but uses an older method to compute the sum. In my opinion, this method is more reliable since the byte order is important as well. Without the -r, a file containing the word "housecat" would have the same checksum if you changed that single word to "cathouse." Although both words have the exact same bytes, they are in a different order and give a (slightly) different meaning.
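The difference is easy to see with two simplified checksums written out by hand. These are illustrations of the two ideas (plain addition versus rotating before each addition), not the exact code inside sum, and the function names are my own.

```shell
# additive_sum FILE
# Add up the bytes of FILE, keeping 16 bits. Byte order makes no difference.
additive_sum() {
    total=0
    for b in $(od -An -v -tu1 "$1"); do
        total=$(( (total + b) & 65535 ))
    done
    echo "$total"
}

# rotating_sum FILE
# Rotate the 16-bit total right by one bit before adding each byte, so the
# position of every byte affects the result.
rotating_sum() {
    total=0
    for b in $(od -An -v -tu1 "$1"); do
        total=$(( (total >> 1) + ((total & 1) << 15) ))
        total=$(( (total + b) & 65535 ))
    done
    echo "$total"
}
```

Put "housecat" in one file and "cathouse" in another and additive_sum reports the same value for both, while rotating_sum tells them apart.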
Because of the importance of the file's checksum, I created a shell script while I was in SCO support that was run on a freshly installed system. As it ran, it would store in a database all the information provided in the permissions lists, plus the size of the file (from an ls -l listing), the type of file (using the file command) and the checksum (using sum -r). If I was on the phone with a customer and things didn't appear right, I could do a quick grep of that file name and get the necessary information. If they didn't match, then I knew something was out of whack.
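A much reduced sketch of the same idea follows. It records only the path, size and checksum, and it uses cksum rather than sum -r so that one command supplies both numbers. The function names are hypothetical, and file names containing spaces are not handled.

```shell
# make_baseline DIRECTORY
# Print one "path size checksum" line for every regular file under DIRECTORY.
make_baseline() {
    find "$1" -type f -print | sort | while read -r path; do
        # cksum prints the checksum first and the size second.
        set -- $(cksum < "$path")
        echo "$path $2 $1"
    done
}

# verify_file DATABASE PATH
# Succeed only if PATH still has the size and checksum recorded in DATABASE.
# Returns 2 if PATH has no entry at all.
verify_file() {
    db=$1 path=$2
    expected=$(awk -v p="$path" '$1 == p' "$db")
    [ -n "$expected" ] || return 2
    set -- $(cksum < "$path")
    [ "$expected" = "$path $2 $1" ]
}
```

Run make_baseline on a freshly installed system, keep the output somewhere safe, and verify_file can later answer whether a suspect file is still the one that was installed.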
Unfortunately for the customer, much of the information that my script and database provided was something that they didn't have access to. Now, each system administrator could write a similar script and call up that information. However, most administrators do not consider this issue until it's too late. OpenServer corrected much of that problem with the introduction of the Software Manager.
Not only can the Software Manager check for the existence of files, but it can also verify the checksum of the files. Many things can be corrected automatically, but some require that you explicitly request the Software Manager to make the corrections. For example, let's assume that a fat fingered system administrator removed /usr/bin/mail. The Software Manager would tell you that the file was missing, but would not automatically restore the link until you told it to fix the discrepancies.
If the file that we, as users, access is missing or corrupt (such as /usr/bin/mail), then this method works fine. However, if the file contained within the SSO is missing or corrupt, then the situation is more serious. This is the same situation you would have in ODT without the Software Manager. You need the installation media. Although it's nice to have the Software Manager tell you it can't fix the problem and the user must do it by hand, as of this writing there is nothing in the help files directly accessed from the verify portion of the Software Manager that tells you what to do when something is missing.
ODT was nice in that the same program that told you what file (or files) was missing also allowed you to reinstall that missing file. Some might be saying that the Software Manager tells you what's missing and also lets you reinstall, so it does the same thing, right? Unfortunately not. Although substantially more powerful in many regards than custom and fixperm in ODT, OpenServer's Software Manager has broken some of the basic functionality. The primary example is reinstallation of a single file. This also extends to groups of files, even those that are completely unrelated, which were both possible in ODT.
A solution would be to be able to reinstall the public portions of a particular SSO/product. That way you replace the binaries without touching the data files. Unfortunately, that avenue has been blocked. You can no longer reinstall a package, like you could in ODT. Further, you cannot select from the list of packages on the system (i.e., within an SSO) to say you want to install that package new (after you have removed it). Instead you must first read the installation media to get a list of software to install. If you have another machine in the network with OpenServer, you can do a network install of that package, which is a fair bit faster.
However, this doesn't really mean that the only way to get back single files is reinstalling the package. Fortunately, the SCO engineers are not that vicious. They provided for us the customextract utility, whose purpose is to extract files from an "SSO distribution source." What this means is that it can pull off individual files from tape, CD or whatever you used to install. The basic, and most commonly used, syntax is:
customextract -m <device> <SSO_path_name>
Where <device> is the device where the media resides and <SSO_path_name> is the path name within the SSO, not the path we are used to seeing. For example, to restore /usr/bin/vi from a CD, the command might look like this:
customextract -m /dev/rcdt0 /opt/K/SCO/Unix/5.0.0a/usr/bin/vi
Note that the files are restored based on your current working directory. Therefore, you might want to consider first changing directories to / . If you want to extract the files first and copy them into their proper location by hand, you can change directories into /tmp. In addition, you can specify a list of files to extract using the -f option followed by the name of the file containing the list.
You can also use custom in OpenServer to verify your software as well as correct certain problems. These are the same kinds of problems that can be corrected using the Software Manager. To be able to use this functionality, you have to be familiar with the way SSOs are put together. If you are still having trouble, we talked about SSOs in the section on SCO basics.
An example of using custom on OpenServer to verify and correct the Run-Time System (RTS) might look like this:
custom -v SCO:Unix:RTS -x
This says I want to verify (-v) the RTS package of the UNIX product from the manufacturer SCO. The package name can be found in the SSO in the directory /var/opt/K/<vendor>/<product>/<release>/.softmgmt in the files with the .fl ending. For example, the file that the above command is reading is /var/opt/K/SCO/Unix/5.0.0Cd/.softmgmt/RTS.fl. These have a slightly different format than the old perms list, but are fairly straightforward to read. See the custom(ADM) man-page for more details.
Also new to OpenServer is the customquery command. This is a very useful tool for not only finding out what is installed, but also what versions. The basic syntax is:
customquery function options
where the functions include ListComponents, ListPackages, ListFiles and ListDescriptions. See the customquery(ADM) man-page for more details.
As we have just seen, there is a way to correct problems when commands, utilities, and other applications are corrupt. If a data file is corrupt, that's a different story entirely. In most cases, it is impossible for the operating system to know what is valid data and what is not. Therefore, it cannot be expected to be able to correct such data corruption.
If you do have corruption in an application's data file, you need to turn to the application vendor for possible means of correcting the problem. Well, what if that vendor is SCO? If there was some corruption in a data file used by an SCO program, they would be the ones who would know how to correct it. In many cases, this is impossible. There is no way the system could correct problems, let's say, in the nameserver data files. There is a certain format the files need to follow, but the name server relies on human intervention to ensure that these files are created correctly. Since there is nothing to compare these files to (no reality check), there is little that can be done to correct the problem.
Fortunately, in the case of the TCB, there is such a reality check. This is the authck utility. Not only does it understand the formats of the different files and identify problems, it can also correct many of them. So important is the consistency of the TCB that the system runs authck every time you boot. This is done by the shell script /etc/authckrc, which is started by init from /etc/inittab.
The authck utility needs to be run either by root or by some other user with the auth subsystem privilege. Note that the chown kernel privilege is also necessary if you want to correct the problems that authck discovers. There are options to check each of the primary TCB databases. The -p option checks the Protected Password Database, the -s option checks the Protected Subsystem Database and the -t option checks the Terminal Control Database.
If you plan on checking more than one database at a time, I recommend either using the applicable options together (e.g., -ps) or using the -a option to check all databases. This is a lot quicker than checking each database individually, as the databases do not need to be reloaded every time. If you want, you can have authck automatically correct the problems it finds by using the -y option, or you can tell it to correct nothing by using the -n option. Also useful is the -v option. This verbosely outputs all the problems that authck finds, whether you tell it to correct them or not.
We now get to the "sanity checker" that perhaps most people are familiar with: fsck, the filesystem checker. Anyone who has lived through a system crash or had the system get shut down improperly has seen fsck. One unfamiliar aspect of fsck is the fact that it is actually several programs, one for each of the different filesystems. This is done because of the complexities of analyzing and correcting problems on each filesystem. As a result of these complexities, very little of the code can be shared. What can be shared is found within the /etc/fsck program.
When it runs, fsck determines what type of filesystem you want to check and runs the appropriate command. The "real" fsck command, as well as many other commands, is found in /etc/fscmd.d, where each of the sub-directories is named for the filesystem type that the commands are used on. Here you find a whole set of commands that are used to access and manipulate a filesystem.
If you look, you will see that there are sub-directories for, among others, ISO-9660 and ROCKRIDGE filesystems. Some of you may know that these are filesystems found on CD-ROMs. CD-ROMs are read-only filesystems. What's the point of running fsck on a filesystem that is read-only? Even if it were corrupt, there is no way to correct things. What purpose does it serve to have fsck even look at these filesystems?
Well, there is no point. That is the point. So much so that there is no fsck program for these filesystems. If you looked in /etc/fscmd.d/ISO9660 you would not see an fsck program. In fact, there are only two programs there: fstyp and mount. These are actually links to the same files for the other CD-ROM filesystem: RCKRDG (Rockridge).
Regardless of what kind of filesystem you are checking, fsck is a very complex program because the structure it is cleaning and trying to correct is very complex. Depending on the filesystem type you are checking, fsck goes through up to eight different phases. For HTFS, EAFS, AFS, and S51K filesystems the phases are the same, but for DTFS filesystems they are different. Therefore, we are going to first talk about the phases for HTFS, EAFS, AFS, and S51K.
Phase 0 is new to OpenServer. It is during this phase that the intent log is replayed, if intent logging has been enabled. Unless you specify a full check, outstanding transactions are completed and the filesystem is marked as clean. Since the filesystem is clean at this point, there is no need to check further and fsck exits.
In phase 1 of a full check, fsck checks the inode table. Part of what is done is comparing the size of the file to the number of blocks allocated on the hard disk. In addition, fsck checks to ensure that the number of links to each file is not zero. A link count of zero could mean that someone had removed the file and the system did not have time to update the inode table before the system went down. Here, too, fsck checks the sanity of the disk blocks. If one of the data blocks pointed to in the inode is outside of the boundaries of the filesystem, then fsck knows that something is wrong. The result is that fsck removes the incorrect information, which means that the file is removed as well.
Another key aspect of phase 1 is searching for duplicate blocks. This is where more than one inode points to the same data block on the hard disk. Don't think that this is what a link is. A link is a file name that points to the same inode. Here we have multiple inodes that point to the same data. This is not supposed to happen. Once the situation is corrected, fsck checks again to see if there are more duplicate blocks. This is phase 1b.
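The two block checks from phase 1 can be sketched in a few lines of Python. This is a toy model only: the inode table is a plain dictionary and the function name phase1_check is made up for the illustration. The real fsck works on the on-disk structures and is a great deal more involved.

```python
from collections import defaultdict


def phase1_check(inodes, fs_blocks):
    """Toy version of the phase 1 block checks.

    inodes    -- {inode number: [block numbers]} (a made-up in-memory model)
    fs_blocks -- total number of blocks in the filesystem
    Returns (out_of_range, duplicates).
    """
    out_of_range = {}            # inode -> blocks outside the filesystem
    claimed = defaultdict(list)  # block -> inodes that claim it

    for ino, blocks in sorted(inodes.items()):
        bad = [b for b in blocks if not 0 <= b < fs_blocks]
        if bad:
            out_of_range[ino] = bad
        for b in blocks:
            if 0 <= b < fs_blocks:
                claimed[b].append(ino)

    # A block claimed by more than one inode is a duplicate block.
    duplicates = {b: owners for b, owners in claimed.items() if len(owners) > 1}
    return out_of_range, duplicates


inodes = {2: [10, 11], 5: [11, 12], 7: [40, 9999]}
bad, dups = phase1_check(inodes, fs_blocks=1000)
print(bad)   # inode 7 points outside the filesystem
print(dups)  # block 11 is claimed by inodes 2 and 5
```

Note that a link never shows up here, since two names for the same file share one inode and therefore one block list.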
Phase 2 is used to clean up what was found in phase 1. For example, if we find there are files with duplicate blocks, fsck has no choice but to remove both files. This is accomplished in phase 2. Here, too, fsck cleans up directory entries that point either to inodes that are empty or to ones that are non-existent. Non-existent inodes occur when the inode number is larger than the maximum possible on the filesystem or is negative. You have only as many inodes as were allocated when the filesystem was created.
In phase 3, file connectivity is checked. For example, let's say an inode exists and points to valid data but there is no directory entry on the system for it. This is referred to as an unreferenced file. Something needs to get done with it, so it is placed in the /lost+found directory. This is normally in the root directory of the filesystem. It is then given a new name, which is simply its inode number.
This phase brings up a few interesting issues. First, if it is a directory that has no entry in some other (parent) directory, it too will be placed in lost+found. Now there is a sub-directory of /lost+found with its name being the inode of that sub-directory. However, the contents of the directory "file" are intact. Therefore, the file name-inode pairs in this directory are intact. You can cd into that directory and see all the file names as if nothing had ever happened. I have seen it where such a directory was /bin (or was it /usr/bin?). All that had to be done was to rename it /bin (or /usr/bin) and things were back to normal.
Next, there is a limit to the number of unreferenced files that can be recovered on your system. Remember that each directory is simply a file in a specific format. All that is done during this phase of fsck is that the directory file is being filled with the names of the "lost" files. As I mentioned in the section on filesystems, the size of the lost+found directory does not change during fsck. Therefore, you run into problems if a lot of unreferenced files are encountered. See the section on filesystems for more details.
If the unreferenced file is empty, there is no logic in placing it in lost+found. Therefore, when fsck encounters an unreferenced, zero-length file it will prompt you to clear the inode (unless you used the -y option, in which case it is cleared automatically). This can be very upsetting to some administrators because often this phase reports a large number of unreferenced, zero-length files. No worries. These are usually only unnamed pipes that have been created on the filesystem. (The HPPS does this all in memory and avoids this issue.)
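The decisions made for unreferenced files can be modeled roughly as follows. Again, this is a sketch under simplifying assumptions: the inode table and the directory entries are ordinary Python objects, the function name is hypothetical, and the fixed size of lost+found is ignored.

```python
def phase3_reconnect(inode_sizes, directory_entries):
    """Toy version of the phase 3 connectivity check.

    inode_sizes       -- {inode number: file size in bytes} for allocated inodes
    directory_entries -- (name, inode number) pairs collected from all directories
    Returns (lost_found, cleared).
    """
    referenced = {ino for _name, ino in directory_entries}
    lost_found = {}  # new name in lost+found -> inode number
    cleared = []     # unreferenced, zero-length inodes that are simply cleared

    for ino in sorted(inode_sizes):
        if ino in referenced:
            continue
        if inode_sizes[ino] == 0:
            cleared.append(ino)          # nothing worth saving
        else:
            lost_found[str(ino)] = ino   # the new name is the inode number

    return lost_found, cleared


sizes = {2: 512, 12: 2048, 13: 0, 14: 100}
entries = [(".", 2), ("passwd", 12)]
print(phase3_reconnect(sizes, entries))
```

Here inode 14 has data but no name, so it lands in lost+found as "14", while the empty inode 13 is cleared.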
In phase 4, fsck checks the link count of files. For example, if there are only two directory entries that reference a particular inode, but the link count in that inode is 3, then fsck needs to correct it.
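The link count check amounts to counting names and comparing. A minimal sketch, using the same made-up model of directory entries as plain (name, inode) pairs:

```python
from collections import Counter


def phase4_link_counts(stored_counts, directory_entries):
    """Compare the link count stored in each inode with the names found.

    stored_counts     -- {inode number: link count recorded in the inode}
    directory_entries -- (name, inode number) pairs from all directories
    Returns {inode number: (stored, actual)} for every mismatch.
    """
    actual = Counter(ino for _name, ino in directory_entries)
    return {
        ino: (stored, actual[ino])
        for ino, stored in stored_counts.items()
        if actual[ino] != stored
    }


stored = {20: 3, 21: 1}
entries = [("report", 20), ("report.bak", 20), ("notes", 21)]
print(phase4_link_counts(stored, entries))  # inode 20 says 3, but only 2 names exist
```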
During phase 5, fsck examines the free-block list to resolve any missing or unallocated blocks. Once all inconsistencies have been corrected, fsck rebuilds the free list in phase 6.
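Conceptually, rebuilding the free list means that every block not claimed by some inode must be free. The sketch below shows only that idea. It assumes a simple list of free block numbers, which is not how the list is actually laid out on disk, and the function name is hypothetical.

```python
def rebuild_free_list(fs_blocks, inodes, first_data_block=0):
    """Return the blocks that no inode claims.

    fs_blocks        -- total number of blocks in the filesystem
    inodes           -- {inode number: [block numbers]} (toy model)
    first_data_block -- blocks below this are reserved and never free
    """
    used = set()
    for blocks in inodes.values():
        used.update(blocks)
    return [b for b in range(first_data_block, fs_blocks) if b not in used]


print(rebuild_free_list(8, {2: [0, 1], 5: [4]}))
```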
As I mentioned before, fsck checks and repairs a DTFS in its own way. Because of the nature of the filesystem, there are only four phases. In phase 1, fsck reads the inode bitmap and initializes the block bitmap. In phase 2, the inodes are "validated." Part of this process is to ensure that the B-tree structure of each file is maintained, so that the tree remains balanced.
In phase 3, fsck rebuilds the directory structure. Remember from our discussion on filesystems that the inode of each file contains a pointer to the parent directory. Using this, fsck can easily rebuild the directory structure, and there is no need for a lost+found directory since files are not lost. If the inode is trashed, then the disk blocks exist without an inode. In that case, there is no way to rebuild them into the original file, so the blocks are simply returned to the free list and the block bitmap is updated. In phase 4, the superblock is updated.
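To see why a parent pointer makes lost+found unnecessary, consider this toy reconstruction. The record layout is an assumption made for the sketch: each inode is modeled as a (name, parent inode) pair, and I do not claim that this is how DTFS stores the name.

```python
from collections import defaultdict


def rebuild_directories(inodes):
    """Rebuild directory contents from parent pointers alone.

    inodes -- {inode number: (name, parent inode number)}; the root
              directory is modeled as being its own parent.
    Returns {directory inode: {name: inode number}}.
    """
    tree = defaultdict(dict)
    for ino, (name, parent) in sorted(inodes.items()):
        if ino != parent:  # skip the root's reference to itself
            tree[parent][name] = ino
    return dict(tree)


inodes = {2: ("/", 2), 10: ("etc", 2), 11: ("passwd", 10)}
print(rebuild_directories(inodes))
```

Every surviving inode knows where it belongs, so nothing is left over to be parked under an inode-number name.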
You may find yourself cleaning a large filesystem where all the necessary tables cannot fit into memory. As a result, fsck requires a scratch file to store the tables. This can be a real file on some other filesystem, or you can define a separate filesystem just for scratch. If you know that the filesystem is too large, then you can specify the scratch file on the command line with the -t option. If you don't, fsck will recognize the need for a scratch file and prompt you. If you don't have a special scratch device, you can simply use /dev/swap.
System problems fall into several categories. The first is difficult to describe and even more difficult to track down. For lack of a better word, I am going to use the word "glitch." These are problems that occur infrequently and in circumstances that are not easily repeated. These can be caused by anything from users with fat fingers to power fluctuations that change the contents of memory.
Next are special circumstances in software that are detected by the CPU while in the process of executing a command. We discussed these briefly in the section on kernel internals. These are traps, faults, and exceptions. Many of these events are normal parts of system operation and are, therefore, expected. This includes such things as page faults. Other events, like following an invalid pointer, are unexpected and will usually cause the process to terminate.
What if it is the kernel that causes either a trap, fault, or exception? As I mentioned in the section on kernel internals, there are only a few cases when the kernel is allowed to do this. If this is not one of those cases, the situation is deemed so serious that the kernel must stop the system immediately to prevent any further damage. This is a panic.
When the system panics, it uses its last dying breath to run a special routine that prints the contents of the internal registers onto the console and dumps the contents of RAM onto the swap device. At the end, it will call the kernel function haltsys(), which stops the system.
Despite the way it sounds, if your system is going to go down, this is the best way to do it. The rationale behind that statement is that when the system panics in this manner, there is a record of what happened. First, there is the dump image on the swap device. Second, there is the register dump on the console screen. Both of these are essential pieces of information when your system goes down.
If the power goes out on the system, then it is not really a system problem, in the sense that it was caused by an outside influence. The same applies to someone pulling the plug or flipping the circuit breaker (which my father-in-law did to me once). Although this kind of problem can be remedied with a UPS, the first time the system goes down before the UPS is installed can make you question the stability of your system. There is no record of what happened and unless you know the cause was a power outage, it could have been anything.
Another annoying situation is when the system just "hangs." That is, it stops completely and does not react to any input. This could be the result of a bad hard disk controller, bad RAM, or an improperly written or corrupt device driver. Since there is no record of what was happening, trying to figure out what went wrong is extremely difficult, especially if the hang is very sporadic.
Since a system panic is really the only time we can easily track down the problem, I am going to start there. The first thing to think about is the fact that as the system is going down it does two things: it writes the registers to the console screen and writes a memory image to the dump device. The fact that it does so as it's dying makes me think that this is something important. Which it is.
The first thing to look at is the instruction pointer. This is actually composed of two registers: the CS (code segment) and EIP (instruction pointer) registers. Together they identify the instruction that the kernel was executing at the time of the panic. By comparing the EIP of several different panics, you can make some assumptions about the problem. For example, if the EIP is consistent across several different panics, the system was executing the same piece of code every time it panicked. This usually indicates a software problem.
On the other hand, if the EIP changes from panic to panic, then probably no one piece of code is the problem and a hardware problem is more likely. This could be bad RAM or something else. Keep in mind, however, that a hardware problem could cause repeated EIP values, so this is not a hard-and-fast rule.
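If you have written the EIP values into your notebook, the comparison is easy to mechanize. The following is a rough sketch of the rule of thumb only. The 60% threshold and the function name are my own assumptions, and the result is a hint about where to look first, never a verdict.

```python
from collections import Counter


def classify_panics(eips, threshold=0.6):
    """Apply the EIP rule of thumb to a list of EIP values (integers).

    threshold is an arbitrary cut-off chosen for this sketch: if at least
    that fraction of the panics share one EIP, suspect software first.
    """
    if len(eips) < 2:
        return "not enough panics to compare"
    value, hits = Counter(eips).most_common(1)[0]
    if hits / len(eips) >= threshold:
        return "suspect software near 0x%08X" % value
    return "suspect hardware"


print(classify_panics([0xF0012A40, 0xF0012A40, 0xF0012A40, 0xF0099000]))
print(classify_panics([0xF0012A40, 0xF0099000, 0xF00455AC]))
```

The addresses here are invented for the example.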
If your system was able to successfully write the dump image to the swap device, then things may be a little simpler. If so, you can examine the dump image and find out the name of the function the system was in when it panicked. This is a lot more useful than just an EIP, because the function name can point to something more specific. Say, for example, the system panicked inside of the Sdskintr() routine. I know that this has to do with the Sdsk (SCSI hard disk) driver. Therefore, I might consider a hardware problem with the SCSI hard disk.
When the system boots after a panic, it recognizes that there is a dump image on the dump device, provided the dump device is /dev/swap. At that time you have the option of saving the image, removing it or simply leaving it alone. If you leave it alone, it remains on the dump device until you decide to remove it. Note that the first time you swap, the image gets trashed.
When you do boot, you need to go into single-user mode (to ensure you don't start swapping). Once there, run crash -d /dev/swap (assuming /dev/swap is the dump device). When you reach the prompt (>), simply type in panic. This will give you a stack trace of the last few system calls executed before the crash. The top one will be the one the system was running when it crashed. If it is not obvious from the name of the function, you can look through the Driver.o files in the sub-directories of /etc/conf/pack.d using either nm or strings to find out which driver the function is in.
Why do I specifically tell you to look in the device driver files? Well, since 99% of the time the cause of the panic is a corrupt or poorly written device driver, this is a good place to start.
One system panic does not necessarily tell you what the problem was. If you run crash and the panic command says that there is something wrong with the Srom driver, you cannot assume there is something wrong with your CD-ROM drive. (Assuming you knew what the Srom driver was.) It could just as easily be a bad spot of RAM that was not detected as having a parity error. If you have multiple panics that seem to point to the same thing, but cannot figure out exactly what the cause is, you should run crash -d <dumpdevice> -o <file> to save the output and run the following functions inside of crash:
panic - prints the panic information
trace - prints a kernel stack trace
user - prints the uarea of the process that was active when the system panicked
proc - prints the process table as it was when the system panicked
Whoever is providing you with your support should be able to make some sense of it.
The problem with this approach is that the kernel is generally loaded in the same way all the time. That is, unless you change something, it will occupy the same area of memory. Therefore, it's possible that bad RAM makes it look like there is a bad driver. The way to verify this is to change where the kernel is loaded in physical memory. You can do this by either rearranging the order of your memory chips or using the mem= option to boot to limit what memory is accessed.
Keep in mind that this technique may not tell you which SIMM is bad, only indicate that you may have a bad one. The only surefire test is to swap out the memory. If the problem goes away with new RAM and returns with the old RAM, you have a bad SIMM.
If you are unfortunate enough to have the system hang or even reboot itself, then there is no dump image to look at and no EIPs to compare. The first place to look is the system log file, /usr/adm/messages. Even if the system did panic, there may be some information there to indicate what went wrong. This is often in the form of a kernel message.
Kernel messages fall into five categories and usually have the format:
category: name: routine message
The category can be one of the following, in increasing order of severity: CONFIG, NOTICE, WARNING, FATAL, and PANIC.
Although not always present, the name represents the device driver or sub-system name having problems. If it is a device driver, you will probably see the major and minor numbers of the offending devices. This makes tracking down the problem a lot easier. The routine portion is also not always present and usually is not as obvious as the major and minor number. However, you can still attempt to track down the device by looking through the Driver.o files.
A CONFIG message normally indicates that the value of one of the kernel tunable parameters has been exceeded. This will be followed by the kernel parameter in question and the value that was exceeded. The remedy is to either increase the parameter or limit access to that resource. For example, if you are running ODT or OpenServer with a maximum value for the size of the process table, you might get a message that says you have exceeded that limit. To correct this, either increase the size of the table (NPROC on ODT or MAX_PROC on OpenServer) or limit the number of users so you don't run out of processes. On the other hand, if the problem is caused by some run-away process, rebooting the system might correct the problem. See the section on monitoring system activity for more details.
A NOTICE is somewhat more urgent than a CONFIG message. This indicates that a situation has occurred that should be monitored. For example, running out of space or inodes on a filesystem would generate a NOTICE. Normally, this is not associated with any kernel parameter so a relink and reboot is not necessary. An example of this would be:
NOTICE: Srom: Not ready on SCSI CD-ROM 0 dev 51/0 (ha=0 id=5 lun=0)
For some reason, my CD-ROM is not ready. This might indicate a hardware problem. However, in my case this came from the fact that I automatically mount a CD-ROM during boot up and this time there was no CD in the drive. But, these can also indicate a more serious problem. For example,
NOTICE: Sdsk: Unrecoverable error reading SCSI disk 0 dev 1/0 (ha=0 id=0 lun=0)
When I ran scsibadblk (used to check bad blocks on a SCSI device) this block came up as bad. I was personally upset at treating a bad block as something comparable to not having a CD in the drive. In my opinion this error requires a higher level message, like a warning.
WARNINGs may require immediate attention. The keyword is "may." It's possible that the warning could be something harmless like this:
WARNING: floppy: Disk is write protected in fd0 dev 2/60
On the other hand, it could be something as dramatic as:
WARNING: floppy: Read error on dev 2/60, block=20 cmd=0x03 status=0x01
That one indicates a problem that might make me lose data. To me, a write-protected floppy is at the same level as not having a CD in the drive. Being able to determine the severity of the problem from these messages is not always simple.
FATAL errors are not happy things. These can be the result of hardware problems such as:
FATAL: Parity error in the motherboard memory
This means you need to replace some of your RAM. However, this kind of message can also be the result of fat fingers. For example:
FATAL: Bad bootstring syntax - kernel.auito
Look carefully. There is an extra 'i' in "auito." This was the result of me wanting to go right into multi-user mode by typing unix.auto at the Boot: prompt. Because /boot didn't know what to do with my typo, I got this message. FATAL messages usually appear before the system panics. Therefore, it is a good idea to keep track of these messages. This is also a case where the FATAL message is immediately followed by:
PANIC: Illegal bootstring, cannot continue
Some things are just so picky.
Lastly is our old friend PANIC. When it gets to this stage, things are too severe to continue. It is rare that software is the cause, but do not discount it completely. Corrupt software can cause panics, as can drivers that were designed for a different release of the operating system. If you have just installed some new hardware that requires a relink and reboot, and then your system panics, this is a good sign that there is a hardware problem. Check the release of the driver to make sure it is supported. If it is, swap out the hardware.
If the kernel is in the process of panicking and something else occurs that would normally cause a panic, then a double panic occurs. Although this sounds a bit more serious, they may have the same cause. Therefore, treat a double panic as you would a single panic.
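If you want to sift through a long /usr/adm/messages file, the message format and the five categories lend themselves to a small parser. The sketch below is only an illustration built on the sample messages in this section. The regular expressions are my own guess at the general shape of the messages, not a documented format, and the optional routine portion is simply left as part of the message text.

```python
import re

# Categories in increasing order of severity, as listed in the text.
SEVERITY = ["CONFIG", "NOTICE", "WARNING", "FATAL", "PANIC"]

MESSAGE_RE = re.compile(
    r"^(?P<category>CONFIG|NOTICE|WARNING|FATAL|PANIC):\s*"
    r"(?:(?P<name>\w+):\s*)?"  # driver or sub-system name, not always present
    r"(?P<text>.*)$"
)
DEVICE_RE = re.compile(r"\bdev (?P<major>\d+)/(?P<minor>\d+)")


def parse_kernel_message(line):
    """Split one kernel message into category, name, text and device."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None  # not a kernel message in the format described
    info = match.groupdict()
    info["severity"] = SEVERITY.index(info["category"])
    dev = DEVICE_RE.search(info["text"])
    info["device"] = (int(dev.group("major")), int(dev.group("minor"))) if dev else None
    return info


sample = "NOTICE: Srom: Not ready on SCSI CD-ROM 0 dev 51/0 (ha=0 id=5 lun=0)"
print(parse_kernel_message(sample))
```

Sorting the parsed messages by the severity field puts the ones most worth reading at the end of the list, though as we saw, the category alone does not always tell you how serious the problem really is.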
Okay, so we know what types of problems can occur. How do we correct them? If you have a contract with a consultant, this might be part of that contract. Take a look at it and read it. Sometimes the consultants are not even aware of what is in their own contracts. I have talked to customers who have had consultants charge them for maintenance or repair of hardware, insisting that it was an extra service. However, the customer would whip out the contract and show them that this was included.
If you are not fortunate enough to have such an expensive contract, then you will obviously have to do the detective work yourself. If the printer catches fire, then it is pretty obvious where the problem is. However, if the printer just stops working, figuring out what is wrong is often difficult. Well, I like to think of problem solving the way Sherlock Holmes described it in "The Seven Percent Solution" (and maybe other places):
"Eliminate the impossible and whatever is left over, no matter how improbable, must be the truth."
Although this sounds like a basic enough statement, it is often difficult to know where to begin eliminating things. In simple cases, we can begin by eliminating almost everything. For example, suppose we were having system hangs every time we used the tape drive. It would be safe at this point to eliminate everything but the tape drive. So, the next big question is whether it is a hardware problem or not.
Potentially, the portion of the kernel containing the tape driver is corrupt. In this case, simply relinking the kernel may be enough to correct the problem, because when you relink, you link in a new copy of the driver. If that is not sufficient, then restoring the driver from the distribution media is the next step. However, depending on your access to the media, checking the hardware might be easier.
If this tape drive requires its own controller and you have access to another controller or tape drive, you can swap components to see if the behavior changes. However, just like you don't want to install multiple pieces of hardware at the same time, you don't want to swap multiple pieces. If you do and the problem goes away, was it the controller or the tape drive? If you swap out the tape drive and the problem goes away that would indicate that the problem was in the tape drive. However, does the first controller work with a different tape drive? You may have two problems at once.
If you don't have access to other equipment that you can swap, then there is little that you can do other than verifying that it is not a software problem. I had at least one case while in SCO Support where a customer called in insisting that our driver was broken because he couldn't access the tape drive. Since the tape drive worked under DOS and the tape drive was listed as supported, either the doc was wrong or something else was. Relinking the kernel and replacing the driver had no effect. We checked the hardware settings to make sure there were no conflicts, but everything looked fine.
Well, we had been testing it using tar the whole time, since tar is quick and easy when you are trying to do tests. When we ran a quick test using cpio, the tape drive worked like a champ. When we tried outputting tar to a file, it failed as well. Once we replaced the tar binary, everything worked correctly.
If the software behaves correctly, then there is the potential for conflicts. This only occurs when adding something to the system. If you have been running for some time and suddenly the tape drive stops working, then it is unlikely that there are conflicts. Unless, of course, you just added some other piece of hardware. If problems arise after adding hardware, remove it from the kernel and see if the problem goes away. If it doesn't go away, remove the hardware physically from the system.
Another issue that people often forget is cabling. I have done it myself where I had a new piece of hardware and after relink and reboot, something else doesn't work. After removing it again, the other piece still doesn't work. What happened? When I added the hardware, I loosened the cable on the other piece. Needless to say, pushing the cable back in fixed my problems.
I have also seen cases where the cable itself is bad. One support engineer reported a case to me where just pin 8 was bad. Depending on what was being done, the cable might work. Needless to say, this problem was not easy to track down.
Potentially, the connector on the cable is bad. If you have something like SCSI, where you can change the order on the SCSI cable without much hassle, this is a good test. If you switch hardware and the problem moves from one device to the other, this could indicate one of two things. Either the termination is messed up or the connector is bad.
If you do have a hardware problem, it is often the result of a conflict. If your system has been running for a while and you just added something, then it is fairly obvious what is conflicting. If you have trouble installing, then it is not always clear. In such cases, the best thing is to remove everything from your system that is not needed for the install. In other words, strip your machine to the "bare bones" and see how far you get. Then add one piece at a time. Once the problem re-occurs, you know you have the right piece.
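The bare-bones approach is really just a loop, and writing it down as one makes the rule about changing a single thing at a time hard to forget. In this sketch the system_works argument stands in for you installing or booting the machine and reporting what happened. Both it and the function name are hypothetical.

```python
def find_conflicting_part(required, optional, system_works):
    """Add optional parts one at a time until the problem re-occurs.

    required     -- parts needed for the bare-bones system
    optional     -- parts to add back, one per attempt
    system_works -- callable taking the list of installed parts and
                    returning True if the system behaves (hypothetical;
                    in real life this is you doing the test)
    Returns the part whose addition brought the problem back, or None.
    """
    installed = list(required)
    if not system_works(installed):
        raise ValueError("the problem is already present in the bare-bones system")
    for part in optional:
        installed.append(part)  # change exactly one thing per attempt
        if not system_works(installed):
            return part
    return None


# Simulated machine: the network card and tape controller conflict.
def simulated(parts):
    return not {"network card", "tape controller"} <= set(parts)


print(find_conflicting_part(
    ["video card", "disk controller"],
    ["network card", "sound card", "tape controller"],
    simulated,
))
```

Note that the loop points at the piece that brought the problem back, which may be only one half of a conflicting pair.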
When trying to track down a problem yourself, remain calm. Keep in mind that if the hardware or software is as buggy as you now think it is, the company would be out of business. It's probably one small point in the doc that you skipped over (if you even read the doc) or there is something else in the system conflicting with it. Getting upset does nothing for you. In fact, (speaking from experience) getting upset can cause you to miss some of the details that you're looking for.
As you are trying to track down the problem yourself, examine the problem carefully. Can you tell if there is a pattern to when/where the problem occurs? Is the problem related to a particular piece of hardware? Is it related to a particular software package? Is it related to the load that is on the system? Is it related to the length of time the system has been up? Even if you can't tell what the pattern means, the support rep has one more piece of information to help in tracking down the problem. Did you just add a new piece of hardware or SW? Does removing it correct the problem? Did you check to see if there are any HW conflicts such as base address, interrupt vectors and DMA channels?
I have talked to customers who were having trouble with one particular command. They insisted that it did not work correctly and that there was therefore a bug in either the software or the doc. Since they were reporting a bug, we allowed them to speak with a support engineer even though they did not have a valid support contract. They kept saying that the doc sucked because the SW did not work the way it was described in the manual. After pulling some teeth, I discovered that the doc they were using was for a product that was several years old. In fact, there had been three releases since then. They were using the latest software, but the doc was from the older release. No wonder the doc didn't match the software.
It may happen that the system crash you just experienced no longer allows you to boot your system. What then? The easiest solution (at least easiest in terms of figuring out what to do) is reinstalling. If you have a recent backup and your tape drive is fairly fast, this is a valid alternative, provided there is no hardware problem causing the crash.
In an article I wrote for SCO's DiSCOver magazine, I compared a system crash to an earthquake. The people that did well after the 1989 earthquake in Santa Cruz were the ones that were most prepared. The people that do well after a system crash are also the ones that are best prepared. Like an earthquake, the first few minutes after a system crash are crucial. The steps you take can make the difference between a quick, easy recovery and a forced re-install.
In previous sections we talked about the different kinds of problems that can happen on your system, so there is no need to go over them again here. Instead we will concentrate on the steps to take after you reboot your system and find that something is wrong. It's possible that when you reboot, all is well and it will be another six months before that exact same set of circumstances occurs. On the other hand, your screen may be full of messages as the system tries to bring itself up again.
Because of the urgent nature of system crashes and the potential loss of income, I decided that this was one troubleshooting topic I would hold your hand on. There is a set of common problems that occur after a system crashes that need to be addressed. Although the cause of the crash can be any of a wide range of different events, the set of results of a crash is small by comparison. With this in mind, and the importance of getting your system running again, this is one place where I am going to forget what I said about giving you cookbook answers to specific questions.
Let's first talk about those cases where we can no longer boot at all. For this, you need to think back to our discussion of starting and stopping the system and consider the steps the system goes through when booting. I talked about them in detail before, so I will only review them here as necessary to describe the problems.
As I mentioned, when you power on a computer, the first thing is the Power-On Self-Test, or POST. If something is amiss during the POST, you will usually hear a series of beeps. Hopefully, there will be some indication on your monitor of what the problem is. It can be anything from incorrect entries in your CMOS to bad RAM. If not, maybe the hardware documentation has something about what the beeps mean.
When finished with the POST, the computer executes code that looks for a device from which it can boot. On an ODT or OpenServer system, this boot device will more than likely be the hard disk. The built-in code finds the active partition on the hard disk and begins to execute the code at the beginning of the disk. What happens if the computer cannot find a drive to boot from depends on your hardware. Often there will be a message indicating that there is no bootable floppy in drive A. It is also possible that the system simply hangs.
If you have a hard disk installed and it should contain valid data, then potentially your masterboot block is corrupt. If you created the boot/root floppy set like I told you, then you can use fdisk from it to recreate the partition table using the values from your notebook. Load the system from your boot/root floppy set and run fdisk.
Booting from the floppy works much like booting from the hard disk. With the floppy in the drive, you boot your system. When you get to the Boot: prompt, you simply press ENTER. After loading the kernel, the system prompts you to insert the root filesystem floppy. You do that and press ENTER. A short time later, you are brought to the # prompt, from where you can begin to issue commands.
When you run fdisk, what you will probably see is an empty table. Because you made a copy of your partition table in your notebook like I told you to do, you simply fill in the values exactly the way they were before. Be sure to make the same partition active that was active before. Otherwise, you either won't be able to boot, or you will boot but risk corrupting your filesystem. When you exit fdisk, it will write out a copy of the master boot block to the beginning of the disk. When you reboot, things will be back to normal.
(I've talked to at least one customer who literally laughed at me when I told him to do this. He insisted that it wouldn't work and that I didn't know what I was talking about. Fortunately for me, each time I suggested it, it did work. However, I have worked on machines myself where it didn't work. With a success rate well over 50%, it's obviously worth a try.)
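Before retyping a partition table, it can help to look at what is actually in the first sector. The following is only a sketch. It assumes the standard PC layout (446 bytes of boot code, four 16-byte partition entries whose first byte is 0x80 when active, and the signature bytes 0x55 0xAA at the end) and it works on a sample file built by the script itself. The file name sector0 is made up for the illustration, and how you would save a real first sector to a file from the boot/root floppy is not covered here.

```sh
#!/bin/sh
# Sketch only: works on a scratch file, never on a real disk.
work=$(mktemp -d)
sector="$work/sector0"   # hypothetical file holding a copy of the first 512 bytes

# Build a sample sector: 446 bytes of boot code, four 16-byte partition
# entries (only the second is flagged active), then the 0x55 0xAA signature.
{
    head -c 446 /dev/zero
    head -c 16 /dev/zero                   # entry 1: inactive (flag 0x00)
    printf '\200'; head -c 15 /dev/zero    # entry 2: active   (flag 0x80)
    head -c 32 /dev/zero                   # entries 3 and 4: inactive
    printf '\125\252'                      # signature bytes 0x55 0xAA
} > "$sector"

# Print the byte at a given offset of a file as two hex digits.
byte_at() {   # usage: byte_at FILE OFFSET
    od -An -tx1 -j "$2" -N 1 "$1" | tr -d ' \n'
}

# Count the entries whose first byte is the active flag 0x80.
active=0
for off in 446 462 478 494; do
    if [ "$(byte_at "$sector" "$off")" = "80" ]; then
        active=$((active + 1))
    fi
done

signature="$(byte_at "$sector" 510)$(byte_at "$sector" 511)"
echo "active partitions: $active, signature: $signature"
```

A healthy sector should report exactly one active partition and a signature of 55aa. Anything else suggests the table needs to be re-entered from your notebook.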
However, if you did not follow my friendly advice and write down the fdisk parameters, all is not lost. When you installed your SCO system, it made a copy of the masterboot block and stored it as /etc/masterboot. When you create a boot/root floppy set, this file is copied onto the root floppy. By using dd, which is also on your root floppy, you can rewrite the masterboot block. The command would be:
/bin/dd bs=376 count=1 if=/etc/masterboot of=/dev/rhd00
This means that dd will copy one 376-byte block from /etc/masterboot to /dev/rhd00, which is your boot hard disk. Be careful when you type this. If you mis-type the size and make it 736 or type count=10 you'll need to get out your installation media and start over.
One thing that I would like to point out is that the /etc/masterboot file is not updated automatically. If you still have root on your hard disk and later add partitions, you could swap the if and of entries in the above command, which updates the /etc/masterboot file to reflect your current masterboot block. If you have written down the configuration, then this is not a problem. You can have a corrupt or outdated /etc/masterboot and it won't matter, because you can always use fdisk and input the values by hand.
Some of you might be thinking that once the system is installed, the partition table isn't going to change. Well, if you have used the entire disk for UNIX, this is probably true. However, if you are like me and have multiple operating systems, the issue is not so simple. Once, I had two DOS partitions and a single ODT 3.0 partition on my first drive. When OpenServer arrived, I did not want to remove ODT. Instead, I reconfigured my DOS partitions so that the first was smaller, and I moved everything in the second partition onto a new drive. I then installed OpenServer in the leftover space. From OpenServer's perspective, /etc/masterboot is still valid. However, from ODT's perspective it is not.
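Because a typo in this dd command is so costly, it may be worth rehearsing the save and restore on scratch files first. The sketch below does that. The files fake_rhd00 and masterboot are stand-ins invented for the rehearsal, not the real /dev/rhd00 and /etc/masterboot. The conv=notrunc option is needed only because the target here is an ordinary file, which dd would otherwise truncate; whether the dd on your root floppy accepts it is an assumption you would have to check.

```sh
#!/bin/sh
# Rehearsal on scratch files only; nothing here touches a real disk.
work=$(mktemp -d)
disk="$work/fake_rhd00"     # hypothetical stand-in for /dev/rhd00
saved="$work/masterboot"    # hypothetical stand-in for /etc/masterboot

# Fake disk: 376 bytes of "boot code" (M) followed by 648 bytes of other data (x).
{
    head -c 376 /dev/zero | tr '\0' 'M'
    head -c 648 /dev/zero | tr '\0' 'x'
} > "$disk"
cp "$disk" "$work/original"

# Save the current boot block: if= and of= are swapped relative to the restore.
dd bs=376 count=1 if="$disk" of="$saved" 2>/dev/null

# Simulate corruption by zeroing the first 376 bytes of the fake disk.
head -c 376 /dev/zero | dd of="$disk" conv=notrunc 2>/dev/null

# Restore exactly one 376-byte block from the saved copy.
dd bs=376 count=1 if="$saved" of="$disk" conv=notrunc 2>/dev/null

saved_size=$(wc -c < "$saved" | tr -d ' ')
if cmp -s "$disk" "$work/original"; then restored=yes; else restored=no; fi
echo "saved $saved_size bytes; disk restored: $restored"
```

If the rehearsal reports anything other than a 376-byte saved copy and a restored disk, the command line is wrong and should not be tried against the real device.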
If the hardware sees that you have a hard disk, but cannot find a valid, active partition, you may see the message:
This message is the result of either a corrupt masterboot block or a corrupt boot0. If it is caused by a corrupt masterboot block, you can boot from the boot/root floppy and recreate it as I described above. Also, if the drive parameters are wrong, the system ends up looking at a different part of the disk from where it should and you get this message. If you have a SCSI hard disk, then it is unlikely that wrong drive parameters are the cause of the problem.
The first thing to check is the hard disk parameters (assuming you don't have a SCSI hard disk). To see what the current parameters are, type:
dkinit 0 0
which indicates the first drive on the first controller. You then have a menu from which you select option 1 to display the current disk parameters. If these parameters match your hard disk, then all is well. Leave them alone. If they don't match, you can modify them with the correct values. How do you know what the correct values are? This is one of those things that I told you to write down in your notebook.
You may find that the partition table was valid. If so, the problem was more likely the result of a corrupt boot0 or boot1. Unfortunately, there is no magic command you can run to replace them the way you can with the masterboot block. However, there are also copies of them on the system, so you can use dd to copy them onto the hard disk, like this:
/bin/dd if=/etc/hdboot0 of=/dev/hd0a
/bin/dd if=/etc/hdboot1 of=/dev/hd0a bs=1k seek=1
If you think back to the section where we discussed the hard disk layout, you'll remember that boot0 is a relatively small section on the hard disk. You see that the size of /etc/hdboot0 is less than 1K. When we dd it out to the hard disk, dd only writes as many bytes as are in the file. Next, we have boot1. This is a little larger, but starts 1K from the beginning of the partition. That's why we need to seek in 1 block. Here we set the block size to 1K (bs=1k). Note that if you made a mistake and put in a block size of 2k, then dd would seek in 2K before starting to write and you would end up overwriting something.
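The arithmetic behind the seek is simply block size times seek count. As a sketch, the script below works that out and then repeats the two dd commands against a scratch file. The names fake_hd0a, hdboot0 and hdboot1 inside the temporary directory are stand-ins made up for the demonstration, and conv=notrunc is there only because the target is an ordinary file rather than a device.

```sh
#!/bin/sh
# dd starts writing at (block size) x (seek count) bytes into the output.
seek_offset() {   # usage: seek_offset BLOCK_SIZE_BYTES SEEK_COUNT
    echo $(( $1 * $2 ))
}
intended=$(seek_offset 1024 1)   # bs=1k seek=1
mistaken=$(seek_offset 2048 1)   # bs=2k seek=1

# Scratch files only; nothing here touches a real partition.
work=$(mktemp -d)
part="$work/fake_hd0a"           # hypothetical stand-in for /dev/hd0a
head -c 4096 /dev/zero | tr '\0' '.' > "$part"
printf 'BOOT0' > "$work/hdboot0" # stand-in for /etc/hdboot0
printf 'BOOT1' > "$work/hdboot1" # stand-in for /etc/hdboot1

# Same shape as the real commands: boot0 at the start, boot1 one 1K block in.
dd if="$work/hdboot0" of="$part" conv=notrunc 2>/dev/null
dd if="$work/hdboot1" of="$part" bs=1k seek=1 conv=notrunc 2>/dev/null

# Read back five bytes from offset 0 and from offset 1024.
at_start=$(dd if="$part" bs=1 count=5 2>/dev/null)
at_1k=$(dd if="$part" bs=1 skip=1024 count=5 2>/dev/null)
echo "bs=1k seek=1 writes at byte $intended; bs=2k seek=1 would write at byte $mistaken"
echo "offset 0: $at_start, offset 1024: $at_1k"
```

Reading the bytes back this way confirms where each piece landed before you trust the same block size and seek count on a real partition.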
At this point, the system is trying to load and run the first real program: /boot. If you are running ODT, then this is in the root directory of your root filesystem and on OpenServer, this is in the root directory of the /dev/boot filesystem. For simplicity's sake, let's just call these both the "boot" filesystem. Since the two operating systems use the same files to boot and just their name is different, calling each the boot filesystem makes life easier.
If boot1 runs into trouble loading /boot, there are several things that could cause this. The easiest to correct is if the division table has become corrupt. If you run divvy from your boot/root floppy set and see an empty table, like the fdisk table, you can simply input the values from your notebook. (Are you beginning to understand why this notebook is so important?) Unfortunately, a copy of the division table is not something that is kept. If your division table is messed up and you didn't write down the values, you're hosed.
One thing is very important when inputting the values into the division table, and that is: don't create the filesystem. Only put in the values for the starting and ending block. You don't even have to name them. (Remember the name comes from the device node?) If you create them, a new filesystem will be created. This means that all the data will be lost.
Well, not entirely. You see, all that is done is that the inode table gets re-created. It's not like the disk is formatted. This is similar to a quick format under DOS where the FAT is overwritten. The data is still there, but there are no pointers to the data.
I accidentally did this on my own system with no backups. Since this was the second hard disk and there was only one filesystem, things were easier. The data was only text files and the disk was relatively unfragmented, so I used a series of dd commands to write 1MB files onto my other partition. I could then look through these files and decide if there was anything of value in them. It took several hours, but I estimate I had a recovery rate of at least 95%. Not bad, but I wouldn't recommend trying it yourself.
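A series of dd commands like that can be written as a loop. What follows is a sketch of the idea rather than the exact commands used at the time. It reads a sample file created by the script (fake_partition, a made-up stand-in for the damaged device) and writes numbered 1MB pieces into a scratch directory, stopping when dd reads nothing more.

```sh
#!/bin/sh
# Sketch on scratch files only; the "partition" here is an ordinary file.
work=$(mktemp -d)
raw="$work/fake_partition"       # hypothetical stand-in for the damaged device
out="$work/chunks"               # where the 1MB pieces are written
mkdir "$out"

# Sample input: 2.5MB of filler so the last piece is a partial one.
head -c 2621440 /dev/zero | tr '\0' 'a' > "$raw"

chunk=1048576                    # 1MB per piece
i=0
while :; do
    piece="$out/chunk.$i"
    # skip=$i jumps over the pieces already copied; count=1 copies one more.
    dd if="$raw" of="$piece" bs="$chunk" skip="$i" count=1 2>/dev/null
    if [ ! -s "$piece" ]; then   # nothing read: past the end of the input
        rm -f "$piece"
        break
    fi
    i=$((i + 1))
done

pieces=$i
last_size=$(wc -c < "$out/chunk.$((pieces - 1))" | tr -d ' ')
echo "wrote $pieces pieces; last piece is $last_size bytes"
```

Each piece can then be paged through or searched for recognizable text. On a real recovery, the pieces must of course be written to a different disk or partition from the one being read.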
If /boot is not there or otherwise can't be loaded you end up with a message similar to:
boot not found
Stage 1 boot failure: error loading hd(40)/boot
Since you need /boot to boot off the floppy, this is a logical place to get it from. The first thing you need to do is to boot from the boot/root floppies, of course. Since you are actually copying files and not dd'ing the contents onto the hard disk, you have to first mount the boot filesystem. If the boot filesystem went down dirty, then the mount will fail since you need to clean it first. More than likely /dev/boot on OpenServer is clean since /dev/boot is normally mounted read-only. Even if it was clean, there is no harm in cleaning it again. Therefore, before I mount the boot filesystem, I always run fsck.
If there were a lot of pipes open when the system went down, there will be a lot of unreferenced files. Therefore, you might find yourself pressing y to the prompt to clear all these unreferenced, zero length files. Instead, you could start fsck with the -y option to have it assume "yes" to each prompt. The question is: "Do you feel lucky?"
Actually, it isn't that bad. If the file contains data, it ends up in /lost+found. If not, it gets trashed. Why would you want a lot of zero length files taking up vital directory entries in lost+found?
My suggestion is you mount the boot filesystem like this if you are running ODT:
/etc/mount /dev/root /mnt
and like this, if you are running OpenServer:
/etc/mount /dev/boot /mnt
Now you can copy /boot like any other program:
/bin/cp /boot /mnt
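Put together, the clean, mount and copy steps can be sketched as a small script. This is a dry run by default: it only prints the commands so that you can check them against your own system. The function names run and restore_boot and the DRY_RUN variable are made up for this illustration, and calling fsck without a full path is an assumption about where it lives on your floppy.

```sh
#!/bin/sh
# Dry-run sketch: by default it only prints the commands it would run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restore_boot() {   # usage: restore_boot odt|openserver
    case "$1" in
        odt)        dev=/dev/root ;;   # ODT: /boot lives in the root filesystem
        openserver) dev=/dev/boot ;;   # OpenServer: separate boot filesystem
        *) echo "usage: restore_boot odt|openserver" >&2; return 1 ;;
    esac
    # Clean first, then mount, then copy; stop at the first step that fails.
    run fsck -y "$dev" &&
    run /etc/mount "$dev" /mnt &&
    run /bin/cp /boot /mnt
}

restore_boot openserver
```

The filesystem is deliberately left mounted on /mnt at the end, since there is more to look at there before rebooting.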
Now is a good time to check and see if there is a kernel on the boot hard disk. It doesn't have to be the most recent one (probably called unix); anything in the root directory of the boot filesystem will do (e.g., unix.old, unix.orig, unix.N1). If so, you can copy that to /mnt/unix. Once you get past /boot and can load your kernel, you are on the hard disk and have a lot more options in terms of how to proceed.
If you discover that your root filesystem is too corrupt to mount and fsck fails, you can try to mount the filesystems as read-only and dump the filesystem to tape. If this works, you at least have access to your files.
One thing I didn't mention yet is that we don't have to copy many of these things from the floppy to the hard disk. We can take advantage of some of the boot magic I talked about in the section on starting and stopping your system. Assume that just the masterboot block is corrupt, so we can't boot from the hard disk. If we can get to the Boot: prompt from the floppy, then we can access the kernel and the root filesystem on the hard disk. For example, if we had OpenServer, at the Boot: prompt, we could type in:
hd(40)unix swap=hd(41) dump=hd(41) root=hd(42)
In each case, it uses the hard disk (hd) driver, first getting the kernel off of minor 40, then using minor 41 for the dump and swap devices, and finally using minor 42 for the root filesystem. Notice that this takes into account the different filesystems that /unix is on versus the root filesystem. For ODT the same command would look like this:
hd(40)unix swap=hd(41) dump=hd(41) root=hd(40)
If, for some reason, there was no kernel in the boot filesystem, you could change the location from which Boot: gets the kernel. Hopefully, the one on the floppy works, so you could use that instead. Here the command might look like this for OpenServer:
fd(64)unix swap=hd(41) dump=hd(41) root=hd(42)
This takes the kernel from the floppy device (fd) with a minor number of 64, which is /dev/fd0.
If we wanted to extend this, we could take advantage of boot aliasing and modify /etc/default/boot on the floppy. We could create an alias that took the kernel and root filesystem from the hard disk. It might look like this:
HDUNIXROOT=hd(40)unix swap=hd(41) dump=hd(41) root=hd(42)
Therefore, when we boot off the boot floppy and get to the Boot: prompt, we just type in hdunixroot and it executes that bootstring.
If you have a system crash, the "safest" thing is to reinstall. However, "safe" doesn't always mean best. If the crash was caused by a unique set of hardware and software circumstances that won't occur for another six months, then re-installing probably won't fix the problem. Even if you have the CD-ROM distribution, re-installing means several hours of down time. If you have the company president or a horde of angry users breathing down your neck, this is not a realistic option. You need to get the computer up and running as soon as possible. You need to determine the problem, find a solution and get everyone else back to work.
If you try to boot from the hard disk and get either of these messages:
stage 1 boot failure not a directory
PANIC srmountfun- Error 6 mounting rootdev (1/40)
or any PANIC with srmountfun, then the best thing is probably to restore from backups. These are essentially saying that the system does not recognize the root filesystem. If that's the case, you cannot continue. Out of the dozens of crashes I've had to work users through, only once was this corrected by going through all of the steps I described above.
If you have good, current, reliable backups of your programs and data, then the most reliable method of crash recovery is to restore from backup. If not, there are many professional data recovery services, which stand a good to excellent chance of recovering your data. They are relatively expensive and the turn-around time may be a week or longer. If it is imperative that you recover your data, this is your best chance.
A common problem is that the system appears to hang when it reaches the prompt to press CTRL-D to continue or enter the root password for maintenance mode. Neither CTRL-D nor the root password seems to work. Pressing ENTER would normally just repeat that message. However, that too appears not to work. This problem occurs often after the system crashes, as well as when it is shut down improperly.
The cause of this problem is a corrupt (not missing) /etc/ioctl.syscon file. This file controls I/O for the device /dev/syscon, which is what is accepting input at this moment. To correct the problem, you remove the file. When the system reboots, it sees the file is missing and recreates it. Fortunately, you don't have to boot from the floppy to remove it. The issue here is that, with the file corrupt, the console is not behaving correctly. Instead of pressing the ENTER key, you press CTRL-J. Everything else should work correctly. So, if you want to go into multi-user mode, pressing CTRL-D and then CTRL-J starts you on your way. If you enter the root password followed by CTRL-J, this will bring you into maintenance mode.
Both the SCO OpenServer Handbook and the ODT 3.0 System Administrator's Guide contain steps for recovery from additional boot problems. They also include many of the problems we discussed here, but I felt that the importance of the issue as well as the frequency of these problems warranted repeating the information.
Experience has taught me that sooner or later you will get to a problem that you cannot solve. No matter how many changes you make to the configuration files and no matter how many times you reboot, the problem just won't go away. If the problem is the result of a bug, then SCO Support will help you, even without a support contract. There are often patches available that fix many known problems. These are called Support Level Supplements, or SLSs. SLSs are available for download via UUCP, ftp and SCO's Web Server.
In addition to SLSs, there are Enhanced Features Supplements, or EFSs. These are not patches or bug fixes, but rather enhancements to the system. As a result, many are not available free of charge. Depending on where the EFS came from and what features it includes, you may have to pay for more than just the media costs, such as royalties or "development" costs.
SCO also provides free access to their Information Tools (IT) scripts via the SCO Web Server. This is a set of thousands of articles covering everything from a description of virtual memory to details on how to get multiple SCO operating systems on the same partition to instructions on how to overcome bugs and other problems.
Available on CD is the SCO Support Services Library. This contains the IT scripts as well as character-based and X-based search and viewing programs. Along with the IT scripts are many of the SLSs and EFSs. This is available through a yearly subscription from SCO. Although the price might seem a bit much at first, it is well worth the money considering the time saved by having the information and supplements immediately available.
For more tips on getting help, I suggest you read the next chapter.
Table 0.1 Files Used In Problem Solving