Jim Mohr's SCO Companion


Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/


The Operating System and Its Environment

In this chapter we are going to go into some detail about what makes up an SCO UNIX operating system. I am not talking about the product SCO UNIX or either of the bundled products ODT or OpenServer, but strictly about the software that manages and controls your computer.

Since an operating system is of little use without hardware and other software, we are going to discuss how the operating system interacts with other parts of the ODT and OpenServer products. We will also talk about what goes into making the kernel, what components it is made of and what you can do to influence the creation of a new kernel.

Much of this information is far beyond what many system administrators are required to know for their jobs. So why go over it? Because what is required and what the administrator should know are two different things. Many of the calls I received while in SCO support could have been avoided had the administrator understood the meaning of a message on the system console or the effects of making changes. By going over the details of how the kernel behaves, you will be in a better position to understand what is happening.

The Kernel - The Heartbeat of SCO UNIX

If any single aspect of the SCO product could be called "UNIX," then it would be the kernel. Okay. So what is the kernel? Well, on the hard disk it is represented by the file /unix. (On SCO OpenServer, this is a symbolic link to /stand/unix.) Just as a program like /bin/date is a collection of bytes that isn't very useful until it is loaded in memory and running, the same applies to /unix.

However, once the /unix program is loaded into memory and starts its work, it becomes "the kernel" and has many responsibilities. Perhaps the two most important are process management and file management. However, the kernel is responsible for many other things. One aspect is I/O management, which is essentially the accessing of all the peripheral devices. The kernel is also responsible for security, ensuring that only authorized users gain access to the system and that, once in, they only do what they should.

Processes

From the user's perspective, perhaps the most obvious aspect is process management. This is the part of the kernel that ensures that each process gets its turn to run on the CPU. This is also the part that makes sure that individual processes don't go around "trouncing" on other processes by writing to areas of memory that belong to someone else. To do this, the kernel keeps track of many different structures that are maintained both on a per-user basis and system-wide.

As we talked about in the section on operating system basics, a process is the running instance of a program (a program simply being the bytes on the disk). One of the most powerful aspects of SCO UNIX is its ability not only to keep many processes in memory at once, but to switch between them fast enough to make it appear as if they were all running at the same time.

As a process is running, it works within its context. It is also common to say that the CPU is operating within the context of a specific process. The context of a process is all of the characteristics, settings, values, etc., which that particular program uses as it runs, as well as those that it needs to run. Even the internal state of the CPU and the contents of all its registers are part of the context of the process. When a process has finished having its turn on the CPU and another process gets to run, the act of changing from one process to another is called a context switch.


Figure 0-1 Context Switch

We can say that a process' context is defined by its uarea (also called its ublock). The uarea contains information such as the effective and real UID, effective and real GID, system call error return value and a pointer to the system call arguments on that process's system stack.

The structure of the uarea is defined by the user structure in <sys/user.h>. There is a special part of the kernel's private memory that holds the uarea of the currently running process. When a context switch occurs, it is the uarea that is switched out. All other parts of the process remain where they are. The uarea of the next process is copied into the exact same place in memory as the uarea for the old process. This way the kernel does not have to make any adjustments and knows exactly where to look for the uarea. It will always be able to access the uarea of the currently running process by accessing the same area in memory.

One of the pieces of information that the uarea contains is the process' Local Descriptor Table (LDT). A descriptor is a 64-bit data structure that is used by the process to gain access to different parts of the system, that is, different parts of memory or different segments. Despite a common misunderstanding, SCO UNIX does use a segmented memory architecture. In older CPUs, segments were a way to get around memory access limitations. By referring to memory addresses as offsets within a given segment, more memory could be addressed than if memory were looked at as a single block. The key difference with SCO UNIX is that each of these segments is 4GB and not the 64K they were originally.

The descriptors are held in descriptor tables. The LDT is used to keep track of a process' segments, also called regions. That is, these are descriptors that are local to the process. The Global Descriptor Table (GDT) keeps track of the kernel's segments. Since there are many processes running, there will be many LDTs. These are part of the process' context. However, there is only one GDT, as there is only one kernel.

Within the uarea (also called the ublock) is a pointer to another key aspect of a process' context: its Task State Segment (TSS). The TSS contains all the registers in the CPU. It is the contents of all the registers that defines the state the CPU is currently running in. In other words, the registers say what a given process is doing at any given moment. Keeping track of these registers is obviously vital to the concept of multi-tasking.

By saving the registers in the TSS, the system can reload them when this process gets its turn again. Since all of the registers are reloaded to their previous values, the process simply picks up where it left off as if nothing had happened. If you are curious about all that the TSS holds, take a look in <sys/tss.h>.

This brings up two new issues: system calls and stacks. A system call is a programming term for a very low-level function. These are functions that are "internal" to the operating system and are used to access the internals of the operating system, such as the device drivers that ultimately access the hardware. Compare this to library calls, which are made up of system calls.

A stack is a means of keeping track where a process has been. Like a stack of plates, objects are pushed onto the stack and popped off the stack. Therefore, things that are pushed onto the stack last are the first things popped off. When calling routines, certain values are pushed onto the stack for safe-keeping. These include the variables to be passed to the function, plus the location the system should return to after completing the function. When returning from that routine, these values are retrieved by popping them off the stack.
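
To make the push/pop idea concrete, here is a minimal sketch of a last-in, first-out stack in C. It is only an illustration of the principle, not how the process stack is actually implemented (the real stack is managed by the CPU and the compiler):

#include <stdio.h>

#define STACK_SIZE 16

static int stack[STACK_SIZE];   /* storage for the stack            */
static int top = 0;             /* index of the next free slot      */

/* push a value onto the stack */
static void push(int value)
{
    if (top < STACK_SIZE)
        stack[top++] = value;
}

/* pop the most recently pushed value off the stack */
static int pop(void)
{
    return (top > 0) ? stack[--top] : -1;
}

int main(void)
{
    push(1);                    /* pushed first ...                 */
    push(2);
    push(3);                    /* ... pushed last                  */

    /* values come back off in the reverse order: 3, 2, 1 */
    printf("%d\n", pop());
    printf("%d\n", pop());
    printf("%d\n", pop());
    return 0;
}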

If you look in <sys/user.h> at the size of the system call argument pointer (u_ap) you will see it is only large enough to hold six arguments, each four bytes long. Therefore, you will never see a system call with more than six arguments.

Part of the uarea is a pointer to that process' entry in the process table. The process table, as its name implies, is a table containing information about all the processes on the system whether that process is currently running or not. Each entry in the process table is defined in <sys/proc.h>. The principle that a process may be in memory, but not actually running is important and we will get into more details about the life of a process shortly.

In ODT 3.0 and earlier, the size of this table was a set value and determined by the kernel parameter NPROC. Although you could change this value, you needed to build a new kernel and reboot for the change to take effect. If all the entries in the process table are filled and someone tries to start a new process, it will fail with the error message:

newproc - Process table overflow ( NPROC = x exceeded)

where x is the defined value of NPROC.

One nice thing is that if all but the last slot are taken up, only a process with the UID of root can take it. This prevents a process from creating more and more processes and stopping the system completely. Thus, the root user has one last chance to stop it. OpenServer changed that with the introduction of dynamically configured parameters. Many of the parameters that had to be configured by hand will now grow as the need for them increases.

Just how is a process created? Well, the first thing that happens is that one process uses the fork() system call. Like a fork in the road, it starts off as a single entity and then splits into two. When one process uses the fork() system call, an exact copy of itself is created in memory and the uareas are essentially identical. The value in each CPU register is the same, so both copies of this process are at the exact same place in their code. Each of the variables also has the exact same value. There are two exceptions: the process ID number and the return value of the fork() system call.


Figure 0-2 Creating a new process

Like users and their UID, each process is referred to by its process ID number, or PID. This is a unique number which is actually the process' slot number in the process table. When a fork() system call is made, the value returned by the fork() to the calling process is the PID of the newly created process. Since the new copy didn't actually make the fork() call, the return value in the copy is 0. This is how a process spawns or forks a child process. The process that called the fork() is the parent process of this new process, which is the child process. Note that I intentionally said the parent process and a child process. A process can fork many child processes, but has only one parent.

Almost always, a program will keep track of that return value and will then change its behavior based on that value. One of the most common things is for the child to issue an exec() system call. Although it takes the fork() system call to create the space that will be utilized by the new process, it is the exec() system call that causes this space to be overwritten with the new program.
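
As an illustration, here is a minimal sketch of the fork()-exec() pair described above. The choice of /bin/date is arbitrary; any program could be exec()'ed:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();          /* both parent and child return from here        */

    if (pid == -1) {             /* fork() failed, perhaps the process table is full */
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        /* child: the return value is 0, so overwrite ourselves with a new program */
        execl("/bin/date", "date", (char *)0);
        perror("execl");         /* only reached if the exec() failed              */
        exit(1);
    }

    /* parent: the return value is the PID of the newly created child */
    printf("started child with PID %ld\n", (long)pid);
    wait(NULL);                  /* wait for the child to finish                   */
    return 0;
}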

At the beginning of every executable program is an area simply called the "header". This header describes the contents of the file, that is, how the file is to be interpreted. This could be information telling the system that the file is a 286 or 386 binary, the size of the text and data segments, or where the symbol table is. The symbol table is basically the translation from variable names that we humans understand to their machine language equivalents.

The header contains the locations of the text and data segments. As we talked about before, a segment is a portion of the program. The portion of the program that contains the executable instructions is called the text segment. The portion containing pre-initialized data is the data segment. Pre-initialized data are variables, structures, arrays, etc. that have their values already set, even before the program is run. The process is given descriptors for each of the segments. These descriptors are referred to as region descriptors, as under SCO UNIX segments are more commonly referred to as regions.

In contrast to other operating systems running on Intel-based CPUs, SCO UNIX has only one region (or segment) each for the text, data and stack. The reason I didn't mention the stack region until now is that the stack region is created when the process is created. Since the stack is used to keep track of where the process has been and what it has done, there is no need to create it until the process starts.

Another region that I haven't talked about until now is not always used. This is the shared data region. Shared data is an area of memory that is accessible by more than one process. Do you remember from our discussion on operating system basics when I said that part of the job of the operating system was to keep processes from accessing areas of memory that they weren't supposed to? So, what if they want to? What if they are allowed to? That is where the shared data region comes in.

If one process tells another where the shared memory region is (by giving it a pointer to it), then any such process can access it. The way to keep unwanted processes away is simply not to tell them. In this way, each process that is allowed can use the data, and the region only goes away when the last process using it disappears. How several processes might look in memory is shown in Figure 0-3.


Figure 0-3 Process Regions

If we take a look at Figure 0-3, we see three processes. In all three instances, each process has its own data and stack regions. However, Process A and Process B share a text region. That is, Process A and Process B have called the same executable off the hard disk and are therefore sharing the same instructions. Note that in reality this is more complicated, since the two processes may not be executing the exact same instructions at any given moment.
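
Shared data regions of the kind described above are typically created with the System V shared memory calls. The following is a minimal sketch; the key (0x1234) and size (4096 bytes) are arbitrary values chosen for the example:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    int shmid;
    char *data;

    /* create (or find) a 4096-byte shared data region */
    shmid = shmget((key_t)0x1234, 4096, IPC_CREAT | 0666);
    if (shmid == -1) {
        perror("shmget");
        return 1;
    }

    /* attach the region; it now appears in this process' address space */
    data = (char *)shmat(shmid, NULL, 0);
    if (data == (char *)-1) {
        perror("shmat");
        return 1;
    }

    /* any other process that attaches the same key sees this string */
    strcpy(data, "hello from a shared data region");

    shmdt(data);                     /* detach from the region              */
    shmctl(shmid, IPC_RMID, NULL);   /* remove it when no longer needed     */
    return 0;
}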

Each process has at least a text, data and stack region. In addition, each process is created in the same way. An existing process will (normally) use the fork()-exec() system call pair to create another process. However, this brings up an interesting question, similar to "Who or what created God?” If every process has to be created by another, then who or what created the first process?

When the computer is turned on, it goes through some wild gyrations that we will talk about later. At the end of the boot process, the system loads and executes the /unix binary, the kernel itself. One of the last things the kernel does is to "force" the creation of a single process, which then becomes the great-grandparent of all the other processes.

This first, primordial process is the sched process, also referred to as the swapper, and it has a PID of 0. Its function is to free up memory, by hook or by crook. If another process can spare a few pages, it will take those. If not, sched may swap out an entire process to the hard disk, hence the name. Sched is only context switched in when the amount of free memory is less than what the running process needs.

The first created process is init, with a PID of 1. All other processes can trace their ancestry back to init. Init's job is basically to read the entries in the file /etc/inittab and execute different programs. One of the things it does is to start the getty program on all the login terminals, which eventually provides every user with their shell.

Another system process is vhand, whose PID is 2. This is the paging daemon or page stealer. If free memory on the system gets below a specific low-water mark (the kernel tunable parameter GPGSLO), then vhand is allowed to run at every clock interrupt. (Whether it runs or not will depend on many things, which we will talk about later.) Until the amount of free memory gets above a high-water mark (GPGSHI), vhand will continue to "steal" pages.

Next is bdflush, with a PID of 3. This is the buffer flushing daemon. Its job is to clean out any "dirty" buffers inside the system's buffer cache. A dirty buffer is one containing data that has been written to by a program but hasn't yet been written to the disk. It is the job of bdflush to write this out to the hard disk at regular intervals. These intervals are defined by the kernel tunable parameter BDFLUSHR, which has a default of 30 seconds. The kernel tunable parameter NAUTOUP specifies how long a buffer must have been dirty before it is "cleaned". (Note that by "cleaning", I don't mean that the data is erased. It is simply written to the disk.)

All processes, including the ones I described above, operate in one of two modes: user and system mode. In the section on the CPU we will talk about privilege levels. The Intel 80386 and later CPUs have four privilege levels, 0-3. SCO UNIX uses only the two extremes: 0 and 3. Processes running in user mode run at privilege level 3 within the CPU. Processes running in system mode run at privilege level 0. (More on this in a moment.)


Figure 0-4 Process modes

In user mode, a process is executing instructions from within its own text segment, it references its own data segment and uses its own stack. Processes switch from user mode to kernel mode by making system calls. Once in system mode, the instructions executed are those within the kernel's text segment, the kernel's data segment is used, and a system stack is used within the process' uarea.

Although the process is going through a lot of changes when it makes a system call, keep in mind that this is not a context switch. This is still the same process. It's just that it is operating at a higher privilege.

The Life-Cycle of Processes

From the time a process is fork()'ed into existence until the time it has completed its job and disappears from the process table, it goes through many different states. The state a process is in changes many times during its "life". These changes can occur, for example, when the process makes a system call, when it is someone else's turn to run, when an interrupt occurs or when the process asks for a resource that is currently not available.

A commonly used model shows processes operating in one of eight separate states. However, in the file <sys/proc.h> there are only seven separate states listed. To understand the transitions better, the following eight states are used:


  1. executing in user mode

  2. executing in kernel mode

  3. ready to run

  4. sleeping in main memory

  5. ready-to-run, but swapped out

  6. sleeping, but swapped out

  7. newly created, not ready to run and not sleeping

  8. issued exit system call (zombie)

The states listed here are intended to describe what is happening conceptually and not to indicate what "official" state a process is in. The official states are listed in Table 0.1.

SSLEEP     awaiting an event
SRUN       running
SZOMB      terminated but not waited for
SSTOP      process stopped by a debugger
SIDL       process being created
SONPROC    process is on the processor
SXBRK      process needs more memory

Table 0.1 Process States

In my list of eight states there was no mention of a process actually being on the processor (SONPROC). Processes that are running in kernel mode or running in user mode are both in the SRUN state. Although there is no 1:1 match-up, hopefully you'll see what each state means as we go through the following description.

Figure 0-5 Process States

A newly created process enters the system in state 7. If the process is simply a copy of the original process (a fork, but no exec), it then begins running in the state the original process was in (1 or 2). (Why none of the other states? It has to be running in order to fork a new process.) If an exec() is made, then this process will end up in kernel mode (2). It is possible that the fork()-exec() was done in system mode and the process never goes into state 1. However, this is highly unlikely.

When running, an interrupt may be generated (more often than not, this is the system clock) and the currently running process is pre-empted. This puts it back into state 3, since it is still ready-to-run and in main memory. The only difference is that the process just got kicked off the processor.

When the process makes a system call while in user mode (1), it moves into state 2, where it begins running in kernel mode. Assume at this point that the system call made was to read a file on the hard disk. Since the read is not carried out immediately, the process goes to sleep, waiting on the event that the system has read the disk and the data is ready. It is now in state 4. When the data is ready, the process is woken up. This does not mean it runs immediately, but rather that it is once again ready to run in main memory (3).

If sched discovers that there is not enough memory for all the processes, it may decide to swap out one or more of them. The first choice is those that are sleeping, since they are not ready to run. Such a process now finds itself in state 6, where it is sleeping, but swapped out. It is also possible that there are no processes sleeping, but a lot of processes that are ready to run, so sched needs to swap one of those out instead. Therefore, a process could move from state 3 (ready to run) to state 5 (ready to run, but swapped out). However, as we'll see in a moment, most processes are sleeping.

If a process that was asleep is woken up (perhaps the data is ready), it moves from state 4 (sleeping in main memory) to state 3 (ready to run). However, a process cannot move directly from state 6 (sleeping, but swapped) to state 3 (ready to run). This requires two transitions. Since it is not effective to swap in processes that are not ready to run, sleeping processes will not be swapped in. This is simply because the system has no way of knowing that the process will be awoken soon, since swapping and waking a process are two separate actions. Instead, the system must first make the process ready to run, so it moves from state 6 (sleeping, but swapped) to state 5 (ready to run, but swapped). When the process is swapped back in, it returns to where it was. This can be either user mode (1) or kernel mode (2).

Processes can end their life either by explicitly calling the exit() system call or by having it called for them. The exit() system call releases all the data structures that the process was using. One exception is the slot in the process table; releasing this is the responsibility of the init process. The reason for hanging around is that the slot in the process table is used to hold the exit code of the exiting process. This can be used by the parent process to determine whether the process did what it was supposed to or whether it ran into problems. The process shows that it has terminated by putting itself into state 8, and it becomes a "zombie". Once here, it can never run again, as nothing exists other than the entry in the process table.

This is the reason why you cannot "kill” a zombie process. There is nothing there to kill. In order to kill a process, you need to send it a signal (more on them later). Since there is nothing there to receive or process that signal, trying to kill it makes little sense. The only thing to do is to let the system clean up.

If the exiting process has any children, they are "inherited" by init. One of the values stored in the process structure is the PID of that process' parent process. This value is (logically) referred to as the parent process ID, or PPID. When a process is inherited by init, the value of its PPID is changed to 1 (the PID of init).
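
A small sketch of what this looks like from a program's point of view: the child exits, remains a zombie until the parent calls wait(), and the parent then reads the exit code out of what is left of the child, its process table entry. (The exit value 42 is an arbitrary example.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int status;
    pid_t pid = fork();

    if (pid == 0)
        exit(42);                 /* child terminates immediately with exit code 42 */

    /* Until the parent calls wait(), the child remains a zombie:
       nothing is left of it but its process table entry, which
       holds the exit code. */
    sleep(5);                     /* run "ps" now and the child shows up as defunct */

    wait(&status);                /* collect the exit code and free the table slot  */
    printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}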

A process' state change can cause a context switch in several different cases. One is when the process voluntarily goes to sleep. This can happen when the process needs a resource that is not immediately available. A very common example is your login shell. You type in a command, the command is executed and you are back at a shell prompt. Between the time the command finishes and you input your next command, a very long time could pass, at least two or three seconds.

Rather than constantly checking the keyboard for input, the shell puts itself to sleep waiting on an event. That event is an interrupt from the keyboard to say "Hey! I have input for you.” When a process puts itself to sleep, it sleeps on a particular wait channel (WCHAN). When the event occurs that is associated with that wait channel, every process waiting on that wait channel is woken up.
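
From user space, "sleeping on an event" is usually nothing more than a blocking system call. In the following minimal sketch, the process sleeps inside read() until the terminal driver has input to deliver:

#include <unistd.h>

int main(void)
{
    char buf[128];
    int  n;

    /* the process sleeps inside read() until a line of input arrives
       from the terminal; no CPU time is used while it waits */
    n = read(0, buf, sizeof(buf));

    if (n > 0)
        write(1, buf, n);    /* echo back whatever was typed */
    return 0;
}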

There is probably only one process waiting on input from your keyboard at any given time. However, many processes could be waiting for data from the hard disk. If so, there might be dozens of processes all waiting on the same wait channel. All are awoken when the hard disk is ready. It may be that the hard disk has read only the data for a subset of the processes waiting. Therefore, (if the program is correctly written) the processes check to see if their data is ready for them. If not, they put themselves to sleep on the same wait channel.

When a process puts itself to sleep, it is voluntarily giving up the CPU. It may be that this process had just started its turn when it noticed that it didn't have some resource it needed. Rather than forcing the other process to wait until the first one got its "fair share” of the CPU, that process is being nice and letting some other process have a turn on the CPU.

Because the process is being so nice to let others have a turn, the kernel is going to be nice to the process. One of the things the kernel allows is that a process which puts itself to sleep can set the priority at which it will run when it awakens. Normally, the kernel's process scheduling algorithm calculates the priorities of all the processes. However, in exchange for voluntarily giving up the CPU, the process is allowed to choose its own priority.

Process Scheduling

Scheduling processes is not as easy as finding the one that has been waiting the longest. Older operating systems did this kind of scheduling; it was referred to as "round-robin". The processes could be thought of as sitting in a circle. The scheduling algorithm could then be thought of as a pointer that moved around the circle, getting to each process in turn.

The problem with this type of scheduling is that you may have ten processes all waiting for more memory. Remember from our discussion above that the sched process is responsible for finding memory for processes that don't have enough. Like other processes, sched needs a turn on the processor before it can do its work. If every process had to wait until sched ran, then those that run right before sched might never run.

This is because sched would run and free up memory, but by the time the scheduler got around to the processes just before sched, all the free memory would be taken. Those processes have to wait again. Next, sched runs again and frees memory, which is then taken up by the other processes.

Instead, SCO UNIX uses a scheduling method that takes many things into consideration, not just the actual priority of the process itself or how long it has been since it had a turn on the CPU. The priority is a calculated value, based on several factors. The first is what priority the process already has.

Another factor is recent CPU usage. The longer a process runs, the more it has used the CPU. Therefore, to make it easier for faster jobs to get in and out quickly, the longer a process runs, the lower its priority becomes. For example, a process that is sorting a large file needs a longer time to complete than the date command. If the process executing the date command is given a higher priority due to the fact that it is relatively fast, it leaves the process table quickly, so there are fewer processes to deal with and every process gets a turn more often.

Keep in mind that the system has no way of knowing in advance just how long the date command will run. Therefore, the system can't simply give it a higher priority. However, because the sort process is given a lower priority after it has been running a while, the date command appears to run faster. I say "appears" since the date command needs the same number of CPU cycles to complete. It just gets them sooner.

SCO UNIX also allows you to be nice to your fellow processes. If you feel that your work is not as important as someone else's, you might want to consider being nice to them. This is done with the nice command, the syntax of which is:

nice <nice_value> <command>

For example, if you wanted to run the date command with a lower priority, it could be run like this:

nice -10 date

This has the effect of decreasing the start priority of the date command by 10. Note that only root can increase a process' priority, that is, use a negative nice value. The nice value only affects running processes, but child processes inherit the nice value of their parent. By default, processes started by users have a nice value of 20. This can be changed within sysadmsh in ODT 3.0 and with the usermod command in OpenServer. Therefore, using the nice command in the example above increases the nice value to 30. OpenServer also provides a tool (renice) which allows you to change the priority of a running process. See the appropriate man-pages for more details.

The formula used to determine a process' priority is:

priority = (recent_cpu_usage / 2) + nice_value + 40

Note that the value 40 is hard-coded into the calculation because the highest priority a user process can have is 40. Those of you who have been paying attention might have noticed something odd: if you add the recent CPU usage to the nice value and then add 40, then the longer the process is on the CPU, the higher the priority.

Well, sort of.

The numeric value calculated for the priority is the opposite of what we normally think of as priority. A better way of thinking about it is like the pull-down numbered tickets you get at the ice cream store: the lower the number, the sooner you get served. So it is for processes as well.

Because of the way the priority is calculated, process scheduling in SCO UNIX has a couple of interesting properties. First, as time passes, the priority of a process running in user mode will decrease. The decrease in priority means that it will be less likely to run the next time there is a context switch. Conversely, if a process hasn't run for a long time, its recent CPU usage is lower. Since it hasn't used the CPU much recently, its priority will increase, and it is more likely to run than those that just got their turn (assuming all other factors are the same). Figure 0-6 shows three processes as they run and their priorities, to give you an idea of how priorities change over time.


Time     Process A               Process B               Process C
(secs)   priority  CPU           priority  CPU           priority  CPU

0-1      60        0,1,...,80    60        0             60        0
1-2      80        40            60        0,1,...,80    60        0
2-3      70        20            80        40            60        0,1,...,80
3-4      65        10,11,...,80  70        20            80        40
4-5      80        40            65        10,11,...,80  70        20
5-6      70        20            80        40            65        10,11,...,80

Figure 0-6 Changes in Process Priority

This table makes several large assumptions, which are probably not true for a "real" system. The first is that these are the only three processes running on your system. Since even in maintenance mode you have at least four running, this is unrealistic. The second assumption is that every process starts with the hard-coded base of 40 and a nice value of 20 (giving an initial priority of 60), and that nothing has changed the nice value. We also assume that there is nothing else on the system that would change the flow of things, such as an interrupt from a hardware device.

Another assumption is that each process gets to run for a full second. The maximum time a process gets the CPU is defined by the MAXSLICE kernel tunable, which by default is 1 second. The number of times the clock interrupts per second, and therefore the number of times the priority is recalculated, is defined by the HZ system variable. This is defined by default to be 100, or 100 times a second. However, we are assuming that the priorities are only recalculated once a second instead of 100 times.

If we look at Process A, which is the first to run, we see that between second 0 and 1 its CPU usage went from 0 to the maximum of 80. When the clock tick occurred (the clock generated an interrupt), the priorities of all the processes were recalculated. Process A's CPU usage of 80 is first decayed (halved) to 40; half of that is 20, and adding the nice value of 20 and the hard-coded 40 gives 80 as the new priority for Process A. Since Processes B and C haven't run yet, their CPU time is 0, so we only add their nice value of 20 to the static value of 40, which keeps them at 60.

Since Process A is now at priority 80, both Process B and Process C have higher priorities. So now let's say B runs. Between seconds 1 and 2, Process B changes just like Process A did between seconds 0 and 1.

When the clock tick occurs, Process B's priority is calculated just like Process A's was. Process B is now at 80. However, Process A has had its priority recalculated as well; it is now at 70.

All this time, Process C was not on the CPU, so its priority hasn't changed from the original 60. Since this is the lowest value, it has the highest priority and now gets a turn on the CPU. When its turn is finished (at the end of second 2) and priorities are recalculated, Process C is now at 80, Process B is at 70, but Process A has been recalculated to be 65. Since this is the lowest, Process A gets a chance to run again.
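
The arithmetic in this example can be reproduced with a short C program. This is just an illustration of the formula given earlier (nice value 20, hard-coded base 40, CPU usage halved at each recalculation); it is not the kernel's actual scheduling code:

#include <stdio.h>

#define NICE_VALUE 20
#define BASE       40

/* priority = (recent_cpu_usage / 2) + nice_value + 40 */
static int priority(int cpu)
{
    return cpu / 2 + NICE_VALUE + BASE;
}

int main(void)
{
    int cpu = 80;   /* CPU usage after one full second on the processor */
    int sec;

    for (sec = 1; sec <= 3; sec++) {
        cpu = cpu / 2;      /* usage is decayed (halved) at each recalculation */
        printf("after %d second(s) off the CPU: usage %2d, priority %d\n",
               sec, cpu, priority(cpu));
    }
    /* prints priorities 80, 70 and 65 - the same values as in Figure 0-6 */
    return 0;
}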

In reality, things are not that simple. There are dozens of processes competing for the CPU. There are different start-up priorities, different nice values and different demands on the system. These demands include requests for services from peripherals such as hard disks. Responding to these requests can almost instantly change which process is running.

Interestingly enough, sudden changes in who is on the CPU are, in part, due to the priority. However, all this time I have intentionally avoided mentioning the fact that regardless of what a process' priority is, it will not run unless it is in the run queue. This is the state SRUN. A process being in the run queue does not mean that it is running, just that it can run if it has the highest priority.

Remember when I said that user processes can never have a priority value lower than 40? Well, in case you hadn't guessed (or didn't already know), system processes like sched, vhand and init almost exclusively operate with priority values less than 40. Well, if they always have a lower priority value, why don't they always get to run? Simple. They aren't always in the run queue.

Interrupts, Exceptions and Traps

Normally, processes like sched, vhand and init, as well as most user processes are sleeping waiting on some event. When that event happens, these processes are called into action. Remember, it is the responsibility of the sched process to free up memory when a process runs short of it. So, it is not until memory is needed that sched starts up. How is it that sched knows?

In chapter 1, we talked about virtual memory and I mentioned page faults. When a process makes reference to a place in its virtual memory space that does not yet exist in physical memory, a page fault occurs.

Faults belong to a group of system events called exceptions. An exception is simply something that occurs outside of what is normally expected. Faults (exceptions) can occur either before or during the execution of an instruction.

For example, if an instruction needs to be read that is not yet in memory, the exception (page fault) occurs before the instruction starts being executed. On the other hand, if the instruction is supposed to read data from a virtual memory location that isn't in physical memory, the exception occurs during the execution of the instruction. In cases like these, once the missing memory location is loaded into physical memory, the CPU can start the instruction.

Traps are exceptions that occur after an instruction has been executed. For example, attempting to divide by zero will generate an exception. However, in this case it doesn't make sense to restart the instruction, since every time we try to run that instruction it will still come up with a divide-by-zero exception. That is, all memory references are read in before we start to execute the instruction.

It is also possible for processes to generate exceptions intentionally. These programmed exceptions are called software interrupts.

When any one of these exceptions occurs, the system must react to it. In order to react, the system will usually switch to another process to deal with the exception, which means a context switch. In our discussion of process scheduling, I mentioned that at every clock tick the priority of every process is recalculated. In order to make those calculations, something other than those processes has to run.

In both ODT 3.0 and OpenServer, the system timer (or clock) is programmed to generate a hardware interrupt 100 times a second. (This is defined by the HZ system parameter.) The interrupt is accomplished by sending a signal to a special chip on the motherboard called an interrupt controller. (We go into more detail about these in the chapter on hardware.) The interrupt controller then sends an interrupt to the CPU. When the CPU gets this signal, it knows that a clock tick has occurred and it jumps to a special part of the kernel that handles the clock interrupt. Scheduling priorities are also recalculated within this same section of code.

Because the system might be doing something more important when the clock generates an interrupt, there is a way to turn interrupts off; in other words, there is a way to mask out interrupts. Interrupts that can be masked out are called maskable interrupts. An example of something more important than the clock would be accepting input from the keyboard. This is why clock ticks are lost on systems with a lot of users inputting a lot of data, and as a result the system clock appears to slow down over time.

Sometimes events occur on the system that you want to know about no matter what. Imagine what would happen if memory were bad. If the system were in the middle of writing to the hard disk when it encountered the bad memory, the results could be disastrous. If the system recognizes the bad memory, the hardware generates an interrupt to alert the CPU. If the CPU had been told to ignore all hardware interrupts, it would ignore this one, too. Instead, the hardware has the ability to generate an interrupt that cannot be ignored or masked out. This is called a non-maskable interrupt. Non-maskable interrupts are generically referred to as NMIs.

When an interrupt or an exception occurs, it must be dealt with to ensure the integrity of the system. How the system reacts depends on whether it was an exception or an interrupt. In addition, what is done when the hard disk generates an interrupt is going to be different from what is done when the clock generates one.

Within the kernel is the Interrupt Descriptor Table (IDT). This is a list of descriptors (pointers) that point to the functions that handle each particular interrupt or exception. These functions are called the interrupt or exception handlers. When an interrupt or exception occurs, it has a particular value called an identifier or vector. Table 0.2 contains a list of the defined interrupt vectors. For more information see <sys/trap.h>.


Identifier   Description
0            Divide error
1            Debug exception
2            Nonmaskable interrupt
3            Breakpoint
4            Overflow
5            Bounds check
6            Invalid opcode
7            Co-processor not available
8            Double fault
9            (reserved)
10           Invalid TSS
11           Segment not present
12           Stack exception
13           General protection fault
14           Page fault
15           (reserved)
16           Co-processor error
17           Alignment error (80486)
18-31        (reserved)
32-255       External (HW) interrupts

Table 0.2 Interrupt Vectors

The reserved identifiers are currently not used by the CPU, but are reserved for possible future use. Interrupts that come from one of the interrupt controllers are assigned by the kernel to identifiers 64 through 79. Identifiers 32-63 and 80-255 are not currently used by SCO UNIX. These identifiers are often referred to as vectors, and the interrupt descriptor table (IDT) is often referred to as the interrupt vector table.

What these numbers really are is indices into the IDT. When an interrupt, exception or trap occurs, the system knows which number corresponds to that event. It then uses that number as an index into the IDT, which in turn points to the appropriate area of memory for handling the event. What this looks like graphically is shown in Figure 0-7.

It is possible for devices to share interrupts. That is, multiple devices on the system may be configured to use the same interrupt. In fact, there are certain kinds of computers that are designed to allow devices to share interrupts (we'll talk about them in the hardware section). If the interrupt number is an offset into a table of pointers to interrupt routines, how does the kernel know which one to call?

Well, as it turns out, there are two IDTs: one for shared interrupts and one for non-shared interrupts. During a kernel relink (more on that later), the kernel determines whether the interrupt is shared or not. If it is, it places the pointer to that interrupt routine into the shared IDT. When an interrupt is generated, the interrupt routine for each of these devices is called. It is up to the interrupt routine to check whether the associated device really generated an interrupt or not. The order in which they are called is the order in which they are linked in.

When an exception happens in user mode, the process passes through something called a trap gate. At this point, the CPU no longer uses the process' user stack, but rather the system stack within that process' uarea (each uarea has a portion set aside for the system stack). The process is now operating in system (kernel) mode, that is, at the highest privilege level, 0.

Before the actual exception can be handled, the system needs to ensure that the process can return to the place in memory where it was when the exception occurred. This is done by a low-level interrupt handler. Part of what it does is push (copy) all of the general-purpose registers onto the process' system stack. This makes them available again when the process goes back to using the user stack.

The low level interrupt handler also determines whether the exception occurred in user mode or system mode. If the process was already in system mode when the exception occurred, there is no need to push the registers onto the process' system stack, as this is the stack that the process is already using.


Figure 0-7 First-Level Interrupt Handler

The kernel treats interrupts very similarly to the way it treats exceptions. All of the general purpose registers are pushed onto the system stack and a common interrupt handler is called. The current interrupt priority is saved and the new priority is loaded. This prevents interrupts at lower priority levels from interrupting the kernel as it is handling this interrupt. Then the real interrupt handler is called.

Since an exception is not fatal, the process will return from whence it came. It is possible that immediately upon return from system mode a context switch occurs. This might be the result of an exception with a lower priority. Since it could not interrupt the process while it was in kernel mode, it had to wait until the process returned to user mode. Since the exception has a higher priority than the process when it is in user mode, a context switch occurs immediately after the process returns to user mode.

If another exception occurs while the process is in system mode, this is not a normal occurrence. Exceptions are the result of software events; even a page fault can be considered a software event. Since the entire kernel is in memory all the time, a page fault should not happen while in kernel mode. When a page fault does happen in kernel mode, the kernel panics. There are special routines built into the kernel to deal with the panic and to help the system shut down as gracefully as possible. Should something else happen that causes another exception while the system is trying to panic, a double panic occurs.

This may sound confusing, as I just said that a context switch could occur as the result of another exception. What that means is that the exception occurred in user mode, so there needs to be a jump to system mode. It does not mean that the process continues in system mode until it is finished. It may (depending on what it is doing) be context switched out. If another process runs before the first one gets its turn on the CPU again, that other process may generate the exception.

There are a couple of cases where exceptions in system mode do not cause panics. The first is when you are debugging programs. In order to stop the flow of the program, exceptions are raised and you are brought into special routines within the debugger. Since exceptions are expected in such cases, it doesn't make sense to have the kernel panic.

The other case is when page faults occur as data is being copied from kernel memory space into user space. As I mentioned above, the kernel is completely in memory. Therefore, the data will have to be in memory to copy from the kernel space. However, it is possible that the area that the data needs to be copied to is not in physical memory. Therefore a page fault exception occurs, but this should not cause the system to panic.

Unlike exceptions, it is possible for another interrupt to occur while the kernel is handling the first one (and is therefore in system mode). If the second interrupt has a higher priority than the first, a context switch will occur and the new interrupt will be handled. If the second interrupt has the same or a lower priority, then the kernel will "put it on hold". Such interrupts are not ignored, but rather saved to be dealt with later.

Signals

Signals are a way of sending simple messages to processes. Most of these messages are already defined and can be found in <sys/signal.h>. However, signals can only be processed when the process is in user mode. If a signal has been sent to a process that is in kernel mode, it is dealt with immediately upon returning to user mode.

Many signals have the ability to immediately terminate a process. However, most of these signals can be either ignored or dealt with by the process itself. If not, the kernel will take the default action specified for that signal. You can send signals to processes yourself by means of the kill command, as well as with the delete key and Ctrl-/. However, you can only send signals to processes that you own. If you are root, you can send signals to any process.
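
Here is a minimal sketch of a process dealing with a signal itself rather than accepting the default action. It catches SIGINT and keeps running; to terminate it you would have to send a signal it does not catch, such as SIGTERM or SIGKILL:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

/* called when SIGINT arrives instead of the default action (termination) */
static void catch_int(int signo)
{
    /* calling printf() from a handler is not strictly safe,
       but it keeps this illustration short */
    printf("caught signal %d, ignoring it\n", signo);
    signal(SIGINT, catch_int);   /* System V signal() resets the handler, so re-install it */
}

int main(void)
{
    signal(SIGINT, catch_int);   /* install our own handler */

    for (;;)
        pause();                 /* sleep until a signal arrives */
}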

It's possible that the process you want to send the signal to is sleeping. If that process is sleeping at an interruptable priority, then the process will awaken to handle the signal. Processes sleeping at a priority of 25 or less are not interruptable. The priority of 25 is called PZERO.

The kernel keeps track of pending signals in the p_sig entry of each process' process structure. This is a 32-bit value, where each bit represents a single signal. Since there is only one bit per signal, there can only be one signal of each type pending. If different kinds of signals are pending, the kernel has no way of determining which came in when. It will therefore process the signals starting at the lowest numbered signal and moving up.
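
To picture how a single 32-bit value can record which signals are pending and why they are processed lowest number first, here is a small sketch. The bit manipulation mirrors the idea behind p_sig, but it is not the kernel's actual code:

#include <stdio.h>

int main(void)
{
    unsigned long pending = 0;   /* one bit per signal, like p_sig */
    int signo;

    /* mark signals 2 (SIGINT) and 15 (SIGTERM) as pending */
    pending |= 1UL << (2 - 1);
    pending |= 1UL << (15 - 1);

    /* work from the lowest numbered signal upward */
    for (signo = 1; signo <= 32; signo++) {
        if (pending & (1UL << (signo - 1))) {
            printf("signal %d is pending\n", signo);
            pending &= ~(1UL << (signo - 1));   /* clear the bit once handled */
        }
    }
    return 0;
}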

System Calls

If you are a programmer, you hopefully know what a system call is and have used them many times in your programs. If you are not a programmer, you may not know what they are, but you still use them thousands of times a day. All "low-level” operations on the system are handled by system calls. These include such things as reading from the disk or printing a message on the screen. System calls are the user's bridge between user space and kernel space. This also means that it is the bridge between a user application and the system hardware.

Collections of system calls are often combined into more complex tasks and put into libraries. When using one of the functions defined in a library, you call a library function or make a library call. Even when the library routine is intended to access the hardware, it will make a system call long before the hardware is touched.
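
For example, both lines in the following sketch put text on the screen. The first is a direct system call; the second is a library call that does its formatting in user space and then uses the same write() system call to reach the screen:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* system call: hand the bytes straight to the kernel */
    write(1, "hello via write()\n", 18);

    /* library call: printf() formats the text in user space, then
       makes the write() system call to get it onto the screen */
    printf("hello via printf()\n");
    return 0;
}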

Each system call has its own unique identifying number that can be found in <sys.s>. The kernel uses this number as an index into a table of system call entry points. These are pointers to where the system calls reside in memory along with the number of arguments that should be passed to them.

When a process makes a system call, the behavior is similar to that of interrupts and exceptions. Entry into kernel space is made through a call gate. There is a single call gate, which serves as a guardian to the sacred area of kernel space. As with exception handling, the general-purpose registers and the number of the system call are pushed onto the stack. Next, the system call handler is invoked, which calls the routine within the kernel that will do the actual work.

After entering the call gate, the kernel system call dispatcher "validates" the system call and hands control over to the kernel code that will actually perform the requested function. Although there are hundreds of library calls, each of these will call one or more system calls. In total, there are about 150 system calls, all of which have to pass through this one call gate. This ensures that user code moves up to the higher privilege level at a specific location within the kernel (a specific address). Therefore, uniform controls can be applied to ensure that a process is not doing something it shouldn't.

When the system call is complete, the system call dispatcher returns the result of the system call and status codes (if applicable). As with interrupts and exceptions, the system checks to see if a context switch should occur upon the return to user mode. If so, a context switch takes place. This is possible in situations where one process made a system call and an interrupt occurred while the process was in system mode, and the kernel then issued a wakeup() to all processes waiting for data from the hard disk.

When the interrupt completes, the kernel may go back to the process that made the system call. But, then again, there may be another one with a higher priority.

Paging and Swapping

In chapter 1, we talked about how the operating system uses the capabilities of the CPU to make it appear as if you have more memory than you really do. This is the concept of virtual memory. In chapter 12, we'll go into details about how this is accomplished, that is, how the operating system and CPU work together to keep up this illusion. However, in order to make this section on the kernel complete, we need to talk about it a little from a software perspective.

One of the basic concepts in the SCO UNIX implementation of virtual memory is the concept of a page. A page is a 4Kb area of memory and is the basic unit of memory that both the kernel and the CPU deal with. Although both can access individual bytes (or even bits), the amount of memory that is managed is usually in pages.

If you are reading a book, you do not need to have all the pages spread out on a table for you to work effectively. Just the one you are currently using. I remember many times in college when I had the entire table top covered with open books, including my notebook. As I was studying I would read a little from one book, take notes on what I read and if I needed more details on that subject, I would either go to a different page or a completely different book.

Virtual memory in SCO UNIX is very much like that. Just as I only needed to have open the pages I was currently working with, a process only needs to have in memory the pages it is working with. Like me, if the process needs a page that is not currently available (not in physical memory), it needs to go get it (usually from the hard disk).

If another student came along and wanted to use that table, there might be enough space for him or her to spread out books as well. If not, I would have to close some of my books (maybe putting book marks at the pages I was using). If another student came along or the table was fairly small, I might have to put some of the books away. SCO UNIX does that as well. If the text books represent the unchanging text portion of the program and the notebook represents the changing data, things might be a little clearer.

It is the responsibility of both the kernel and the CPU to ensure that I don't end up reading someone else's textbook or writing in someone else's notebook. That is, they ensure that one process does not have access to the memory locations of another process (a discussion of cell replication would look silly in my calculus notebook). The CPU also helps the kernel by recognizing when a process tries to access a page that is not yet in memory. It is the kernel's job to figure out which process it was, what page it was, and to load the appropriate page.

It is also the kernel's responsibility to ensure that no one process hogs all available memory. Just like the librarian telling me to make some space on the table. If there is only one process running (not very likely) then there may be enough memory to keep the entire process loaded as it runs. More likely is the case where there are dozens of processes in memory and each gets a small part of the total memory. (Note, depending on how much memory you have, it is still possible that the entire program is in memory.)

Processes generally adhere to the principle of locality. This means that over a short period of time, processes will access the same portions of their code over and over again. The kernel could establish a working set of pages for each process. These are the pages that have been accessed within the last n memory references. If n is small, then processes may not have enough pages in memory to do their job. Instead of letting processes work, the kernel is busy spending all of its time reading in the needed pages. By the time the system has finished reading in the needed pages, it is some other process' turn. Now that process needs more pages, so the kernel needs to read them in. This is called thrashing. Large values of n may lead to cases where there is not enough memory for all the processes to run.

However, SCO UNIX does not use the working set model, but instead uses the concept of a window. When the amount of available (free) memory drops below a certain point (which is configurable), the vhand process is woken up and put on the run queue. Using this window does not mean that thrashing cannot occur on an SCO system. When memory gets so full of user processes that the system spends more time freeing up memory and swapping processes in and out than doing real work, even SCO will thrash.


Figure 0-8 sched and vhand working together

As I mentioned above, the number of free pages is checked at every clock tick. If this gets below the value set by the kernel tunable GPGSLO, vhand, the "page stealer” process, is woken up. When it runs, vhand searches for pages that have not been recently accessed by a process.

If a page has not been referenced within a pre-determined time, it is "freed" by vhand and added to a list of free pages. In order to keep one area from having pages stolen more often than others, vhand remembers where it was and starts with a different area the next time it runs. When vhand has freed up enough pages that there are more than GPGSHI pages available, it puts itself back to sleep. This is the reason why vhand does not always run, even though it has a high priority.
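
The low-water/high-water behavior can be pictured with the toy simulation below. It only demonstrates the hysteresis between GPGSLO and GPGSHI; the numbers are made up for the example and it is not the real paging code:

#include <stdio.h>

#define GPGSLO  25   /* wake the page stealer below this many free pages   */
#define GPGSHI  40   /* let it go back to sleep above this many free pages */

int main(void)
{
    int free_pages = 100;
    int vhand_awake = 0;
    int tick;

    for (tick = 0; tick < 30; tick++) {
        free_pages -= 5;                 /* processes consume memory each tick      */

        if (free_pages < GPGSLO)         /* below the low-water mark: wake vhand    */
            vhand_awake = 1;

        if (vhand_awake) {
            free_pages += 12;            /* vhand "steals" (frees) some pages       */
            if (free_pages > GPGSHI)     /* above the high-water mark: back to sleep */
                vhand_awake = 0;
        }

        printf("tick %2d: %3d free pages, vhand %s\n",
               tick, free_pages, vhand_awake ? "awake" : "asleep");
    }
    return 0;
}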

If a page that vhand steals is part of the executable portion of a process (the text), it can easily be freed, since the process can always get it back from the hard disk. However, if the page is part of the process' data, this may be the only place that data exists. The simplest solution would be to say that data pages cannot be stolen. That is a problem, though, since you may have a program that needs more data than there is physical memory. Since you cannot keep it all in memory at the same time, you have to figure out a better solution.

The solution is to use a portion of hard disk as a kind of temporary storage for data pages that are not currently needed. This area of the hard disk is called the swap space or swap device and is a separate area used solely for the purpose of holding data pages from programs. Copying pages to the swap space is the responsibility of the system process swapper.

The size and location of the swap device are normally set when the system is first installed. Afterwards, more swap space can be added if needed. (This is done with the swap command.) Occupied and free areas of the swap device are managed by a map, where a zero value says the page on the swap device is free and a non-zero value is the number of processes sharing that page.

If the system is short of memory and pages need to be swapped out, the swapper process needs to determine just which processes can be swapped out. It first looks for processes in either the SSLEEP (process is sleeping) or SSTOP (process is stopped by a debugger) state. If there is only one, the choice is easy; if not, the swapper needs to calculate which one has the lowest priority.

Often there are no processes that fit these criteria. Although this usually happens only on systems that are heavily loaded, the system needs to take into account cases where there are no processes in either SSLEEP or SSTOP. Therefore, the swapper needs to look elsewhere to find a process to swap out. It then considers processes in the state SRUN (ready-to-run) or SXBRK (needs more memory) and tries to find the process with the lowest priority in these states.

If the swapper is trying to make room for a process that is already swapped out, then that process must have been on the swap device for at least two seconds. The two second threshold is to keep the system from spending all of its time thrashing and not getting any work done.

If we find a suitable process to swap out, sched locks the process in memory. Hmmm. Why lock a process into memory that we are just going to swap out? Remember that sched is just another process. It could be context switched out if an interrupt (or something else) occurs. What happens if sched gets context switched out, but vhand runs before the swapper gets a chance to run again? It could happen that vhand steals pages from the process sched had chosen. When sched gets switched back in, the pages it wanted to swap out may already be gone. Either sched has to figure out all over again which pages are to be swapped out, or vhand ends up stealing pages that sched just brought in. Instead, sched locks the pages in memory.

Next, space is allocated on the swap device for the process' uarea. If there is no space available, the swapper generates an error indicating swap space is running low and will try to swap out other parts of the process. If it can't, the system will panic with an "out of swap” message. Remember that a panic is when something happens that the kernel does not know how to deal with. Since the kernel has a process that needs to run and it cannot make more memory available by swapping out a process, the kernel doesn't know what to do. Therefore it panics.

All regions of the process are checked. If a region is locked into memory, it will not be swapped. If a region is for private use, such as data or stack, all of the region will be swapped out. If the region is shared, only the unreferenced pages will be swapped. That is, only those pages that have not been referenced within a certain amount of time will be swapped.

Note that swapping may not require a physical write to the swap device. This is due to the fact that once an area is allocated for a process, it remains so allocated until the process terminates. Therefore, it can happen that a page is swapped back in to be read, but never written to. If that page later needs to be swapped out again, there is no need to swap it out as it already exists in the correct state on the hard disk.

Eventually the process that got swapped out will get a turn on the CPU and will need to be swapped back in. Before it can be swapped back in, the swapper needs to ensure that there is enough memory for at least the uarea and a set of structures called page tables. Page tables are an integral part of the virtual memory scheme and point to the actual pages in memory. We talk more about this when we talk about the CPU in the hardware section.

If there isn't enough room, sched looks for the process that has been waiting for memory the longest (SXBRK) and tries to allocate memory for that process. If there are none in this state, sched goes through the same procedures as it does for processes that are already in main memory. It's possible for sched to make memory available to other processes in addition to the original one. Up to ten processes in the SXBRK state can have memory allocated and up to five can be swapped back in during a pass of the swapper.

Often you don't want to swap in certain processes. For example, it doesn't make sense to swap in a process that is sleeping on some event. Since that event hasn't occurred yet, swapping it in means that it will just need to go right back to sleep. Therefore, only processes in the SRUN state are eligible to be swapped back in. That is, only the processes that are runnable are swapped back in. In addition, these processes must have a priority less than 60. If a process has a higher priority value (and therefore a lower priority), the odds are that there are other processes that will be run first. Since you are already swapping, it is more than likely that this process will be swapped out again.

When a process is being swapped back in, the pages that are not part of the uarea or page tables are left on the swap device. It is not until the process actually needs them that they will be swapped back in. This will happen in pretty short order, since anything the process wants to do will cause a page fault and cause pages to be swapped back in.

Keep in mind that accessing the hard disk is hundreds of times slower than accessing memory. Although swapping does allow you to have more programs in memory than the physical RAM will allow, using it does slow down the system. If possible, it is a good idea to keep from swapping by adding more RAM.

Processes in Action

If you are like me, knowing how things work in theory is not enough. You want to see how things are working on your system. SCO provides several tools for you to watch what is happening. The first is perhaps the only one that the majority of users have ever seen. This is the ps command, which gives you the process status of particular processes. Depending on your security level, normal users can even look at every process on the system. (They must have the mem subsystem privilege.)

Although users can look at processes using the ps command, they cannot look at the insides of the processes themselves. This is because the ps command is simply reading the process table. This contains only the control and data structures necessary to administer and manage the process and not the process itself. Despite this, using ps can not only show you a lot about what your system is doing, but it can give you insights into how the system works. Because much of what I will talk about is documented in the ps man-page, I want to suggest in advance that you take a look there for more details.

If you start ps from the command line with no options, the default behavior is to show the processes running on your current terminal, something like this:

  PID TTY        TIME CMD
  608 ttyp0  00:00:02 ksh
 1147 ttyp0  00:00:00 ps


This shows us the process ID (PID), the terminal that the process is running on (TTY), the total amount of time the process has had on the CPU (TIME) and the command that was run (CMD). If the process had already issued an exit(), but hadn't finished it yet by the time the ps read the process table, we would probably see <defunct> in this column.

Although this is useful in many circumstances, it doesn't say much about these processes. Let's see what the long output looks like. This is run as ps -l.


 F S   UID   PID  PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S     0   608   607  1  73 24 fb11b9e8 140 fb11b9e8 ttyp0  00:00:02 ksh
20 O     0  1172   608 14  42 24 fb11c0a0 184        - ttyp0  00:00:00 ps

Now this output looks a little better. At least there are more columns, so maybe it is more interesting. The columns PID, TTY, TIME and CMD are the same as in the previous output.

The first column (F) contains flags, in octal, that tell us something about the state of the process. For example, a 01 here would indicate a system process that is always in memory, such as vhand or sched. The 20 in both cases here means that the process is in main memory.

The S column is one of the "official" states that the process can be in. These states are defined in <sys/proc.h> and can be one of the following values:


  • O

Process is currently on the processor (SONPROC)

  • S

Sleeping (SSLEEP)

  • R

Ready to run (SRUN)

  • I

Idle, being created (SIDL)

  • Z

Zombie state (SZOMB)

  • T

Process being traced, used by debuggers (SSTOP)

  • B

Process is waiting for more memory (SXBRK).

Here we see that the ksh process (line 1) is sleeping. Although we can't tell from the output, I know that the event it is waiting on is the completion of the ps command. One indication I have is the PID and PPID columns. These are, respectively, the Process ID and Parent Process ID. Notice that the PPID of the ps process is the same as the PID of the ksh process. This is because I started the ps command from the ksh command line and the ksh had to do a fork()-exec() to start up the ps. This makes ps a child process of the ksh. Since I didn't start the ps in the background, I know the ksh is waiting on the completion of the ps. (More on this in a moment.)
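The fork()-exec() sequence the shell goes through, and the wait that puts it to sleep until the child finishes, can be shown with a few lines of C. This is a bare-bones illustration of the mechanism, not how ksh itself is written:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int status;
    pid_t pid = fork();                 /* the shell creates a child process */

    if (pid == 0) {                     /* child: overlay itself with ps     */
        execlp("ps", "ps", "-l", (char *)0);
        perror("execlp");               /* only reached if the exec failed   */
        exit(1);
    }

    /* parent: sleeps here until the child exits, just as ksh sleeps
     * waiting on the completion of ps when it is not run in the background */
    waitpid(pid, &status, 0);
    printf("child %ld finished\n", (long)pid);
    return 0;
}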

We see that the ps process is on the processor (state O-SONPROC). As a matter of fact, I have never run a ps command where ps was not on the processor. Why? Well, the only way for ps to read the process table is to be running and the only way for a process to be running is to be on the processor.

Since I just happened to be running these processes as root, the UID column, which shows the user ID of the process' owner, contains a 0. The owner is almost always the user that started the process. However, you can change the owner of a process by using the setuid() or the seteuid() system call.

The C column is an estimate of recent CPU usage. Using this value, combined with the process' priority (the PRI column) and the nice value (the NI column), sched calculates the scheduling priority of this process. The ADDR column is the virtual address of that process' entry in the process table. The SZ column is the size (in kilobytes) of the swappable portion of the process' data and stack.
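The exact arithmetic is internal to the kernel, but traditional System V schedulers compute something along these lines. Treat the sketch below purely as an illustration of how recent CPU usage (C) and the nice value (NI) push the priority value up (making the process less favored); the PUSER constant and the weights are assumptions, not values taken from the SCO kernel.

#include <stdio.h>

#define PUSER 50    /* assumed base priority for user processes */

/* Illustrative, System V flavored recalculation: a larger result
 * means a numerically higher PRI and therefore a lower priority. */
static int recalc_priority(int recent_cpu, int nice)
{
    return PUSER + recent_cpu / 2 + nice;
}

int main(void)
{
    printf("idle process:  PRI %d\n", recalc_priority(0, 20));
    printf("busy process:  PRI %d\n", recalc_priority(40, 20));
    printf("niced process: PRI %d\n", recalc_priority(40, 39));
    return 0;
}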

The WCHAN column is the Wait CHANnel for the process. This is the event that the process is waiting on. Since ps is currently running, it is not waiting on any event. Therefore, there is a dash in this column. The WCHAN that the ksh is waiting on is fb11b9e8. Although I have nothing here to prove it, I know that this event is the completion of the ps.

Although I can't prove this, I can make some inferences. First, let's look at the ps output again, this time let's start ps in the background. This gives us the output:


 F S   UID   PID  PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S     0   608   607  3  75 24 fb11b9e8 132 f01ebf4c ttyp0  00:00:02 ksh
20 O     0  1221   608 20  37 28 fb11cb60 184        - ttyp0  00:00:00 ps

Next, let's make use of the ability of ps to display the status of processes running on a specific terminal. We can then run the ps command from another terminal and look at what's happening on ttyp0. Running the command

ps -lt ttyp0

we get something like this:


 F S   UID   PID  PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S     0   608  1295  3  75 24 fb11b9e8 132 f01ebf4c ttyp0  00:00:02 ksh

In the first example, ksh did a fork-exec, but since we put the ps in the background, ksh returned to the prompt and didn't wait for the ps to complete. Instead it was waiting for more input from the keyboard. In the second example, ksh did nothing. I ran the ps from another terminal and it showed me only the ksh. Looking back at that screen, I see that it is sitting there, waiting for input from the keyboard. Notice that in both cases the WCHAN is the same. Both are waiting for the same event: input from the keyboard. However, in the very first example we did not put the command in the background, so the WCHAN was the completion of the ps.


Despite its ominous name, another useful tool is crash. Not only can you look at processes, you can also look at many other things such as file tables, symbol tables, region tables, both the global and local descriptor tables, and even the uarea of a process. Because crash needs read access to both /dev/mem and /unix, it can only be run by root. If a normal user were allowed to read /dev/mem, they could read another user's process.

In the chapter on system monitoring, we'll talk about crash and all the things it can tell us about our system.

Devices and Device Nodes

In UNIX, nothing works without devices. I mean NOTHING. Getting input from a keyboard or displaying it on your screen both require devices. Accessing data from the hard disk or printing a report also requires devices. In an operating system like DOS, all of the input and output functions are almost entirely hidden from you. Drivers for these devices must exist in order to be able to use them; however, they are hidden behind the cloak of the operating system.

Although they access the same physical hardware, device drivers under UNIX are more complex than their DOS cousins. Although adding new drivers is easier under DOS, SCO provides more flexibility in modifying the ones you have. SCO UNIX provides a mechanism to simplify adding these input and output functions. There is a set of tools and utilities to modify and configure your system. These tools are collectively called the Link Kit, or link kit.

The link kit is part of the extended utilities package. Therefore, it is not a required component of your operating system. Although having the link kit on your system is not required for proper operation, you are unable to add devices or change kernel parameters without it.

Because the link kit directly modifies configuration files and drivers that are combined to create the operating system, a great deal of care must be exercised when making changes. If you are lucky, an incorrectly configured kernel just won't boot rather than trashing your hard disk. If you're not ... Well, is your resumé up to date?

Major and Minor Numbers

To UNIX, everything is a file. To write to the hard disk you write to a file. To read from the keyboard is to read from a file. To store backups on a tape device is to write to a file. Even to read from memory is to read from a file. If the file you are trying to read from, or write to, is a "normal" file, the process is fairly easy to understand. The file is opened and you read or write data. If, however, the device being accessed is a special device file (also referred to as a device node), a fair bit of work needs to be done before the read or write operation can begin.

One of the key aspects of understanding device files lies in the fact that different devices behave and react differently. There are no keys on a hard disk and no sectors on a keyboard. However, you can read from both. The system, therefore, needs a mechanism whereby it can distinguish between the various types of devices and behave accordingly.

In order to access a device accordingly, the operating system needs to be told what to do. Obviously the manner in which the kernel accesses a hard disk will be different from the way it accesses a terminal. Both can be read from and written to, but that's about where the similarities end. In order to access each of these totally different kinds of devices, the kernel needs to know that they are, in fact, different.

Inside the kernel are functions for each of the devices the kernel is going to access. All the routines for a specific device are jointly referred to as the device driver. Each device on the system has its own device driver. Within each device driver are the functions that are used to access the device. For devices such as a hard disk or terminal, the system needs to be able to (among other things) open the device, write to the device, read from the device and close the device. Therefore, the respective drivers will contain the routines needed to open, write to, read from and close those devices. (Among other things)
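Because the driver supplies its own open, read, write and close routines, a program talks to a device node with exactly the same system calls it would use on an ordinary file. The fragment below, for example, reads a little from a terminal device; only the choice of device name is specific to this illustration.

#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    ssize_t n;

    /* Opening a device node invokes the driver's open routine,
     * not filesystem code that reads data blocks.               */
    int fd = open("/dev/tty", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* read() here ends up in the terminal driver's read routine */
    n = read(fd, buf, sizeof(buf));
    if (n > 0)
        printf("read %ld bytes from the terminal\n", (long)n);

    close(fd);
    return 0;
}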

In order to determine how to access the device, the kernel needs to be told. Not only does the kernel need to be told what kind of device is being accessed, but also any special information, such as the partition number if it's a hard disk or the density if it's a floppy. This is accomplished by the major and minor numbers of that device.

The major number is actually the offset into the kernel's device driver table, which tells the kernel what kind of device it is. That is, whether it is a hard disk or a serial terminal. The minor number tells the kernel special characteristics of the device to be accessed. For example, the second hard disk has a different minor number than the first. The COM1 port has a different minor number than the COM2 port.

Figure 0-9 Process major and minor numbers

It is through this table that the routines are accessed that, in turn, access the physical hardware. Once the kernel has determined what kind of device it is talking to, it determines the specific device, the specific location or other characteristics of the device by means of the minor number.

In order to find out what functions can be used to access a particular device, you can take a look in the /etc/conf/cf.d/mdevice file, which contains a list of all the devices in the system. Aside from the function list, mdevice also contains the major numbers for that device. For details on the mdevice file, take a look at the mdevice(F) man-page.

So how do we, let alone the kernel, know what the major and minor number of a device are? By doing a long listing of the /dev directory (either with an l or an ls -l), there are two things that tell us that the files in this directory are not normal files. One thing to look at is the first character on each line. If these were regular files, the first character would be a -. In /dev almost every entry starts with either a b or a c. These represent, respectively, block devices and character devices. (The remaining entries are all directories and begin with a d. See the ls(C) man-page for additional details on the format of these entries.)

The second indicator that these are not "normal" files is the fifth field of the listing. For both regular files and directories, this field shows their size. However, device nodes do not have a size. The only place they exist is their inode (and, of course, the corresponding directory entry). There are no data blocks taken up by the device file, therefore it has no size. For device nodes, there are two numbers instead of one for the size. These are, respectively, the major and minor number of the device.

Like file sizes, major and minor numbers are stored in the file's inode. In fact, they are stored in the same field of the inode structure. File sizes can be up to 2147483648 bytes (2 gigabytes, or 2^31 bytes), but major and minor numbers are limited to a single byte each. Therefore, there can be only 256 major numbers and 256 minor numbers per major.
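You can watch the kernel hand back those two numbers with a few lines of C. The stat() call returns the device numbers of a device node in the st_rdev field; splitting that field into a high and a low byte, as below, matches the one-byte-each layout just described. (This byte split is an assumption carried over from the traditional layout; on systems with larger dev_t values you would use the major() and minor() macros instead.)

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;
    const char *node = (argc > 1) ? argv[1] : "/dev/root";

    if (stat(node, &st) < 0) {
        perror(node);
        return 1;
    }

    if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode))
        printf("%s: major %d, minor %d\n", node,
               (int)((st.st_rdev >> 8) & 0xff),  /* high byte: offset into the driver table */
               (int)(st.st_rdev & 0xff));        /* low byte: device characteristics        */
    else
        printf("%s is not a device node\n", node);

    return 0;
}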

As I mentioned before, the major number corresponds to whatever is listed in column 5 or 6 of mdevice. Once we have the name of a device, we can scan mdevice to find the name of the corresponding driver. Unlike some dialects of UNIX, SCO has made figuring out what each device does a little easier. Keep in mind that the system does this all internally and does not read mdevice while it is running.

For the most part, the names of devices provide some clue as to their function. Let's take a look at a few to get a better feel for what the names mean. I am going to go into a fair bit of detail about how hard disks are put together since they are the most commonly accessed devices and cause the most problems. Additionally, the hard disk numbering scheme provides a good demonstration of how minor numbers are used.

First, change directories to /dev (cd /dev) and do a long listing of a few files with the command:


l hd0* (Don't forget the asterisk)

If we have a typical system, this gives us entries that look like this:

brw-------   2 sysinfo  sysinfo    1,  0 Mar 23  1993 hd00
brw-------   2 sysinfo  sysinfo    1, 15 Mar 23  1993 hd01
brw-------   2 sysinfo  sysinfo    1, 23 Mar 23  1993 hd02
brw-rw-rw-   2 sysinfo  sysinfo    1, 31 Mar 23  1993 hd03
brw-rw-rw-   2 sysinfo  sysinfo    1, 39 Mar 23  1993 hd04
brw-------   2 sysinfo  sysinfo    1, 47 Feb 26  1994 hd0a
brw-r-----   3 dos      sysinfo    1, 48 Mar 19  1994 hd0d

Immediately, we can tell these are block devices by looking at the first character on each line. If we look at the major number ('1' in each case), we see that each of the listed devices has the same major device number. Therefore, we know that the devices we listed are all accessed using the same driver. In this case it is the device in position 1 of the driver table. If we take a look at mdevice, we find that this is the 'hd' driver and as you probably guessed, this is the hard disk driver. If we look at the name of each device (the last column) we see that they all begin with hd. This is not a coincidence.

Before we talk about what each of these files represents, let's do another long listing of some other files. This time let's do an l rhd*. The observant readers might have noticed that this is almost identical to the command we issued before. However, the extra 'r' gives us a slightly different output:

crw-------   2 sysinfo  sysinfo    1,  0 Mar 23  1993 rhd00
crw-------   2 sysinfo  sysinfo    1, 15 Mar 23  1993 rhd01
crw-------   2 sysinfo  sysinfo    1, 23 Mar 23  1993 rhd02
crw-------   2 sysinfo  sysinfo    1, 31 Mar 23  1993 rhd03
crw-------   2 sysinfo  sysinfo    1, 39 Mar 23  1993 rhd04
crw-------   2 sysinfo  sysinfo    1, 47 Feb 26  1994 rhd0a
crw-r-----   3 dos      sysinfo    1, 48 Mar 23  1993 rhd0d

If we look carefully, each line in this output differs from its counterpart in the first listing in only two places. As we see, the first character on each line is a 'c', representing a character device. In addition, each name is preceded with an 'r'. (That obviously shows up because that's what we listed.) This 'r' means that this is a 'raw' device. That tells us the system reads the device directly and does not use the buffer cache. (The hardware may do some caching of the data, but the operating system does not.) The different devices are more commonly referred to as block and character devices, although the character devices are often referred to as 'raw' devices as there is no caching of their input.

We need to side-step a little here. I mentioned earlier that the kernel maintains a table containing each of the configured devices. Well, this is only half true. The kernel maintains two tables, one for each type of device: block and character. The table for the block devices is called the block device switch table (bdevsw) and the table for the character devices is called the character device switch table (cdevsw). The offset into these tables is the major number. Saying that these are offsets into a table is true, but it is an oversimplification. Each entry in the table is actually a structure of pointers to the different functions that are used to access the device.

As I mentioned, there are only two differences in each line of the ls output above. The major and minor numbers are unchanged. Therefore, not only is the same driver being used, but in each case, the same flags are being passed to the driver.

By now, I am sure you are asking "Just what do all those minor numbers mean?" (Assuming, of course, you don't already know) Well, as I mentioned, they are flags to the device driver to tell it where to look. The driver knows that this is a hard disk. But since the hard disk can be divided into multiple partitions, an important question to the driver is "Which partition?" An even better question might be "What hard disk?", since more than one disk can be configured on the system.

As you might have guessed, which disk is accessed is handled by the minor number. Since each minor number is represented in the inode as a single byte, it can only be in the range of 0-255. If we look in /dev, we find no device with a minor number (or major number) greater than 255. Therefore, there can be only 256 devices of a particular type. Appendix A contains a quick review of binary counting. Use this as a guide when trying to match minor numbers with the location on the hard disk.

Having 256 hard disks seems like a lot and it doesn't seem all that common for someone to have so many hard disks. The problem lies in the fact that a hard disk minor number does not just tell what hard disk is being accessed, but also what partition and what filesystem. To be able to handle and process this information in an orderly fashion, the system needs some method of encoding it.

This is easily accomplished by accessing the byte that represents the minor number as a series of eight bits. In fact, this is exactly what the hard disk (and most other drivers) do. Table 0.3 contains the breakdown of the bits for hard disk minor numbers.


Bits:    7  6  5  4  3  2  1  0    Description

         X  X  -  -  -  -  -  -    disk # (0-3)
         -  -  X  X  X  -  -  -    partition # (1-4)
         -  -  -  -  -  X  X  X    division # (0-7)
         -  -  X  X  X  1  1  1    whole partition
         -  -  0  0  0  0  0  0    whole physical disk
         -  -  1  0  1  -  -  -    active partition
         -  -  1  1  0  -  -  -    DOS partition (i.e. hd0d)
         -  -  1  1  0  X  X  X    DOS partition (C-J)

Table 0.3 Hard disk minor number bit scheme

In this scheme, the two high-order bits (7 and 6) tell us which hard disk is being accessed. Considered by themselves, these two bits give us the numbers 0-3, exactly like the drive numbers they represent. However, considered with respect to the entire minor number, they represent the values 2^7 and 2^6 (128 and 64).

Depending on which bits are set, we end up with four ranges. If neither bit 6 nor bit 7 is set, the total value of these two bits is 0 and this is drive 0. This gives us the range of minor numbers 0-63, since this is as high as we can go with bits 0-5. If bit 6 is set, we add 64 to this range (2^6 = 64), so the range of minor numbers is 64-127 (0+64=64, 63+64=127). If bit 7 is set, we add 128 to this range (2^7 = 128), so the range of minor numbers is 128-191 (0+128=128, 63+128=191). Lastly, if both bits 6 and 7 are set, we add 192 to this range (64+128=192), so the range of minor numbers is 192-255 (0+192=192, 63+192=255).

From these numbers we see that any reference to a major number of 1 and a minor number between 0-63 is accessing something on the first hard disk. Anything with a minor number 64-127 is accessing something on the second hard disk, and so on.

The next bits (bits 3-5) represent the partition on the respective hard disk. At first, this seems odd, since there can only be four partitions on a drive and these three bits can represent eight different values. Four of the values are used to represent the individual partitions. Sometimes, it is necessary to refer to the entire disk, regardless of what partitions exist (for example, to write the masterboot block). Therefore, a fifth value is needed. This is the case when each of these three bits is off.

It is possible for any one of the four partitions to be active. SCO UNIX provides a means of addressing the active partition, regardless of how many partitions there are and where they lie on the disk. The notion of the active partition also takes one of the eight values; this is when bits 3 and 5 are set. Lastly, since SCO UNIX can access DOS partitions, the system needs to know that a partition is DOS as well. This is when bits 4 and 5 are set.

If we've counted right, this adds up to seven different values. The eighth one is simply not used. The reason lies in the numbering scheme itself. In order to keep things simple, three specific bits were used for the partition. Since there were not enough specific values to fill all eight, but too many for two bits, one value is simply ignored. This is the case when all three bits (3-5) are set.

The actual partition numbers are based on combinations of bits 3-5. The partition number is simply the numeric value of the three bits taken by themselves. Since there are three bits, this can be in the range 0-7. So, for example, if bit 3 is set, that would be equivalent to bit 0 if there were only three bits. This is partition 1 (2^0 = 1). If bit 4 were set, this would be equivalent to bit 1 being set. This is partition 2 (2^1 = 2). And so on.

(Personally, I think it was a wise decision not to start counting the partitions at 0 like with many other things. DOS doesn't. Therefore, if there were a DOS partition on a system, it might get confusing as to which partition was which.)

The last three (low-order) bits represent the filesystems (also referred to as divisions). Here again, there are eight distinct values. However, there can only be seven divisions within a partition. The eighth value represents the entire partition. This is necessary to write the superblock, as well as for certain database applications that require an entire hard disk partition to themselves. Should the partition be DOS, then the last three bits represent the logical partition (drives C-J).

In the ls -l output above, we see the representation of the individual partitions on the hard disk. The breakdown of the individual devices is somewhat difficult to grasp at first. The device hd00 is the entire first hard disk and we see that it has a minor number of 0. Since it refers to no particular partition, this device uses the values 0-7. The devices hd01-hd04 use the ranges 8-15, 16-23, 24-31 and 32-39, respectively. Divisions (filesystems) within these partitions are sequentially numbered from the start value (8, 16, 24 or 32).

From this we get that the first division of the first partition has minor number 8. This is the start value (8 in this case) plus the division number (0 in this case), or simply 8. The second division is 8+1=9, the third 8+2=10, and so on. The third division on the fourth partition would be 32+2, or 34. (Remember, we started counting at 0 for the divisions.)
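Putting Table 0.3 into code may make the bit fields easier to follow. The function below simply splits a minor number into the three fields described above; it is a reader's aid that restates the table, not something taken from the hd driver.

#include <stdio.h>

/* Split a hard disk minor number into the fields of Table 0.3. */
static void decode_hd_minor(int minor)
{
    int disk      = (minor >> 6) & 0x03;   /* bits 6-7 */
    int partition = (minor >> 3) & 0x07;   /* bits 3-5 */
    int division  =  minor       & 0x07;   /* bits 0-2 */

    printf("minor %3d: disk %d, partition bits %d, division %d\n",
           minor, disk, partition, division);
}

int main(void)
{
    decode_hd_minor(8);    /* first division of the first partition            */
    decode_hd_minor(34);   /* third division (division 2) of the fourth partition */
    decode_hd_minor(15);   /* hd01: partition 1, division bits 111 = whole partition */
    return 0;
}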

Next, we have the active partition on the first disk. This is simply one set of eight minor numbers higher, or 40-47. (Here bits 3 and 5 are set) The entire partition is represented as the device /dev/hd0a. This is the active partition, hence the 'a'. Since we add the division number within the partition to the partition start number, the first division of the active partition would have a minor number of 40.

This might have turned a light on for some of the readers who are more familiar with SCO UNIX. If we look at the root filesystem device (/dev/root), it has a major/minor number combination of 1,40. This is no coincidence. Since the root filesystem resides in the first division of the active partition, this matches exactly with the values we just got. Since the swap device is usually the second division, according to our calculations it ends up with a major/minor combination of 1,41. If we look at /dev/swap, we see that it has just what we calculated.

On OpenServer, /dev/root is no longer 1,40. Instead it was moved to the division after /dev/swap. Therefore, it is now 1,42. However, the numbers still fit conceptually.

Should the disk have a DOS partition this is represented by the device hd0d (hence the d). This starts at minor number 48 and goes up from there. In older SCO operating systems, you could only have one DOS partition per drive. However, if we take a look at the directory /dev/dsk, we see devices 0sC-0sJ, with minor numbers 48-55. These are the DOS devices for partitions on the first hard disk.

Up to this point, we've only talked about the first hard disk. What happens with multiple hard disks? Well, the calculations are quite easy. Since each disk is given a set of 64 minor numbers, we simply add 64 for each additional hard disk to any of the values we calculated above. Therefore, the entire second disk would be 0+64=64. The second filesystem on the active partition of the third drive would be 41+64+64=169, and so on. For additional details, take a look at the hd(HW) man-page.

From this scheme, we see that there are only four hard disks of any given type possible on the system. Several years ago, this seemed to be enough. However, as systems got larger, this became a serious bottle-neck. The result was the implementation of a scheme called extended minor numbers. To understand this, we need some more background, so I will postpone the discussion until the end of the chapter.

So what does this all have to do with the kernel? Well, when accessing any one of the hard disk partitions or filesystems, the kernel has to know where it is on the disk. By using the minor number, it is able to make that determination.

Since hard disks are the only devices that have partitions and divisions, they are obviously the only ones that require a scheme like this. However, other devices use the bits in the minor number in similar ways. In fact, all well-behaved devices follow it to some extent.

If we take a look at the devices representing the first floppy drive, we see some similar patterns. If we change into the /dev directory, an l fd0* gives us:

brw-rw-rw-   5 bin      bin        2, 60 Nov 09 08:03 fd0
brw-rw-rw-   5 bin      bin        2, 60 Nov 09 08:03 fd0135ds18
brw-rw-rw-   4 bin      bin        2, 36 Mar 23  1993 fd0135ds9
brw-rw-rw-   5 bin      bin        2,  4 Mar 23  1993 fd048
brw-rw-rw-   3 bin      bin        2, 12 Mar 23  1993 fd048ds8
brw-rw-rw-   5 bin      bin        2,  4 Mar 23  1993 fd048ds9
brw-rw-rw-   2 bin      bin        2,  8 Mar 23  1993 fd048ss8
brw-rw-rw-   2 bin      bin        2,  0 Mar 23  1993 fd048ss9
brw-rw-rw-   6 bin      bin        2, 52 Mar 23  1993 fd096
brw-rw-rw-   6 bin      bin        2, 52 Mar 23  1993 fd096ds15
brw-rw-rw-   4 bin      bin        2, 36 Mar 23  1993 fd096ds9

As we see, the major number for floppy devices is 2 and each name begins with fd, for floppy device. Following that are several characteristics of the device. Like the hard disk, the first digit is the drive number. This is followed (possibly) by the density in tracks per inch, whether the floppy is double- or single-sided, and the number of sectors per track. Note here that the floppy devices fd0 and fd0135ds18 have the same major and minor numbers. This is because the first floppy on my system is a 3.5" (135-TPI) drive.

Like the hard disks, the bit patterns tell us which drive. However, unlike the hard disk, the floppy devices start at the low-order bit. Therefore, it is bits 0 and 1 that represent the drive number. Bit 2 tells us whether the floppy is single- or double-sided (0=single, 1=double-sided). Bits 3-5 are not as easy to decipher, but still tell us the density and sectors per track. 96- and 135-TPI floppies have bit 5 set; 48-TPI floppies have it unset. Drives with 8 sectors per track have bit 3 set, 9 sectors have bits 3 and 4 unset, 15 sectors have bit 4 set, and 18 sectors have bits 3 and 4 set. Because there are fewer "types" of floppies compared to the disk/partition/division combinations, bits 6 and 7 are not used.
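As with the hard disk numbers, the floppy bit fields can be illustrated with a few lines of C. The decoding below just restates the bit assignments from the previous paragraph and is not taken from the floppy driver.

#include <stdio.h>

static void decode_fd_minor(int minor)
{
    printf("minor %2d: drive %d, %s-sided, density/sector bits %d\n",
           minor,
           minor & 0x03,                          /* bits 0-1: drive number        */
           (minor & 0x04) ? "double" : "single",  /* bit 2: sides                  */
           (minor >> 3) & 0x07);                  /* bits 3-5: density and sectors */
}

int main(void)
{
    decode_fd_minor(60);   /* fd0135ds18 from the listing above */
    decode_fd_minor(0);    /* fd048ss9                          */
    return 0;
}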

Although floppy disks do not have the same partitions and divisions as a hard disk, the floppy device driver uses minor numbers to tell it characteristics about the device just as the hard disk driver does. These are not the only devices that use this scheme. Any time there is more than one type of device, minor numbers are used to pass information to the driver. Rather than showing you a table for each of the device types, I will just point you to the appropriate man-pages. Many have tables that can make some of the more complicated numbering schemes clearer. In other cases, simply comparing names and functionality will help to clear things up. I go into a little bit of detail in the next section.

The Device Directory

Every UNIX dialect has a device directory. Lucky for us, all of them (at least all the ones I have seen) are called /dev. This gives us a little advantage when moving between platforms. Unfortunately, naming conventions are nowhere near uniform and it often takes a great deal of detective work to figure out which device does what. SCO does a good job of naming devices in a consistent and (usually) obvious manner.

Although the naming convention used by SCO makes it easy (easier than most) to figure out what each device does, it helps to have someone hold your hand and explain things. That's what I'm here to do in this section.

Before I start, I again need to point out something. The devices that I make reference to in this section may not be present on your system. All the devices I mention will appear at one time or another on an SCO UNIX system. Many of them only appear if you have certain packages and products installed. This is all based on ODT 3.0 and OpenServer Enterprise systems. If you do not have either of these releases or do not have some of the components installed, this section may refer to things that you do not have on your system. Don't go calling SCO Support saying you're missing things! Also, I will put off talking about the changes made specifically to OpenServer until the end of the section.

There are two ways to approach an examination of the /dev directory. We could go alphabetically through each device type and talk about what they mean. The problem is that devices of the same type are not necessarily arranged alphabetically. For example, the devices xct0 and rct0 both refer to a tape device. If we were to go through alphabetically, we would have to jump around a bit in terms of which device types we're talking about.

Instead, I chose to look at each type of device as a group. I also decided to review them somewhat in order of their significance. Some people will have different ideas on what is significant or not. However, certain assumptions can be made. It is difficult to imagine an SCO UNIX system where a user never accesses the hard disk. However, not having a mouse is a very common situation. Therefore, it makes a certain amount of sense to talk about hard disks first and if you get bored you can stop reading before you get to the section on mice.

What I intend to do here is talk about the structure of the /dev directory and what devices are represented by different sets of files. Since the /dev directory contains sub-directories as well as files, it has a certain structure to it that needs to be discussed. One problem I encountered when deciding how to approach this topic was the fact that the ODT 3.0 and OpenServer /dev directories are fairly different. New devices and new directories exist that we need to talk about. So, I decided to first talk about what things they have in common and then move on to what's new in OpenServer.

In a default ODT or OpenServer installation, there are literally hundreds of device files. Often devices exist with different names and in different directories despite having the same major and minor number. Therefore, a simple listing is not sufficient to get the whole picture. Let's first take a look at the subdirectories of /dev. Here we are not going to follow the order in which these directories appear, but rather their order of significance. (Here again, significance is relative and the choice really is arbitrary.)

There exists a pair of directories, /dev/dsk and /dev/rdsk. If we were to try to pronounce the names of these sub-directories, we would have a clue as to their function. Each contains disk devices, and if we look back at the section on major and minor numbers, we can guess that the rdsk sub-directory contains the raw disk devices. If you had bet some money on your guess, you would now be up a few bucks, because that's exactly what's going on.

Some of the readers who are more familiar with SCO UNIX might have asked "Why is there a sub-directory for disk devices when they already exist in /dev?" The answer is quite simple: The device files in /dev are not UNIX devices. Hmm? What are non-UNIX devices doing on a UNIX system?

To know what's going on here, we need a little history. UNIX is not the first operating system that SCO worked with. It started out with an 8086 version of XENIX, eventually moving to 80286 and 80386 versions. It wasn't until sometime after SCO had been working with XENIX for several years that it started working on UNIX.

One of the conventions used in dialects of UNIX is the device naming scheme. However, SCO had a strong following with XENIX and aside from the logic of the XENIX naming convention, changing names so abruptly was rather difficult. Following the convention, block disk devices reside in /dev/dsk and character disk devices reside in /dev/rdsk. This is not just for hard disks, but for floppy disks as well.

The next sub-directory of /dev that we are going to look at is /dev/mouse. For those readers who have already had at least one cup of coffee, it is pretty obvious to see that /dev/mouse contains devices specific to mice.

The /dev/inet directory is for files related to the inetd daemon. Rather than building routines into the applications themselves to access the different network layers, SCO has provided device nodes. For example, to access the IP layer, we have the device /dev/inet/ip.

It is not as easy to figure out what the remaining directories contain. For example, the name of the /dev/string directory by itself is unclear. If we look inside, we see three devices. These represent various strings that are used or created by the kernel when the system boots. Again, we will get into more detail later. The following directories are used by DOS Merge for the respective functionality:

vdsp - Virtual display devices

vmouse - Virtual mouse devices

vkbd - Virtual keyboard devices

vsdsp - Slave virtual display devices

vskbd - Slave virtual keyboard devices

vems - Virtual EMS devices

Hard Disks

Let's talk now about the disk devices. In /dev, these devices exist in name-pairs. For example, the first hard disk exists as /dev/hd00 and /dev/rhd00. The first one, /dev/hd00, is the block device associated with the first hard disk and /dev/rhd00 is the character (raw) device. I refer to these files as name-pairs because they have similar names, but are not the same file. No, these are not links; they exist as separate files in that they take up separate inodes. This is shown by using the -i and -l options to ls, as in:

ls -li /dev/*hd00

which yields:

167 brw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/hd00

162 crw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/rhd00

We can see that these files have the same major/minor numbers, but different inode numbers. Therefore, they are different files. For more details on links, see the section on filesystems.

From our discussion of major and minor numbers, we know that the hard disk driver is major number 1. It doesn't matter if the device node is in /dev or somewhere completely different. No matter what its name is, the key is the major and minor number. This has to be the case, otherwise the whole major/minor number scheme collapses. If we want to see all the devices using major number 1, that is, all the hard disk devices, we can do:

l /dev | grep " 1," # Note that there is space before the 1

l /dev/dsk | grep " 1," # and the comma after it

l /dev/rdsk | grep " 1,"

If we look at the hard disk devices in /dev and in /dev/(r)dsk, we see that they all have more than one link. If we do a long listing of each set of devices, specifying that we also want the inode number, we can see that each inode appears more than once. As an example, let's do l -i /dev/hd00 /dev/dsk/0s0 and we get:

167 brw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/hd00

167 brw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/dsk/0s0

then, l -i /dev/rhd00 /dev/rdsk/0s0

162 crw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/rdsk/0s0

162 crw------- 2 sysinfo sysinfo 1, 0 Mar 23 1993 /dev/rhd00

As we see, the devices in each pair are the same and have the same inode. However, the name-pairs /dev/hd00--/dev/rhd00 and /dev/dsk/0s0--/dev/rdsk/0s0 do not have the same inode number. The difference is that the SCO (XENIX) devices differentiate between character and block devices in their names and the SysV (UNIX) devices differentiate between them by the directory in which they appear.

Although both can be used and both access the exact same location on the hard disk, the XENIX names are more common in SCO literature. There is something here that I need to point out. The character and block devices are not the same. Not only are they accessed differently, they also have different inode numbers. Only the corresponding UNIX-XENIX pairs have the same inode number. However, all four devices have the same major and minor number combination!

Note that because of the way the system determines whether a device is a character or block device, the inodes need to be different. The kernel inode structure contains a field that specifies the type of file. This determines what ls -l outputs as the file's type. This is the first character of the permissions (b for block, c for character). Since the type of file is a single field in the inode and block and character devices are displayed differently, there must be an inode used for the block device and a separate one for its character equivalent.

As I mentioned in our discussion on major and minor numbers, each hard disk device represents a part of the hard disk. Both the UNIX and XENIX names give an indication as to where they are on the disk. With the UNIX names, the first character represents the physical hard disk and the last character represents the partition. If the last character is a 0 this signifies the entire disk. (Used to write the master boot block, among other things) So given the device 0s3, we know that it is the 3rd partition on the first (0th) physical drive. For the XENIX devices, the third character tells us the physical hard disk and the fourth character tells us the partition. Here again, the physical disks range from 0-3 and the partitions range from 1-4, with 0 reserved for the whole disk. Given the device hd31, we know this is the 1st partition on the fourth physical drive.

In both cases (UNIX and XENIX), if the last character is an a, then we know this is the active partition. Although the active partition only has special significance for the first hard disk, partitions on subsequent disks can be made active and the numbering scheme still applies. A final character of d means this is a DOS partition. In older versions, SCO UNIX could not access logical DOS partitions, so a single letter was sufficient. However, as soon as the drivers were in place to access multiple DOS partitions, additions needed to be made to the device node numbering and naming scheme. As a result, you will find devices in /dev/dsk and /dev/rdsk for these partitions. Devices /dev/dsk/0sC through /dev/dsk/0sJ represent the first eight DOS partitions per drive.

Up to this point, we've been talking about hard disk devices in very abstract terms. As you may already know, each hard disk is broken up into partitions. Any PC-based operating system needs to create at least one partition on the hard disk. SCO UNIX divides partitions up further into smaller units called divisions. The most common way of accessing the hard disk is the filesystems that reside within divisions.

Divisions do not necessarily have to have filesystems on them. An example of this would be the swap device. It resides within a division, but does not contain a filesystem. A filesystem is a division that has a particular structure. As the name implies, it is a system of files. All files, including device nodes, exist within the boundaries of a filesystem. The most familiar filesystem is the root filesystem. The device that points to this special location on the disk is /dev/root. (logical, huh?) The structure of filesystems is discussed in more detail in chapter 6.

If we look at the device node /dev/root, we see that it has a major number of 1, so we know it lies somewhere on the hard disk. The minor number is 40 which, from our discussion of major and minor numbers, we know is on the active partition of the first drive. There is nothing special about this particular location on the disk. In fact, which division the root filesystem is on was changed in SCO OpenServer.

By convention on ODT, the first filesystem on the active partition of the first hard disk is the root filesystem. The system knows that in order to function properly, it needs a device /dev/root. However, nothing prevents it from having a different minor number or residing elsewhere on the disk. For example, if there is only one partition on the first disk, the root filesystem could also be referred to with the major/minor number combination 1,8, where 8 is the first division on the first partition.

Note that if there is only one partition on the first hard disk and it is active, the major/minor number combinations 1,40 and 1,8 are the same place. If there is more than one, then the active partition does not need to be the first one. Therefore, the first filesystem on the active partition may not be 1,8. It may be 1,16 or 1,24 or 1,32. If you were paying attention, you know why the minor number goes up in increments of 8.

For the fun of it, I created multiple partitions with different SCO products (UNIX, XENIX, ODT). Rather than having to change the active partition when I wanted to switch to a different operating system, I left one partition active and passed bootstrings to /boot (See the section on starting your system for more details). Here I referred to the root and swap devices by their absolute minor numbers. In order to be safe, I even changed the device nodes to reflect this. So for the first partition, /dev/root was 1,8 and /dev/swap was 1,9.

In addition to these two devices, it is common to have a filesystem used exclusively for data, often /dev/u. If you have enough space, you will be prompted to create one during the installation. It is commonly the third division and has a minor number of 42. In addition to these familiar divisions, there is also another division whose function is not commonly known. This is /dev/recover. As its name implies, it is used in recovery processes, specifically when fsck tries to clean filesystems during boot. Since fsck does not know the state of the filesystems being cleaned, it cannot safely write to them while it is trying to clean them. (Kind of like trying to mop the floor you are standing on.) Instead, it writes its output to a reserved area of the hard disk called /dev/recover. Each of these devices also has its own 'raw' counterpart: rroot, rswap, ru and rrecover.

Floppy Devices

As with the hard disk devices, there are both XENIX and UNIX floppy devices. The XENIX devices reside in the /dev directory and begin with fd. As I mentioned in the section on major and minor numbers, the characters following the fd tell us characteristics about the device. Let's look at a typical example: /dev/fd0135ds18. The first character following the fd is the drive number. You will normally only find (at most) two floppy drives, so this is either a 0 or a 1. Next comes the density in tracks per inch (135 here), then whether the floppy is double- or single-sided (ds or ss) and, at the end, the number of sectors per track. In the above example, we interpret the device as follows:

Figure 0-10 Explanation of Floppy Device Names

As I mentioned before, both the /dev/dsk and /dev/rdsk directories contain device nodes for SysV disk devices. This includes floppy disks as well. Floppy devices here are recognized by the first character being an 'f'. Subsequent letters relate to specific characteristics of the device.

The second character tells whether it is floppy 0 or 1. The third character tells, not how many tracks per inch, but rather the diameter of the disk. Therefore a 5 here is for a 5.25" floppy and a 3 is for a 3.5" floppy. The fourth character is for the density, with possible values 'h' for high, 'd' for double, and 'q' for quad. The remaining values will vary. The following digits, such as 9, 15 or 18, represent the number of sectors per track. Many of these devices have aliases which represent the same device. For example, /dev/dsk/f05d9 and /dev/dsk/fd0d9d are the same device. For details, see the fd(HW) man-page.

There are a few floppy devices that require special attention. These are /dev/install, /dev/install1, /dev/dsk/finstall and /dev/dsk/finstall1 (and, of course, their character device counterparts). Like all the other devices, SCO typically uses the XENIX names, so /dev/install and /dev/install1 are the more commonly used. As one would guess, these devices are used for installation. When running custom to install additional software, the system looks at /dev/install by default. The really nice thing about these devices is that it doesn't matter what type of media you have: the floppy device driver will figure out what kind of floppy is in the drive. In fact, if you are unsure of the media you have, accessing /dev/install helps you figure it out. Should you need to access the second floppy drive (for example, if your installation media is a different size), /dev/install1 provides the same functionality.

Terminal Devices

Perhaps more common than floppy devices (at least used more often) are terminal devices. The first thing that may come to mind when thinking about terminal devices is the serial terminals that are attached to the system. While these are probably the most commonly used, they are not the only ones on the system. In fact, there is a wide range of device types that appear as tty devices in /dev.

If we did a directory listing of /dev, we would find at least five different tty (terminal) devices. Let's first talk about the serial terminal devices that one might want to attach to a standard serial (COM) port. These are probably the only ones that everyone thinks about when they talk about terminal devices. If we look at some serial devices in the device directory with:

l /dev/tty[12][a-dA-D]

we get something like this:


crw-rw-rw-   1 bin      bin        5,   0 Dec 14 20:40 /dev/tty1a
crw-rw-rw-   1 bin      bin        5,   1 Dec 14 20:40 /dev/tty1b
crw-rw-rw-   1 bin      bin        5,   2 Dec 14 20:40 /dev/tty1c
crw-rw-rw-   1 bin      bin        5,   3 Dec 14 20:40 /dev/tty1d
crw-rw-rw-   1 bin      bin        5, 128 Dec 14 20:40 /dev/tty1A
crw-rw-rw-   1 bin      bin        5, 129 Dec 14 20:40 /dev/tty1B
crw-rw-rw-   1 bin      bin        5, 130 Dec 14 20:40 /dev/tty1C
crw-rw-rw-   1 bin      bin        5, 131 Dec 14 20:40 /dev/tty1D
crw-rw-rw-   1 bin      bin        5,   8 Dec 14 20:25 /dev/tty2a
crw-rw-rw-   1 bin      bin        5,   9 Dec 14 20:25 /dev/tty2b
crw-rw-rw-   1 bin      bin        5,  10 Dec 14 20:25 /dev/tty2c
crw-rw-rw-   1 bin      bin        5,  11 Dec 14 20:25 /dev/tty2d
crw-rw-rw-   1 bin      bin        5, 136 Dec 14 20:25 /dev/tty2A
crw-rw-rw-   1 bin      bin        5, 137 Dec 14 20:25 /dev/tty2B
crw-rw-rw-   1 bin      bin        5, 138 Dec 14 20:25 /dev/tty2C
crw-rw-rw-   1 bin      bin        5, 139 Dec 14 20:25 /dev/tty2D

(NOTE: Most systems will only have tty1a, tty1A, tty2a and tty2A. The above devices would only exist if you had a 4-port, non-intelligent serial card on both COM1 and COM2.)

If we look closely, we begin to see that here, too, there are patterns in the relationship between device names and major/minor numbers. The most obvious pattern is that each of these devices has a major number of 5. By looking in /etc/conf/cf.d/mdevice, we can see that this is the major number for the serial device driver (sio). The first terminal device in this list is tty1a. This is the first terminal attached to COM1 and, as you might guess, it has a minor number of 0. Each subsequent terminal device increases its minor number by 1, as you would expect. This ends at minor number 3, as there are only four tty devices attached to COM1 in this example.

Following these devices are more terminal devices with very similar names. The only difference is that, rather than ending in lowercase letters, they end with uppercase letters. If we look, we can see that tty1A has a minor number that is 128 above tty1a, tty1B has a minor number that is 128 above tty1b, and so on. This is accomplished by simply using the high-order bit (2^7) as the flag to determine how the device should be accessed. The uppercase letters represent modem-control ports and the lowercase letters represent non-modem-control ports. For more details on the differences, see the section on modems in the chapter on hardware.

If we look at the devices attached to the second COM port (tty2a, tty2A, etc) and compare them to the devices on COM1 we also see a pattern. In each case, the minor numbers of the COM2 devices are 8 above their COM1 partner. This means that there can be only 8 terminal devices attached to a non-intelligent serial port.
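Again, the pattern can be summarized in a couple of lines of C. This just re-expresses the observations above (modem control in the high bit, eight minors per COM port) and is not taken from the sio driver.

#include <stdio.h>

static void decode_sio_minor(int minor)
{
    printf("minor %3d: COM%d, line %d, %smodem control\n",
           minor,
           ((minor & 0x7f) >> 3) + 1,       /* groups of 8 minors per COM port */
           minor & 0x07,                    /* line within the port            */
           (minor & 0x80) ? "" : "no ");    /* high bit: modem-control device  */
}

int main(void)
{
    decode_sio_minor(0);     /* tty1a */
    decode_sio_minor(131);   /* tty1D */
    decode_sio_minor(9);     /* tty2b */
    return 0;
}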

Things change a fair bit when dealing with intelligent multiport boards (those that do the I/O processing themselves). They require their own drivers and therefore use different major numbers. In addition, the numbering scheme is dependent on the manufacturer, but usually follows a similar scheme.

The next kind of terminal devices that most people are aware of are the console multiscreens. Using these devices it is possible to have several different "screens" on the system console. By default there are twelve console multiscreens, which are accessed through the keyboard combinations ALT-F1 through ALT-F12. For additional details on accessing these, see the multiscreen(M) man-page.

Although each of these gets input from the same keyboard and output appears to be going to the same screen, the system treats them as separate devices. When you switch screens, the kernel is actually displaying different parts of its memory. You can see how much memory is reserved for these screens by looking for the %console line in the hardware screen at boot or by running hwconfig. This will probably be set at 64K. When you switch screens, the kernel knows which screen to display by the key combination pressed.

However, there is more to keeping track of each screen than just displaying the right image to the monitor. Each time you login, you have a totally new process. Input and output are independent of any of the other screens. The kernel maintains an internal table of which process belongs to which screen and manages this through the minor number of the console devices (tty0-11). In addition to the console multiscreens, there are three more console devices that are used at various times.

The terminal device used by the system administrator in single-user (maintenance mode) is /dev/console. Next is /dev/systty. This is the device to which system error messages are output. There is also the device /dev/syscon. This device is used by the init process to communicate with the system administrator during system startup and while in single user mode. Note that these three are almost always linked together.

Not quite a console device, but using the same major number (3) and therefore the same driver, is /dev/tty. This is a special terminal device that represents the terminal that you are currently on. Or, to be more correct, this is the control terminal associated with the process group of a given process. It is useful in programs or shell scripts to ensure that output is written to the terminal no matter how output has been redirected.

Try doing a date > /dev/tty. No matter where you are, it will always appear on your screen.

Parallel Devices

Another kind of device that is very familiar to users is the printer. Printers are most commonly accessed through either serial or parallel ports, although ones with built-in networking cards are becoming more popular.

If it's a serial printer, it is accessed through one of the tty devices attached to a COM port that we mentioned above (tty1a, tty2a, etc.). If it's a parallel printer, then it's accessed through the devices /dev/lp, /dev/lp0, /dev/lp1 and /dev/lp2. The device /dev/lp is the default and is probably linked to one of the others. (If not, there is something unusual on your system.) On most machines there is only one parallel port and this is probably /dev/lp0. Therefore, /dev/lp is probably linked to /dev/lp0.

Although these are the printer devices that appear by default on your system, they are not the only ones that the system will recognize. By default the system responds to device requests for service through interrupts. On busy systems, interrupts may be missed and printing slows. This is because an interrupt is generated every time the printer wants to tell the operating system to send more characters. In order to speed things up a bit, device nodes can be created that cause the parallel driver to poll the ports (ask them specifically if they have work to be done), rather than relying on interrupts. If polling devices exist, the parallel port devices might look like this:

crw-------   2 bin      bin        6,  0 Nov 15 18:28 /dev/lp
crw-------   2 bin      bin        6,  0 Nov 15 18:28 /dev/lp0
crw-------   1 bin      bin        6, 64 Dec 16 20:40 /dev/lp0p
crw-------   1 bin      bin        6,  1 Mar 23  1993 /dev/lp1
crw-------   1 bin      bin        6, 65 Dec 16 20:40 /dev/lp1p
crw-------   1 bin      bin        6,  2 Mar 23  1993 /dev/lp2
crw-------   1 bin      bin        6, 66 Dec 16 20:40 /dev/lp2p


Along with the standard parallel devices /dev/lp, /dev/lp0, /dev/lp1 and /dev/lp2, we see three polling devices: /dev/lp0p, /dev/lp1p and /dev/lp2p. The minor numbers for the standard parallel ports start at 0 and go up to 2. Here again, we see devices making use of bit patterns to correspond to different functionality. By turning on bit 6 (2^6) for the polling devices, we end up with minor numbers that are 64 above their non-polling counterparts.
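Should you need to create one of these polling nodes yourself, the mknod command will do it. Just as a sketch, using the major number of 6 and the minor numbering scheme shown in the listing above:

mknod /dev/lp1p c 6 65
chmod 600 /dev/lp1p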

Tape Drives

A somewhat less well-known, but very important piece of hardware on a UNIX system is the tape drive. Most end users don't think about this kind of device until they accidentally erase a file and need some way of getting it back. Although tape drives come in all shapes and sizes, both internal and external, the device nodes they are accessed through remain fairly consistent.

The "standard" device is /dev/rct0, which stands for Raw Cartridge Tape 0. It usually has a major and minor number of 10,0. I say "usually" because this is one of the few instances where the major and minor are not always the same. Depending on what kind of tape driver you have, the "real" tape drive is linked to /dev/rct0. As a result instead of it's default major and minor, it has the one appropriate to the tape drive being used. (WARNING: That is the number zero at the end not the letter "Oh”. I have had many customers who suddenly lost all the space on their hard disks by backing up to /dev/rcto or /dev/rctO. As a result they end up with a 150Mb file called /dev/rcto.)

Let's look at an example. Assume that a QIC-02 tape drive is installed on the system (more details on this in the chapter on hardware). This will be accessed by the device /dev/rct0 with the major and minor numbers 10,0. If I later decide to replace this tape drive with a SCSI one, I can install the SCSI tape drive and the system asks if I want this to be the default tape drive. If I answer yes, the SCSI tape drive (/dev/rStp0) gets linked to /dev/rct0. If I do a long listing with the inode, the two devices look like this:


137 crw-rw-rw-   3 root     root      46,  0 Dec 15 17:42 /dev/rStp0
137 crw-rw-rw-   3 root     root      46,  0 Dec 15 17:42 /dev/rct0

All of a sudden, the device /dev/rct0 has a different major number and therefore a different device driver. (Note that the major number 46 may not be the same on your system. SCSI devices can be added at any time so they are dynamically assigned major numbers when they are added.)

As you can see, these two device nodes (/dev/rct0 and /dev/rStp0) not only have the same major and minor number, but also the same inode. Therefore, they are the same device (file). If you were looking at /dev/rct0 and didn't see the 10,0 you might think that something was wrong with the system. (Well, at least I did during one of the first calls I took where the customer was having tape drive problems.)

Another commonly used tape device is /dev/xct0. This is used to issue control commands to the tape drive (ioctls for you programmers). By default, this device keeps its major and minor numbers of 10,128. Control commands include things like rewind and retension.

This major/minor number pair causes problems if you have a SCSI tape and you try to issue tape commands. The best thing is to either link your control tape device (i.e. /dev/xStp0) to /dev/xct0 or change the entry in /etc/default/tape to either /dev/rStp0 or /dev/rct0. The minor number for any control tape device is 128. So the major and minor number for a SCSI control tape device would be 46,128. This is what the control device looks like with a SCSI tape drive:

144 crw-rw-rw- 1 root root 46,128 Mar 23 1993 /dev/xStp0
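To make the tape commands use the SCSI control device, linking the nodes is enough. Just as a sketch, using the device names from this example (the old /dev/xct0 is removed first so the link can be created):

rm /dev/xct0
ln /dev/xStp0 /dev/xct0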

In addition to these two, another tape device is /dev/nrct0 (or /dev/nrStp0). The 'n' simply means that this is a non-rewind device. This can be used if you want to store multiple archives on the same physical tape. In the /dev directory the entry would look like:

143 crw-rw-rw- 1 root root 46, 12 Mar 23 1993 /dev/nrStp0

Care needs to be exercised when using this device. Some tape drives will eject (unload) the tape when the tape process completes successfully. If you are doing backups of multiple volumes, ejecting the tape is not something you want. Therefore, built into this device is a no-unload mechanism. It is bit 3 that determines whether this is a no-rewind device or not, so a purely no-rewind device would have the minor number 8 (2^3). A no-unload device is determined by bit 2, giving a minor number of 4 (2^2). Since the minor number here is 12 (8+4), we know that this device is both no-rewind and no-unload. Bear in mind that normal cartridge tapes cannot be ejected, so there is no need for a no-unload device /dev/nurct0. The no-unload device looks like this:

crw-rw-rw- 1 root root 46, 4 Mar 23 1993 /dev/nurStp0
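As an illustration of why the no-rewind device is handy, here is how you might put two separate tar archives on the same physical tape (the directories are just examples):

tar cvf /dev/nrStp0 /etc
tar cvf /dev/nrStp0 /usr/spool

Because the tape is not rewound after the first command, the second archive is written immediately behind the first.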

Also for SCSI tape drives, the unload device /dev/urStp0 is automatically linked to /dev/rStp0. That's why there was a '3' in the links column for the listing of /dev/rStp0. The third link is for /dev/urStp0. Should you have OpenServer and your tape drive supports error correction, you can use the device /dev/erct0. However, not all tape drives support error correction. So, although this device exists, you may not be able to use it. If so, /dev/erct0 would look like this:

crw-rw-rw- 2 root other 10, 32 Dec 22 16:53 /dev/erct0

Other SCSI tape devices, such as DAT or Exabyte, simply use the same tape devices as a SCSI cartridge tape, /dev/rStp0.

Another common kind of tape drive is one that uses the floppy controller. These are the QIC-40, QIC-80 and Irwin tape drives. The QIC-40 and QIC-80 are referenced by the floppy tape device 'ft0'. Since only one floppy tape drive is supported, there will never be an 'ft1'. As with the other tape devices, there is the raw device, /dev/rft0, and the control tape device, /dev/xft0.

Each of these is linked to a 'mini' tape device, since QIC-40s and QIC-80s are often referred to as mini tape drives. Please note that if you are installing one of these, there is a special entry in mkdev tape. The Mini-Cartridge entry is for Irwin tape drives only.

A long listing of the device nodes for a QIC-40 or QIC-80 might look like this:

crw-rw-rw-   2 root     other     13,  0 Dec 20 18:46 /dev/rctmini
crw-rw-rw-   2 root     other     13,  0 Dec 20 18:46 /dev/rft0
crw-rw-rw-   2 root     other     13,128 Dec 20 18:46 /dev/xctmini
crw-rw-rw-   2 root     other     13,128 Dec 20 18:46 /dev/xft0

As you can see here as well, the control tape devices are again 128 above the standard device. In the above example, the tape drive was installed as floppy unit 1, hence the minor number of 0 for /dev/rft0. Should the tape drive be installed as unit 2, the minor number will be 1; as unit 3, the minor number will be 2, and so on. As with the other kinds of tape drives, 128 will be added to these numbers for the control device.

Although Irwin tape drives have their own drivers, they are physically the same as QIC-40 and QIC-80 tape drives. The device for an Irwin tape drive as unit 1 is /dev/rmc0 with a major and minor number of 33,0. As unit 2, this would be /dev/rmc1 with a major and minor of 33,1. Since both QIC-40/80 and Irwins are referred to as mini-tape devices, either will be linked to /dev/rctmini and /dev/xctmini depending on which one is installed.

CD-ROM Devices

If you installed with a CD-ROM or subsequently added one to your system, there will be device nodes for two kinds of CD-ROMs. The first kind is the one most people are familiar with. This kind can be mounted and accessed like any filesystem. The two devices created for a CD-ROM of this type might look like this:

brw-rw-rw- 1 root other 51, 0 Nov 28 19:48 /dev/cd0

crw-rw-rw- 1 root other 51, 0 Nov 28 19:48 /dev/rcd0

As with hard disks and floppies, the raw device (/dev/rcd0) has the same major and minor number as the block device (/dev/cd0). The other type of CD-ROM device is used only for installation of software. This is the so-called CD-Tape device. The device nodes for this kind of CD-ROM would look like this:

crw-rw-rw- 1 root other 50, 8 Nov 28 19:47 /dev/nrcdt0

crw-rw-rw- 1 root other 50, 0 Nov 28 19:47 /dev/rcdt0

crw-rw-rw- 1 root other 50,128 Nov 28 19:47 /dev/xcdt0

We see here devices that are similar in name and minor numbering scheme. The 't' right before the device number indicates it is the CD-Tape device. The device with the minor number of 128 is again the control device. The third device, /dev/nrcdt0, is similar in functionality to /dev/nrStp0. It is a no-rewind device. However, unloading the CD is not something normally done, therefore this is not a no-unload device.

Mice


For the most part, mice are only used on the main console or with X terminals. Rarely are they used in conjunction with anything other than X-Windows. However, there is the usemouse utility that allows you to use a mouse with vi or sysadmsh.

There are three basic types of mice. I say "basic types" because there are only three types of drivers that you can install. If the pointing device you are using is commonly referred to as a trackball, the signals it sends to the computer are the same as if it were a mouse. (You roll the ball forward and the cursor moves up. You move the mouse forward and the cursor moves up as well.) The same thing applies to cordless mice.

If you install any mouse on your system, the associated device resides in /dev/mouse. (Logical place, huh?) Should you install a serial mouse, the system will actually use the same serial driver as does a standard COM port. This makes sense as the signals coming from the mouse are essentially the same as if it were a terminal or modem on that port. On my system with a Logitech trackball, I have two devices in /dev/mouse:

crw-rw-rw- 2 bin bin 5, 0 Feb 26 1994 logitech_ser0

crw-rw-rw- 2 bin bin 5, 0 Feb 26 1994 mouseman_ser0

As we can see, these devices have a major number of 5, which is the standard serial driver. Adding a keyboard mouse creates the device /dev/mouse/kb0, with a major number of 20. Bus mice have a major number of 16 and have the name /dev/mouse/bus0 or /dev/mouse/bus1.

In addition to the better-known mouse devices, there are dozens of additional devices in /dev/mouse. These are the slave pseudo-mouse devices, with names of the form /dev/mouse/mp? and a major number of 17, and the master pseudo-mouse devices, with names of the form /dev/mouse/pmp? and a major number of 18. (It is very common for master/slave pairs to have major numbers that are off by 1.)

Miscellaneous Devices

The directory /dev/string contains special devices that are used as the system is booting. The first thing used is /dev/string/boot. This is the string that is built from user input at the boot: prompt as well as from the contents of /etc/default/boot.

Should you want to install a boot-time loadable driver, the device /dev/string/pkg contains information about the devices being linked in. The last of the three, /dev/string/cfg, contains the concatenation of the configuration strings shown at boot. These are the lines that show up as %disk, %floppy, etc as the system is booting. Do a cat of /dev/string/cfg and compare that to the output of hwconfig.

At this point we begin to break into the more esoteric device nodes on the system. (As if the devices in /dev/string were not esoteric enough.) All of the devices we have talked about so far most users have seen at one time or another. With over a hundred different devices to go, there are only a few that the average user has seen before. Although many of these are pretty familiar in terms of being recognizable as part of the system, very few are common enough for even the most seasoned administrator to have cause to use directly.

In most cases there are only a couple of devices of each type. Therefore, the minor numbering schemes that we talked about above just aren't needed. Instead, minor numbers are assigned relatively arbitrarily. It is difficult to decide where to begin, so let's flip a coin and start with the device nodes that represent something physical.

The first set is the video devices, or display adapters. These all have the major number 52. If we look in /etc/conf/cf.d/mdevice we see that major number 52 is the da device, which stands for display adapter. These are simply the drivers for your video card. The names are obvious and the devices in this group are: /dev/cga, /dev/color, /dev/colour, /dev/ega, /dev/mono, /dev/monochrome, and /dev/vga.

Next we have memory devices. These have a major number of 4. To access the system's physical memory, you would use /dev/mem. Its partner, /dev/kmem, is used to access kernel memory. Lastly, we have /dev/null, aka "the bit bucket." This device essentially represents non-existence. To send output to never-never land, you redirect it to /dev/null. To destroy the contents of a file, you can send the contents of /dev/null into that file.
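For example, to throw output away or to empty out a file (here 'bigfile' is just a made-up name):

ls -lR / > /dev/null
cat /dev/null > bigfile

The first command produces no output at all; the second leaves bigfile with a length of zero.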

Another memory device is /dev/ram00. This represents a 16K ramdisk, which is the smallest the system allows. It has a major number of 31. Since this is the smallest size one can make a ramdisk, it is logical that the minor number be 0. However, you can create a larger ramdisk if you want. You could also create multiple ramdisks of a given size. For more details on ramdisks, see the ramdisk(HW) man-page.

Next are a few devices that access the CMOS and the system clock. These are: /dev/clock and /dev/rtc, which both act as interfaces to the hardware real-time clock and have a major number of 8; /dev/cmos, which is the interface to the system CMOS; and /dev/mcapos, which is the interface to the CMOS clock. Both of the last two have a major number of 7.

The next device, /dev/prf, used by the operating system profiler, stands alone. It has no other devices related to it. This device is used to interface with profiling information and addresses. Its major and minor numbers are 9,0. Another device that stands alone is /dev/error. Its major number is 32 and, like the other "loner" device, it has a minor number of 0. The purpose of this device is to make error messages available to system daemons.

Should you have system auditing enabled, your system will be busily accessing the two devices /dev/auditr and /dev/auditw. These are, respectively, the audit read device and the audit write device. The audit daemon uses /dev/auditr to get audit collection information. Applications that are authorized to write audit records will write them to /dev/auditw. These have the major and minor numbers of 21,0 and 21,1 respectively.

Changes in OpenServer

The first change I noticed to the device nodes was that /dev/root is no longer major/minor 1,40, but rather 1,42. The reason for this is the introduction of the /dev/boot filesystem. The /dev/boot device is at 1,40, since during boot things are expected to be in the first filesystem of the active partition. Here is where the boot program lies, as well as the kernel itself. Once the kernel is running, access to the root filesystem is made through the device node /dev/root and is independent of the major and minor number. Because of this, it was easier to put /dev/boot in the first filesystem and then mount it onto /stand, rather than changing dozens of programs to boot somewhere else. There is more significance to the /dev/boot filesystem that we get into in later sections.


Another change is moving the device /dev/prf into /dev/string. In doing this, it no longer has a major/minor number of 9,0. Instead, it was changed to conform to the pattern of the other devices in /dev/string. Now its major number is 34 and its minor number is 4.

New devices within /dev/byte provide an unlimited, continuous stream of bytes with a constant value. The major number they use (23) is for the 'byte' device driver. These devices can be used for testing purposes. Maybe you want to test a serial device, where you want to ensure that the characters coming through the line have the same value. The example in the byte(HW) man-page describes sending a stream of 0x07 through /dev/tty2d. Since this is the <BELL> character, it would make a good test. Once the connection is made, you hear the bell.

In my mind, the man-page is a bit confusing in terms of what value is sent. It says that, "The value of each byte is the same as the minor device number of the file." Okay, what value? If I have a device /dev/byte/hex/37 with a minor number of 37, should the device output a hex 0x37 or decimal 37? The answer is that it has to be the decimal value. Consider the octal values in particular. If you wanted to send a byte with the decimal value of 255, this would be 377 in octal. Since the minor number cannot be over 255 (minor numbers are stored in one byte), there is no way to create a device with that minor number.

The reason for the different directories is to make things easier for humans. For example, if I wanted to send a stream of decimal 37, I could create a node (using the mknod command) in /dev/byte/decimal called 37 with a minor number of 37. If you did a cat of this device, you'd get a stream of percent signs (ASCII 37) filling up your screen. You could also create a device /dev/byte/hex/25 that would do the same thing.

To make life easier, I could even create a new directory, /dev/byte/ascii, where the names of the device nodes are their ASCII characters. Using the example above, I could create a device /dev/byte/ascii/% that looked like this:

crw-r--r-- 1 root sys 23, 37 Jun 8 11:24 /dev/byte/ascii/%

Catting this device gives me the same stream of percent signs.
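If you want to try this yourself, the node can be created and tested with mknod and cat, assuming the byte driver's major number of 23 shown above:

mknod /dev/byte/ascii/% c 23 37
cat /dev/byte/ascii/%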

There is also a special device /dev/zero that can be used as an unlimited source of zero-value bytes.

Also new to OpenServer is the /dev/table directory. Each device here has the major number 34, which is for the tab driver. These devices provide access to various system tables. The minor number of the device determines which table is read. This scheme allows commands and utilities access to the tables without the need to understand where the tables reside in kernel memory. Note that the information these devices provide is in a binary format that the appropriate command knows how to read. Simply doing a cat of one of these devices can result in a locked terminal or other such nastiness. (Trust me, I know.)


Name        Minor Number   Description              Structure
proc        16             Processes                proc
pregion     17             Process regions          pregion
region      18             Regions                  region
eproc       19             Process extensions       eproc
file        20             Open files               file
inode       21             Active i-nodes           inode
s5inode     22             S5/AFS i-node cache      s5inode
diskinfo    24             Drive partitioning       diskinfo
dkdosinfo   25             DOS partitions           dkdosinfo
clist       26             Character buffers        cblock
mount       28             Mounted filesystems      mount
flckrec     29             Outstanding file locks   filock
avenrun     30             Run averages             short int
sysinfo     64             System information       sysinfo
minfo       65             Paging and swapping      minfo
extinfo     66             Other information        extinfo
v           48             System configuration     var
tune        49             Tunable parameters       tune

Table 0.4 Contents of /dev/table

OpenServer introduced the concept of virtual disks. Although the concept is not new to the computer world, OpenServer is the first SCO release to provide them. Although we get into more details later, we can briefly describe them as taking portions of one or more physical disks and making them appear as if they were one logical (or virtual) drive. The device nodes associated with virtual drives are of the form /dev/rdsk/vdisk# or /dev/dsk/vdisk#, where # represents the number of that virtual drive.

Also new to OpenServer is direct support for floptical drives. Floptical drives allow access to normal 3.5" floppies, but also to special magneto-optical floppies that can store up to 21 Mb of data. We'll cover the physical characteristics of floptical drives in the section on hardware.

The device driver used to access flopticals is Sflp, with a major number of 48. The minor device numbering scheme is similar to that for hard disks and floppies. Table 0.5 shows how the bits are defined. The devices themselves reside in /dev/dsk if they are block devices and /dev/rdsk if they are character devices. Device names have the format:

fp<device_number>3<density>

where <density> is d for 720KB (double density), h for 1.44MB (high density) and v21 for 21MB (very high density).



Bits:
  9   8   7   6   5   4   3   2   1   0    Description
  -   -   -   -   -   -   -   0   0   0    720KB (DD)
  -   -   -   -   -   -   -   0   0   1    1.44MB (HD)
  -   -   -   -   -   -   -   1   0   0    21MB (VHD)
  -   -   -   -   -   0   0   -   -   -    Reserved
  -   -   -   -   0   -   -   -   -   -    Set to zero
  -   -   -   0   -   -   -   -   -   -    Floppy disk
  X   X   X   -   -   -   -   -   -   -    Unit # (0-6)

Table 0.5 Bits for Floppy Device Minor Numbers

Another new, and in my opinion interesting, set of devices are those accessed through the marriage driver. These devices reside in the /dev/marry directory and allow you to access a regular file as if it were a device node. You can even create filesystems on such devices, mount them and even run fsck, if you are so inclined. The only permanent file is /dev/rmarry, with a major number of 76 and a minor of 0. Other devices will also have a major number of 76, but their minor number is dependent on the order in which they were created. Unless you've established marriages before, you probably won't have a /dev/marry directory.

If your machine is capable of Advanced Power Management (APM), then you may have a /dev/pwr/bios device on your machine. The concept of APM is that your machine can detect when it hasn't been used for a long time and can turn off certain functions. Note that not all machines that have this ability can communicate with the operating system.

If you have a PCI bus on your machine, you can gain access to it using the /dev/pci device. This has major number 22.

In our discussion on shell basics, I talked about the three initial file descriptors (0, 1 and 2) that represent, respectively, standard input, standard output and standard error. OpenServer provides three device nodes that scripts or programs can access without having to explicitly know where these file descriptors point. These are /dev/stdin, /dev/stdout and /dev/stderr. All have a major number of 24, which is for the dup driver. Opening one of these devices is equivalent to issuing a dup() system call, which makes a duplicate of the file descriptor. The minor numbers of these three devices are the same as the file descriptors they represent: /dev/stdin has a minor number of 0, /dev/stdout has a minor number of 1 and /dev/stderr has a minor number of 2. The directory /dev/fd is used for duplicates of file descriptors 0 through 99.
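For example, a shell script can complain on its standard error without knowing where that file descriptor actually points (the message is just an illustration):

echo "Something has gone wrong" > /dev/stderr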

The Link Kit

One of the major advantages SCO has is that it tries to be as open as possible. This means that SCO tries to support as many different kinds of devices from as many different vendors as possible. SCO supports more different kinds and manufacturers of devices than any other UNIX vendor. For sales people this is a major advantage because they can often sell products based on the fact that certain hardware is "100% compatible" with a name brand that SCO supports. This is not to say that sales people try to deceive you. Rather, the customer often upgrades an operating system for hardware they already have. The vendor may claim compatibility, but this becomes a nightmare for SCO Support.

For example, SCO has never made the claim that it supports the WonderTrack SCSI Tape Drive1. However, since the manufacturer claims that it is "100% compatible" with an Archive 5150, which SCO does support, customers believe that SCO is obligated to help them if things go wrong.

The real bottom line on this issue is whether or not the drive is manufactured by Archive. If so, and it just has a different label on it, SCO support goes from the assumption that it is an Archive in disguise and they will try to help as best they can.

Unfortunately things are not always that easy. Many brands claim compatibility, not because they actually are the brand they claim to be, but rather because their engineers used the supported brand as a model and built their new drive based on that model. The result is a crap shoot at best.

However, don't tell some customers that. Many, if not most, will insist that since the manufacturer claims it is 100% compatible, then every operating system vendor is obligated to support it. This goes for SCO as well.

I have talked with at least one customer with this attitude. Although the SCO SCSI drivers worked with almost every other kind of SCSI tape drive, they did not work with his; therefore, he reasoned, the SCO driver must be defective. It wasn't a hardware problem, he insisted, because the tape drive worked fine under DOS. (No need to mention that in the box the tape drive came in was a DOS driver from the drive manufacturer. That obviously had nothing to do with it working under DOS and not under SCO.)

If the drive manufacturer had supplied an SCO driver, then it's obviously not the same one that SCO supports, and the customer would need to talk to the manufacturer. Since there was no driver, the customer's assertion was that it worked just like the "real" one and therefore was supported by SCO. Since there was a "bug" in our driver and we support the tape drive that his is compatible with, we were obligated to stay on the phone with him until we had a resolution. At least, that's what he contended.

The Flow of Things

Anyone who has installed an SCO UNIX system has probably gone through at least one kernel relink. Depending on the speed of the computer (both processor and I/O sub-system) and the number of devices, this process may take several minutes. During that time very little activity appears on the screen. If it weren't for the flashing hard disk light one might think that the machine had hung. After a few minutes, you are reassured that the machine has, in fact, not hung, but is continuing the task of building a new kernel.

Many of you may have run into the problem, when trying to add a new device to the system, that the installation script says the link kit needs to be installed before you can add the driver. I have talked with many customers running ODT 3.0 who believe that there is something wrong with their system because the link kit is missing. All this means is that when the system was first installed, whoever did the install just didn't install the link kit. If the installation script for this new device does not ask you if it should install the link kit, you can use custom or the Software Manager. The link kit is part of the Extended Utilities package in ODT, but part of the CORE package in OpenServer. Because a kernel relink is needed for the changes to take effect, you cannot change kernel parameters without the link kit being installed.

In the process of a kernel relink (or kernel rebuild as it is often called) several major things take place. One of the first is the construction of tables based on kernel parameters. These parameters are at values that are either set by default or set by the system administrator. They are commonly referred to as "kernel tunable parameters," or "kernable tunables" for short. (Not really. One of the first UNIX gurus I met in SCO support used to get tongue-tied when explaining kernel tunable parameters and would often slip and say "kernable tunables." Ever since then, it has sort of stuck within the support department.)

Another major part of the kernel rebuild is the linking of all the appropriate drivers into the system. I say "appropriate" because some drivers that exist on the system may have to be explicitly linked into the kernel. As I mentioned before, these drivers serve as the interface between the operating system and the hardware. In addition to the drivers, the system must ensure that all the necessary device nodes associated with each driver are present on the system. Remember that the system uses the device nodes to access the physical hardware. What's the point of having a device driver if there is no way to access the device?

For the most part, every system administrator understands that much about a kernel relink. Once the relink is finished, the changes we just made will be a part of the new kernel and once we reboot, these changes will take effect. New drivers are added and old ones are removed. If we need to change kernel parameters, this is the way to do it.

If you are like me, the knowledge that a new kernel is being built is not enough. You want to know more about what is happening during that rebuild. What steps is the kernel going through as it creates a new, slightly modified version of the operating system? In order to better understand the details of a kernel rebuild, we should first look at the big picture.

But what good is knowing all this? To answer that, I need to tell you a story from SCO Support. One of my favorite customers to talk to was John Esak. Those of you who have been around for a while might remember John as the publisher of "The Guru" magazine. After he decided to stop doing "The Guru," John became a columnist in "SCO World" magazine. Through this, he often called into SCO Support.

I remember one call late in the afternoon during the last couple of months I was at SCO. John had installed a driver for a SCSI device that seemed to have given his link kit a severe case of indigestion. He had backups, but the only problem he had was that the relink failed miserably. Even after removing the driver, he was still getting errors.

Now John is not a stupid man. In fact, he is quite bright and I enjoy talking with him for that very reason. I know that when he calls, he has gone through all the 'basics' and that the issue is probably somewhat complex. This call was no exception.

One thing that comes in handy is listening to what the system has to tell you. The error messages that the relink was generating are often very useful in figuring out what is wrong. Fortunately, this was one of those cases.

During the relink, John was getting error messages about a missing device in mdevice (/etc/conf/cf.d/mdevice). (What mdevice is, we'll get to in a minute) When we checked mdevice, sure enough, the device was missing. I read the line in my mdevice file to John and we added this in the right spot in his mdevice. Crossing our fingers, we tried the relink again. This time we got the same error message, but referring to a different device.

Rather than replacing and relinking for every conceivable device, we decided to simply go through my mdevice file, line by line, to see what John was missing. As it turned out, the removal script for that driver had removed all references to every other SCSI device on the system! Not a good thing to do. After we had replaced the missing entries, John was able to relink and reboot without any errors at all.

This is also a good case study to remind us that anytime you want to make major changes to your system or add unsupported drivers, it's a good idea to do a backup of your system or at least the link kit.

What does this have to do with the subject at hand? Well, if neither John nor I had an understanding of how a kernel relink works, we might have been forced to restore from the backups. Since both of us do know, we had the problem solved in about 30 minutes.

Knowing what happens during a kernel relink is not only important if something goes wrong, but is also helpful in understanding how the kernel is built. This includes what components go into the kernel and what process the system goes through to create a new kernel. Knowing how your operating system is put together is helpful in administering it. This is also helpful in understanding many of the concepts and processes that we will get into later.

So, what's it all about? Well, the kernel is composed of a few core files, several dozen device drivers and a handful of configuration files. These device drivers are the sets of routines that the kernel uses to access everything else that is not part of itself. Everything from RAM to terminals is accessed through device drivers. The configuration files are used by the kernel to set internal parameters and define the characteristics of the device drivers and other aspects of the system. These files, along with the programs used to build a new kernel, are jointly referred to as the "link kit."

With the exception of a few odds and ends, the link kit resides in the directory /etc/conf. For simplicity's sake, I will make references to files and directories relative to this directory. Among the more significant sub-directories is the bin directory, which contains most of the programs used to build the kernel. The sdevice.d directory contains information about the particular hardware (Base address, IRQ) and whether or not it should be linked into the kernel. Paired with this are the sub-directories in pack.d. These directories contain the object code modules for the kernel and device drivers as well as configuration files.

The init.d directory contains information used to create a new inittab file. The inittab  file is what the system uses when booting to determine what processes it should start and when. Using the file cf.d/init.base as its starting point, the files in init.d are added to the end of init.base. These normally contain information about terminals that are attached to the system and whether the system should run a getty process on them to allow logins.

The working directory for the kernel build is cf.d. Using files in this directory, the kernel determines such things as which drivers should be included and what kernel parameters need to get changed and to what values. Since the work gets accomplished based on the information in the files in cf.d, this is probably the best place to start.

The first step in creating a new kernel is the program link_unix. This is found in the /etc/conf/cf.d directory and is started either by the system administrator or automatically by some installation scripts. You can start a relink through the Hardware/Kernel Manager of SCOAdmin on SCO OpenServer, through sysadmsh in ODT, or in both cases by running the command /etc/conf/cf.d/link_unix directly. (Both sysadmsh and the Hardware/Kernel Manager end up calling link_unix.)
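In other words, to start a relink by hand, all you need is:

cd /etc/conf/cf.d
./link_unix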

The first thing link_unix does is to remind you that this is a relatively lengthy process and ask you to "Please wait." After the lights flash and the disk rattles, you are asked a couple of questions and a few moments later a new kernel is resting quietly in your root directory (/stand on OpenServer). What happened? Aside from asking you if you wanted the kernel to boot by default and whether or not you wanted the kernel environment to be rebuilt, the system gave you little indication of what it was doing.

The first place to start is the link_unix program, which resides in /etc/conf/cf.d. Since link_unix is a shell script, we can look at it. Because it is a small file, we can look at its entire contents with a simple cat link_unix. However, you can use any other viewing tool such as more, pg, or view.

In order to describe the relink process within a reasonable number of pages, I need to make some assumptions. The first is that you have general knowledge of how UNIX command syntax works. The second is that you have a basic understanding of shell programming. Neither is required to understand the concepts in this section. If you have trouble with this, take a quick look at the section on shell basics or the sh(C) man-page.

After the comments, the link_unix program establishes the environment it is going to run under. This allows you to rebuild a kernel in a different environment than the one that exists on that machine. For example, you might want to use the kernel on a different system that has different drivers or different kernel parameters. You could even rebuild the kernel for a version of the operating system that was different than the one you were on. This is also useful when you want to test out a new driver or new kernel configurations without destroying your existing link kit.

In establishing the relink environment, we run across one of the first things that is different between the two releases. If you remember from our discussion on SCO basics, one of the key concepts in the new release is the idea of Software Storage Objects (SSOs). Many of the components of the link kit are symbolic links. Since many of the references to files during the relink are relative to the current directory, we have to ensure that all paths are evaluated correctly.

The evaluation of the paths is accomplished through the shell script /etc/conf/bin/path_map.sh which is "dotted" from within link_unix. The functions within path_map.sh are now available to link_unix and any script it calls. Rather than sidetracking too much, I will simply suggest that you take a look inside of path_map.sh to see how things are getting remapped.

In OpenServer, we now have to "build" the paths that were taken for granted in ODT 3.0. The last line of the script does an exec of idbuild passing to it all the arguments passed to link_unix. Which idbuild program is started, depends on the previously built paths. This is really where things get going.


Figure 0-11 The flow of a kernel relink

Since the link_unix script does an exec of idbuild, this is a good place to go from here. Luckily, idbuild is also a shell script, so we can examine it the same way we did link_unix. Well, not exactly. Idbuild is six times larger than link_unix, so examining it takes a little more work. We could start our examination with the top line and dissect everything from there. However, this does little to help our understanding of what's going on compared to the amount it would annoy, and maybe confuse, us. Fortunately, there is something built into the shell that we can use. If we put a set -x as the first line of idbuild, we can see what this script is doing as it progresses. If you remember from our discussion on shell basics, this echoes every line to the screen prior to executing it.

However, if you just added the set -x and started link_unix, when it got to idbuild things would simply scroll off the screen. Therefore, we need to slow it down a little. We can do this by piping the output to more or pg. Remember that the set -x sends its output to stderr, so to send that through the pager the command would look like:

link_unix 2>&1 | more

Let's start our examination by adding the set -x as I described above and then starting link_unix. If you have a standard 25-line terminal, you only see comments in the first screenful. Although in many scripts the comments tell you little more than copyright information, the comments here give you a quick overview of what's going on. Pressing the space bar once (assuming we are in more) brings us to the first lines of commands that are executed.

The first few things done are setting up variables to use later. One is the Preserve variable, which determines whether or not intermediate files created during the rebuild should be removed. If set to YES, all the intermediate files are left alone when the rebuild is finished. Otherwise they get removed. Keep in mind that this does not mean the system will not have to recompile them the next time. It does. All this means is that there are a lot of .o files lying around, which might be useful for debugging drivers or if you are curious.

New to OpenServer is the ability to include as well as exclude certain drivers. These are governed by the Aflag, Bflag and Exclude variables.

One aspect of the relink that can be changed is the development system used. By default, all the tools the relink uses are relative to the root directory of the system (/). This is defined by the DEVSYS variable. Being able to redefine the development system is useful, for example, when you want to use a different set of programs and libraries to do the relink. Maybe you have a different version of SCO Development System or one from some third party.

In addition to this, we also define the root directory of the relink. By default this is / as well. In ODT 3.0, this is determined by calculating the directory name that is three levels up from where we start link_unix. Since link_unix is started in /etc/conf/cf.d, three levels up is /. Although the result is the same directory either way, this is three levels up from /etc/conf/cf.d and not /etc/conf/bin, where idbuild resides. This is because we are still in /etc/conf/cf.d when we start idbuild as ../bin/idbuild.

Let's say we wanted to use an alternate link kit and we started the relink with /test/etc/conf/cf.d/link_unix. Then ROOT=/test. If it does turn out that ROOT is /, it needs to be altered slightly for certain constructs later. As a result it ends up as ROOT=/., with the period included. This is to prevent the system from thinking that, for example, the file //bin/date is the program /date on the machine "bin," as would be the case on a machine running XENIX-Net or LAN Manager. To ensure the names come out right, paths are built starting with /./ instead of //. Not to worry. The system interprets these correctly.

In OpenServer, the ROOT variable is checked. If it is set to anything other than /, it is left alone. If set to /, then it is reset to the null string (ROOT="").

Continuing through the file, we find the familiar message about the system rebuild and the request to Please Wait. We have seen that the root directory for the rebuild is not hard coded. The idbuild script tells us what it has determined ROOT to be with the line:

echo "\t Root for this system build is ${ROOT}"

After defining a variable used for path definitions (ID) and some locks (IDLCK), idbuild creates a flag that says the relink has begun:

>$ROOT/etc/.unix_reconf

At first glance, this looks pretty insignificant. All it does is create a zero length file. Although this file has zero length, it is very important as it serves as a flag. If for some reason the rebuild is not successful, the system needs to know that it had started in order to recover properly. We'll talk more about this file later.

Next, idbuild removes files that may have been left lying around and that it will later recreate (the several lines with rm -f). This is done to prevent any of the other utilities and programs from failing, as many of them cannot overwrite existing files.

Then, all the object files are cleared out of the cf.d directory. If this is a normal relink, this directory is /etc/conf/cf.d. We previously did a cd into $ID/cf.d on ODT 3.0 and into ${cf_d} on OpenServer. If we haven't redefined $ROOT to anything for testing, these are based on /etc/conf. Therefore, we'll end up in /etc/conf/cf.d prior to starting to remove these files. However, even if we started the relink as /test/etc/conf/cf.d/link_unix, the paths are properly parsed. We would end up with $ROOT set to /test and $ID or ${cf_d} set to /test/etc/conf.

At this point, all the stubs.o and space.o files are removed from the sub-directories in /etc/conf/pack.d as well. The pack.d directory contains the object modules for all the drivers on the system, as well as relevant configuration information in the form of C source code modules.

After this clean-up, idbuild uses awk to process the mdevice file. (normally /etc/conf/cf.d/mdevice) The mdevice file is the device driver module description file. It contains descriptions of the characteristics of each of the device drivers. Using the names that awk pulled out of the mdevice file, idbuild creates a new, temporary file, sdevice.new by concatenating the contents of the files in the sdevice.d directory. Whereas the mdevice file contains descriptions of all the possible device drivers, the sdevice file contains information only on the installed drivers.
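The following is only a simplified sketch of this step, not the actual code in idbuild, but it shows the idea: the driver names are pulled out of mdevice with awk, and the matching files in sdevice.d are concatenated into sdevice.new:

for drv in `awk '{ print $1 }' mdevice`
do
    [ -f ../sdevice.d/$drv ] && cat ../sdevice.d/$drv
done > sdevice.new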

In OpenServer, the processing of mdevice and sdevice is different. The primary reason is the ability to include or exclude specific drivers. Here, the flow of idbuild changes based on which of the flags is set. For example a normal relink follows one path, whereas one linking in just the base devices would follow another. One reason for just linking in the base devices is trying to get your kernel to fit on a boot floppy. If you link in all the drivers, it may be too large.

After creating the sdevice.new entry, the idbuild script creates two configuration files in the cf.d directory: sfsys and mfsys. These contain information about the filesystems configured. This is done by cat'ing the contents of the files in /etc/conf/sfsys.d and /etc/conf/mfsys.d.

Next, the video devices available are configured by idvidi. The program idvidi is also a shell script and sets up for the program idvdbuild. Unfortunately, idvdbuild is a binary program and since we don't have access to the source code, we can't take a look inside. Let's just say that the idvdbuild program configures the installed video devices. Should this process fail, temporary files left behind are cleaned-up and then the idbuild script exits.

Next, the idconfig program configures all of the drivers in the sdevice.new file. The idconfig program's primary responsibility is to read the system's configuration files and report any conflicts or errors. Here again, should this process fail, it cleans up temporary files left behind and then exits the idbuild script.

Despite the fact that idconfig is a binary program, we ought to look at it in some detail. By watching a relink carefully and making some intentional mistakes in some of the configuration files, you can get a feeling for the flow of things.

One thing this program is responsible for is to ensure the overall integrity of the various configuration files. If we look back at mdevice, we see that there are nine columns. Should one of the entries have fewer than nine columns, the system has no way of knowing which column is missing. Therefore, it has to stop to avoid creating an unbootable kernel.

As we mentioned earlier, adding or reconfiguring drivers is not the only reason that the kernel needs to be rebuilt. Changing the values of kernel tunable parameters also requires a rebuild of the kernel before the changes can take effect. Ensuring that these values are legitimate is also a function of the idconfig program. The first thing read is the mtune file. Since this is an ‘m-file', this is the master tuning file, and it contains the default values for the tunable kernel parameters. In addition, this file also defines suggested maximum and minimum values for these parameters.

Note that in some cases the maximums can be exceeded and all that happens is that you get a warning during the rebuild. However, I recommend exceeding those values only where you personally know the value will not cause trouble on your system.

Although it is possible to change the default value of a kernel parameter by directly editing the mtune file, this is not recommended. Defining changed values is the responsibility of the stune file. After reading the mtune file, idconfig reads the stune file to determine if any value has been changed. On any system, this file will contain some values, even if you have never made any changes yourself. These values are "changed" during the original installation relink.
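An entry in stune is nothing more than the parameter name followed by the new value, one per line. The values below are only made-up examples, not recommendations:

NBUF    1000
NHBUF   512
MAXUP   100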

As idconfig wraps up its business, one of the last things it does is to create the conf.c, config.h, vector.c and fsconf.c files. Without going into detail, these are configuration files created by idconfig. The idconfig program outputs the individual lines of these files, including all the #include statements, the necessary external declarations and the arrays that are defined in these files. In fact, idconfig generates everything in these files.

The conf.c file contains external declarations and structure initialization for the kernel and driver routines. The header file for this is config.h and, like other header files, this one contains #defines. The vector.c file contains information similar to conf.c, but it concerns itself solely with the interrupt vector table. Lastly, the fsconf.c file contains the filesystem routine prototypes. After it has created all these files, the idconfig program has finished its work. It then exits and the flow of the kernel rebuild is back in the hands of idbuild.

One thing that I have to point out is a comment that appears in each of these files, which says:

/*

* This file is automatically generated by idconfig(ADM),

* usually run by idbuild(ADM). *** DO NOT EDIT ***

*/

Listen to it! Do not edit any of these files. If you're very lucky, all that will happen is you end up with a kernel that won't boot. Why is that lucky? Well, if you are unlucky, you trash your filesystems.

The next program run from idconfig is idscsi and, as its name implies, it is responsible for the SCSI devices on the system. The idscsi program could have been included in idconfig. However, by itself it is half the size of the entire idconfig program. From a programmer's point of view it's much easier to maintain as two separate programs. There is also the logic that, because of the very nature of SCSI devices, they should be handled separately.

For example, in order to access an IDE hard disk, the kernel needs to access the IDE controller. However, to access a SCSI hard disk, the kernel needs to know not only how to access the SCSI hard disk, but also how to access the SCSI host adapter. This same level of complexity applies to any SCSI device. I realize that this is a vast oversimplification. However, it does show how SCSI devices add an additional level of complexity to the system.

The basic configuration file for SCSI devices is cf.d/mscsi. It contains information about what SCSI devices are on the system, how they are configured and what SCSI host adapters they are attached to. However, mscsi is not the only file used. As I mentioned earlier, mdevice contains configuration information about all the devices on the system, including SCSI devices. If there were no entry in mdevice for a particular SCSI device, idscsi would issue a warning message. This would also happen if you tried to add a SCSI device to an adapter that didn't exist.

In OpenServer, having the Bflag or Aflag set changes the flow of things slightly at this point. If either is set, then idbuild creates an mscsi file based on the installation kernel. This assumes several default values for your SCSI configuration.

After the idscsi program is finished, all the preparations for creating a new kernel in ODT 3.0 are finished. However, if you have OpenServer, there are a couple of things left to do. First, we create the device registry using the idmkreg program. Next, we create a configuration file necessary for the kernel STREAM linker. This is done by the /etc/kslgen script.

At this point idbuild begins to compile and link the new kernel. Compile? Link? Don't you need a compiler to do that? Isn't the compiler part of the SCO Development System? Right on both counts. However, in order to allow a kernel rebuild, there must be a compiler, linker and even an assembler on the system. These programs are hard to find and even harder to use. They were designed with one task in mind: creating a new kernel.

Experience has told me that there are people out there who are going to try to use these to create programs of their own. I say it three times: Don't! Don't! Don't! Using these programs to compile source code is not supported by SCO, so don't even bother calling them. They can't help you. As I said before, these programs do one thing and that is to create the new kernel. They have nowhere near the complexity needed for an end user to be able to use them to create run-time applications. Besides, most of you don't have any of the standard libraries, so the compiler is of very little use. If you need to write C programs, get a copy of the SCO Development System.

As a front end (maybe to hide things from foolish users), idbuild calls idmkunix. This is the program that does all the compiling, assembling and linking. The most obvious parts of the process are the files that were just created by idconfig. Since these need to be part of the kernel in order to have any effect, they must first be compiled, which creates the respective .o files.

These .o files are a few of the object modules that go into the kernel. As I mentioned earlier, each driver exists as an object module in the appropriate directory in /etc/conf/pack.d (the Driver.o files). They too must be linked in. Each driver marked with a Y in the sdevice file is included.

Those of you who have already poked around may have found that there are a few object files that find their way into the kernel which are not listed in mdevice and sdevice. These are the object files that compose the "kernel of the kernel" and reside in /etc/conf/pack.d/kernel. It kind of makes sense that you don't need to tell the kernel to link in these files. It's possible to run without some of the "extras" like hard disk drivers. However, without the basic components of the operating system nothing runs.

In addition to the object files, the majority of the drivers have configuration files associated with them. These are the space.c files, most of which are in the sub-directories of pack.d. Should a driver not need a configuration file, it still needs prototypes and include files. These find their way into the kernel by means of the stubs.c files, which are also in the sub-directories of pack.d.

Once the modules are linked to form the new kernel, idmkunix has finished its work. It then exits and flow returns to idbuild, which then does a bit of clean-up. Some of the more observant readers may have noticed references to scodb, among other things, in the latter part of idbuild. This portion of idbuild enables you to create a debuggable kernel. However, this is not important to you unless you are a developer of SCO-specific device drivers. Besides, scodb is not a shipping product anyway.

At this point, idbuild creates a file called $ROOT/etc/.new_unix. Note that, as with most things, this is relative to the root of the kernel rebuild. If you have created a directory tree for testing, then this is relative to that root. If not, it is relative to /. The .new_unix file has no contents, but simply serves as a flag that a new kernel has been created. Next, idmkenv is called. Since it is a script, we can take a look at it.

One of the first things idmkenv does is to check for the existence of the .new_unix file. If .new_unix doesn't exist, there is no need to continue and the script exits.

If .new_unix does exist, idmkenv begins the process of rebuilding the kernel environment. An important part of the kernel environment are the device nodes. Without them, programs cannot access devices. (At least not well-behaved ones.) The nodes are created by the idmknod program, which idmkenv calls. Devices are created based on the entries in the node.d directory.

Let's side-step for a moment and think about what we would normally see on the screen during the relink. One question you are asked is whether or not you want the kernel environment rebuilt. In this context, the kernel's environment includes all the device nodes and the /etc/inittab file. Part of the process of rebuilding the kernel environment is the creation of a new /etc/inittab file. A temporary copy is created in /etc/conf/cf.d by combining init.base and the files in /etc/conf/init.d. This copy is then moved into /etc to replace the old inittab. If you choose not to rebuild the kernel environment, /etc/inittab will not get rebuilt.
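
Conceptually, the inittab portion of the rebuild boils down to something like the following. The real work is done by idmkinit, which is considerably smarter about merging entries, so take this only as a sketch of the idea and not the actual commands the scripts run:

cat /etc/conf/cf.d/init.base /etc/conf/init.d/* > /etc/conf/cf.d/init.new
cp /etc/conf/cf.d/init.new /etc/inittab

The practical consequence is worth remembering: anything you add by hand to /etc/inittab, but not to init.base or to a file in /etc/conf/init.d, will be lost the next time the kernel environment is rebuilt.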

At this point, the kernel rebuild is complete.

It is possible that during the installation of some 3rd party device driver the kernel relink failed miserably. If the rebuild failed, then possibly it never created the .new_unix file, so the corresponding section in idmkenv is never executed. However, the /etc/.unix_reconf file does still exist. It does not get removed until idmkenv reaches that section. If .unix_reconf exists and there is a problem, idmkenv enters the corresponding section and begins to clean up. The majority of its work is removing temporary and .o files, but an important part of its mission is to remove the last driver whose installation was attempted. This makes sense because it was probably the addition of the new driver that caused the relink to fail in the first place.

If the reason for the relink is simply changing kernel parameters, there is no need to rebuild the kernel environment. All the required device nodes are there and inittab doesn't need to be changed. Therefore, you can answer 'n' when asked whether the kernel environment needs to be rebuilt.

The final line of idmkenv is telinit q. This tells the system to re-examine the /etc/inittab file. The reason for this is that previously in idmkenv a new inittab was created. If changes had been made to any of the source files (/etc/conf/cf.d/init.base, /etc/conf/init.d/*), these changes would not take effect until the system was rebooted. By running telinit q, the changes are immediate.

At this point, idmkenv exits and returns control back to idbuild. The last line of idbuild is simply to exit. Since idbuild had been called by link_unix with an exec, idbuild returns control to the process that called link_unix, which is either the user's shell, an installation script or one of the administration programs. Once you've rebooted, the changes you made will be a part of your new kernel.
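
If you ever need to kick off the whole process by hand, rather than having a mkdev script do it for you, the traditional sequence is to run link_unix directly and then reboot. Answering 'y' to its questions tells it to make the new kernel the default boot kernel and to rebuild the kernel environment:

cd /etc/conf/cf.d
./link_unix
shutdown -g0 -i6 -y

The shutdown line is just one way of rebooting; use whatever your site normally does.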

Due to the complex nature of both device drivers and the kernel itself, you may have noticed that many issues were glossed over or skipped entirely. This is unfortunate, but necessary due to the limited amount of information you can pack into a book.

Key Link Kit Files

mdevice

The mdevice file is the device driver module description file. It contains descriptions of the characteristics of each of the device drivers. Let's take a look at part of an mdevice file. It consists of dozens of lines in nine columns. Table 0.6 contains a few lines of one mdevice file that we can use as reference.

sio        Iocrwip   iHctk      sio     0    5   1   100   -1
aud        Iocrwi    iocr       aud     0   21   0     0   -1
fd         Iocrwip   iHODbrcC   fd      2    2   1     2    2
ram        oc        ibk        ram    31    0   0     0   -1
vga        -         io         vga     0    0   0     1   -1
hd         hoc       irobCcGk   hd      1    1   1     1   -1
busmouse   Iocri     iHct       mous    0   16   1     2   -1
fp         -         iHor       fp      0    0   0     1   -1
ad         I         iHGt       aha     0    3   1     1   -1
nmi        I         ios        nmi_    0    0   0     1   -1

Table 0.6 Examples from /etc/conf/cf.d/mdevice

The first column is a mnemonic used by the kernel to identify that particular driver. In simpler terms, it is the internal name of that driver. This is the name used by the system during the relink and is what awk is pulling out. By convention, SCO documentation, such as the Device Driver Writer's Guide refers to drivers as xnamex. For more information on the naming scheme, take a look at <sys/cmn_err.h>.

The second column is a "function list" for the device. For example, if a device has an 'o' in this column, there is a function in its driver to open that device. An example of a device of this type would be a hard disk (hd). If there is no device to be opened, there would be no 'o' here. For example, the driver to handle a Non-Maskable Interrupt (nmi) only has an initialization routine, and there are no hardware-specific functions to be performed when the device is opened. Therefore, there is no open routine in this driver and there is no 'o' in the second column. For details of what functions are possible and what each entry means, see the mdevice(F) man-page.

The third column is for the characteristics of that device. Is it a block device like a hard disk or a character device like a terminal? Is this a SCSI device? Does it support 32-bit addresses for DMA-transfers? Is this device even required? All these questions and many more are answered by the characteristics list. This, too, is detailed in the mdevice(F) man-page.

The fourth column is the handler prefix of the device. This is the external name for the device and the device driver handler routines associated with this device. This usually matches the first column. Since this is normally limited to four characters, it cannot match long names such as busmouse. By convention, the driver routines are referred to with the prefix (e.g., xx) followed by the routine name (e.g., read). Thus, xxread() would be a generic read function and xxwrite() would be a generic write function. Since the prefix for the floppy driver is fl, the floppy read function would be flread(). (Here is an example where the driver prefix does not match the handler prefix.)

The fifth and sixth columns are the block and character major device numbers. Devices like hard disks and floppies can be accessed as either character or block devices. Therefore, there are entries in both columns of the hd and fd lines. The serial driver (sio) is only a character device, so there is no block major device number. Likewise, the ramdisk driver (ram) is only a block device, therefore has no character major device number. (For more information on major and minor numbers see the section on major and minor numbers.)

The seventh and eighth columns are defined by the device driver. In the case of the Adaptec host adapter driver (ad), these are, respectively, the minimum and maximum number of devices that can be associated with the driver. For the serial device driver (sio), this value is an offset into a table. In some cases, like the system auditing device (aud), these columns are ignored.

The last column is the DMA channel used by the device. Should a device not use DMA, like the vga driver, there is a -1 in this column.
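
If you want to see all nine columns for a particular driver on your own system, awk is the simplest tool for the job, since it doesn't care whether the fields are separated by spaces or tabs. For example, to pull out the hard disk and floppy entries:

awk '$1 == "hd" || $1 == "fd"' /etc/conf/cf.d/mdevice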

sdevice

In theory, the "m-files" (mtune, mdevice) are the "master" files. These should be considered static and unchanging. The "s-files" (stune, sdevice) are the "system" files and can be changed either directly or with the tools provided by the OS. These files reflect your system's configuration. One of the things contained in the files in sdevice.d is the software priority level (SPL). Since the SPL is in an "s-file", one would think that this is configurable. However, this should not be changed.

Each file in the sdevice.d directory has the same structure as the sdevice file. (Makes sense, since the contents of the files in sdevice.d are concatenated together to make sdevice.) If we take a closer look at the files, we can learn details about how specific devices are configured. Let's look at the file /etc/conf/sdevice.d/pa as an example. This is the file used to configure the parallel port driver. It should look something like this:

pa Y 1 2 4 7 378 37f 0 0

The first field on each line is the name of the device. The second field tells us whether or not the kernel should link in the driver for this device. A Y in this column means that the driver should be linked into the kernel. An N means that it shouldn't.

It should be noted that in many cases it is not sufficient to simply change this Y to an N and have everything work out right. Some devices are defined as required (with an 'r' in the third column of mdevice). Changing a Y to an N for a required device causes the relink to halt rather abruptly.

Just like the seventh and eighth columns in mdevice, the third field in sdevice is driver specific.

The fourth field is the interrupt priority level, which is the same as the System Priority Level (SPL). A normal value is anything from 1-7. If there is a 0 in this column, the device does not have an interrupt routine.

The fifth field is the type of interrupt the device has. Should the device not have an interrupt routine, this will be 0 like the fourth field. If a driver does have an interrupt routine and the interrupt cannot be shared, there will be a 1 here. If the driver can share an interrupt, a number from 2-6 will be here. This number will be the same for all devices sharing that specific interrupt. The parallel driver has a 4 here, therefore it can share an interrupt with any other driver with a 4 in this column. Like the SPL, this value should not be changed.

Field six is the actual interrupt vector used by the driver. This is the IRQ that the controller is jumpered or set to. If the fifth field contains a 0 (that is, no interrupt required), this field is ignored. Here we have a 7. As many of you may already know, this is the interrupt vector of the first parallel port.

If we take a look at the output of the hwconfig command or at the hardware screen during boot-up, there is usually at least one entry for a parallel port. If it's configured as lp0 (LPT1), then by default it sits at interrupt 7 and has a base address of 378 to 37f. Looking at the /etc/conf/sdevice.d/pa file, we see in the 6th through 8th columns the exact same values. This is no coincidence. (NOTE: The lp driver is not linked in by default, so there may not be an entry present on your system.)
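
You can make this comparison yourself. The exact output format of hwconfig varies a little between releases, but the interrupt vector and the I/O address range it reports should line up with what is in the pa file:

hwconfig | grep -i parallel
cat /etc/conf/sdevice.d/pa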

The last two fields are the start and end of the controller memory addresses, which are used by controllers that have internal memory, such as some intelligent multi-port cards. Some ranges are not allowed. See the sdevice(F) man-page for more details.

mfsys and sfsys

The /etc/conf/cf.d/mfsys file is the configuration file for filesystem types that will be supported when the kernel is relinked. Like many of the other configuration files, mfsys contains a single line for each of the filesystem device drivers that will be included. The source for mfsys is the files in /etc/conf/mfsys.d. Like the mdevice-sdevice pair, which files from mfsys.d are used is based on the files in /etc/conf/sfsys.d. Each of these files contains a single line with the name of the filesystem to be included and a Y if it should be included or an N if it should not.

The first field is the internal name for the filesystem type. Field 2 is the prefix of the handler functions in the fstypsw structure. The fstypsw structure is functionally the same as the device driver table, except we are accessing filesystem device drivers. Fields 3 and 4 are flags that are used to fill the fsinfo data structure table entry. Field 5 is a bitmap describing what functions are present.
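
To make this concrete, the sfsys.d side of things is about as simple as configuration files get. For a hypothetical DOS filesystem driver (the name here is just an illustration), the file /etc/conf/sfsys.d/DOS would contain a single line along these lines:

DOS Y

Changing the Y to an N and relinking would leave support for that filesystem type out of the new kernel.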

mtune and stune

As I mentioned before, 'm' files are the 'master' configuration files and the 's' files are those that define the current configuration for your system. So it is with /etc/conf/cf.d/mtune and /etc/conf/cf.d/stune. The mtune file is the master kernel tunable parameter file and stune contains the parameters that are different on your system. In ODT 3.0, most of these defined the absolute size of certain tables and values within the kernel. OpenServer introduced the concept of dynamically configurable parameters, so the parameters in OpenServer define the extent to which the tables can grow.

The format of the mtune file is:

parameter_name default minimum maximum

Unless changed by the stune file, the relink process takes the name and default value and creates the #define entries in /etc/conf/cf.d/config.h. If changed in stune, the system will verify that the value specified falls within the range defined by the minimum and maximum in mtune. If not, the system will complain during the relink and say that such changes can cause problems. Usually this message can be safely ignored.
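
The supported way to make such a change is with the idtune program rather than by editing stune with a text editor. A typical sequence (NPROC and the value 200 are just an example, not a recommendation for your system) would be:

/etc/conf/bin/idtune NPROC 200
cd /etc/conf/cf.d
./link_unix

idtune puts the new value into stune for you and complains if it falls outside the range given in mtune; the relink afterwards is what actually makes the change take effect.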

In ODT 3.0 it was occasionally necessary to change the maximum value of some of the parameters, since the maximum was just not high enough. Without making the change in mtune, the system would complain during the relink. In OpenServer things are a little better. Almost all of the parameters that had to have their maximums increased in ODT 3.0 are now configured dynamically. Therefore, they grow as the need grows. The list of such parameters can be found in mtune under the heading:

* Redundant Parameters, required for backward compatibility

In most cases, there are new parameters that sort of take the place of the older ones. As I mentioned a moment ago, these new parameters define to what extent the tables can grow and not the absolute value or size.

node.d

The files in the /etc/conf/node.d directory are used by idmknod during a kernel relink to create the necessary device nodes (idmknod is called from within idmkenv). Here again, each file consists of single-line entries which define the devices to be installed. The fields in each line are:

devicename nodename devicetype minor owner group mode

The devicename is the same as the first column of the corresponding mdevice entry. It is through this connection that idmknod is able to determine what the major number of the node should be.

The node name is what the device node will be called when it is created. In some cases (such as tab) the new device nodes are created in sub-directories of /dev. Therefore, the name of the sub-directory is prepended to the actual device name. In other words, these names are relative to the /dev directory. If they don't already exist, idmknod will create the necessary sub-directories.

The type of the file is what will eventually appear as the first entry in the permissions. If a b, the new node will be a block device node; a c means a character device node; an l means a hard link; and an s is for a symbolic link.

The minor field is what minor number will be used when this device is created. The owner, group and mode fields are all optional and define, respectively, the owner, group and permissions of the new node.

If the new node will be a link then the format of the line will be:

devicename nodename type oldname

Where oldname is the device node to which this new node should be linked.
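
To make the two formats concrete, a node.d file for a hypothetical driver (all the names and numbers here are made up for illustration) might contain lines like these:

xyz     xyz0    c       0       root    sys     600
xyz     xyzlink l       xyz0

The first line creates the character device node /dev/xyz0 with minor number 0, owned by root, group sys, with mode 600. The second creates /dev/xyzlink as a link to it.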


/etc/conf/bin/

Programs and utilities used to create the new kernel

/etc/conf/bin/idaddld

Used to add or remove line disciplines.

/etc/conf/bin/idbuild

Main program to build a new kernel.

/etc/conf/bin/idcheck

Returns selected information for use when installing new drivers.

/etc/conf/bin/idconfig

Configuration for installable drivers and tunable parameters.

/etc/conf/bin/iddeftune

Tunes kernel parameters based on memory size (OpenServer only).

/etc/conf/bin/idinstall

Add, delete, update, or get device driver configuration data.

/etc/conf/bin/idmaster

Add, delete or update drivers in mdevice.

/etc/conf/bin/idmkenv

Rebuilds the kernel environment.

/etc/conf/bin/idmkinit

Read files containing specifications to create a new inittab.

/etc/conf/bin/idmknod

Removes nodes and reads specifications of nodes.

/etc/conf/bin/idmkreg

Make device registry file (OpenServer Only)

/etc/conf/bin/idmkunix

Compiles and links the new kernel.

/etc/conf/bin/idreboot

Reboot script for installable drivers. Forces the user to reboot in a consistent manner.

/etc/conf/bin/idscsi

Configures tables for SCSI devices.

/etc/conf/bin/idspace

Verifies enough free space exists for a kernel rebuild.

/etc/conf/bin/idtune

Sets value of a tunable parameter.

/etc/conf/bin/idvdbuild

Makes console driver routine switch tables.

/etc/conf/bin/idvidi

Takes sdev and mvdev and makes appropriate information available

/etc/conf/bin/path_map.sh

Maps SSO directories (OpenServer Only)

/etc/conf/cf.d

Main configuration directory

/etc/conf/cf.d/conf.c

Contains function prototypes and structure initialization for the kernel and driver routines.

/etc/conf/cf.d/config.h

The header file for conf.c.

/etc/conf/cf.d/configure

Reconfigure the kernel.

/etc/conf/cf.d/direct

(OpenServer Only)

/etc/conf/cf.d/fsconf.c

File system configuration file

/etc/conf/cf.d/ifile

(OpenServer Only)

/etc/conf/cf.d/init.base

The base file used to create /etc/inittab.

/etc/conf/cf.d/initorder

Print the order in which xxinit() routines are called.

/etc/conf/cf.d/link_unix

Shell script to link the kernel.

/etc/conf/cf.d/majorsinuse

List of major device numbers currently in use.

/etc/conf/cf.d/mdev.hdr

Mdevice header information file.

/etc/conf/cf.d/mdevice

Device driver module descriptor file.

/etc/conf/cf.d/mevent

List of possible event devices on the system.

/etc/conf/cf.d/mfsys

Configuration file for filesystem types.

/etc/conf/cf.d/mscsi

SCSI device configuration file.

/etc/conf/cf.d/mtune

Kernel tunable parameter file.

/etc/conf/cf.d/mvdevice

Video driver backend configuration file.

/etc/conf/cf.d/routines

Finds driver entry points for each module.

/etc/conf/cf.d/sassign

File containing configurable default system devices

/etc/conf/cf.d/sdev.hdr

Sdevice header information file.

/etc/conf/cf.d/sdevice

Local device configuration file.

/etc/conf/cf.d/sevent

List of each event device on the system.

/etc/conf/cf.d/sfsys

Local filesystem type file. This file is composed of the individual files in /etc/conf/sfsys.d.

/etc/conf/cf.d/stune

Local tunable parameter file.

/etc/conf/cf.d/vector.c

Contains interrupt vector table information.

/etc/conf/cf.d/vectorsinuse

Lists interrupt vectors currently in use.

/etc/conf/cf.d/vuifile

Defines kernel load values and lower level functions

/etc/conf/cf.d/xdevmap

Drivers utilizing multiple major numbers for extended minor numbers.

/etc/conf/init.d

Directory containing files added to /etc/conf/cf.d/init.base

/etc/conf/mfsys.d

Directory containing configuration information on the various types of filesystems.

/etc/conf/node.d

Directory containing list of device nodes to be created

/etc/conf/pack.d

Directories containing the actual device drivers

/etc/conf/pack.d/Driver.o

Object module for the respective driver

/etc/conf/pack.d/space.c

Configuration file for the respective driver

/etc/conf/pack.d/stub.c

Stub file if no space.c file.

/etc/conf/sdevice.d

Directory containing the component parts that go to form /etc/conf/cf.d/sdevice.

/etc/conf/sfsys.d

Directory containing local system information about each file system


Table 0.7 Components of the link kit

Extended Minor Numbers

So what happens if you do have more hard disks than will fit within the 256 minor numbers? Well, if some are IDE/ESDI and the rest are SCSI, for example, this can still fit within the 256 values. This is because there are 256 minor numbers for each driver. So, if your boot drive is IDE and you have four or fewer SCSI drives, then the SCSI drives can still be represented by the 256 minor numbers.

A problem arises when you have more than four drives of a single type. You then end up banging against the limitation of 256 minor numbers. The reason for this limitation is that the original hard disk driver was designed to support ST-506 hard drives. (More on them later.) Since you could have two controllers, each with two drives, the 256 values for the minor number were split evenly among the four possible drives. Two bits were then used to represent the possible hard disks. When support was added for ESDI, IDE and OMTI drives, that scheme still worked nicely.

Then along came SCSI and gummed up the works. Each host adapter could support up to seven drives and you could have multiple host adapters. The result was that the minor numbering scheme was no longer sufficient. That meant that prior to ODT 2.0 and SCO UNIX 3.2.4.0, you could only have four hard disks on your system. This was compounded by the fact that drives were much smaller "in those days."

One solution would have been to redesign the minor numbering scheme. Even taking into account the minor numbers used for the active partition, the whole disk and the "unused partition", you could still use up all the minor numbers. The solution that SCO hit upon was the introduction of extended minor numbers. In reality, these are nothing more than another set of minor numbers, since they have a different major number. Internally, the system translates this "new" major number into the correct one.

When the mkdev hd script sees that you are installing the fifth SCSI drive, it uses the configure utility to create the new major number and the appropriate extended minor numbers. Therefore, the system administrator doesn't need to do the work themselves. The command would look something like this:

./configure -m 1 -b -c -a -X 256

The -m option is used to create the extended minor numbers associated with the major number 1. The -b and -c options tell configure to create both a block and a character interface to the new driver. The -a is used to add the drives. Lastly, the -X option indicates the offset for the new minor numbers. This is the key. When we get done, a new entry exists in mdevice. Although not next to each other in the file, the two entries for the hard disk might look like this:

hd   hoc   irobCcGk   hd    1    1   1     1   -1
hd   hoc   iobCcGkM   hd   79   79   1   256   -1

In our discussion of mdevice, I said that the first column was the name of the driver. In both cases, this is hd, as both of these entries refer to the same hd driver. (Remember that the hd driver is just a front-end to the real hard disk driver, i.e., wd0, Sdsk, etc.) The second column was the function list. We see here that both drivers have the same functions. In the third column we see that the 'r' is missing from the second entry. This means that it is not required, although the first one is.

The next difference is the 'M' in the second entry. This tells the driver that this line is for the extended minor numbers. In turn, this flag changes the meaning of columns 7 and 8. Remember that these two columns are device dependent. In this case the seventh column is the "base" major number, here 1, which we know is the hard disk driver. Column 8 is the offset.

Let's look at an example. Assume that we have a SCSI boot drive and we want to look at the first file system on the active partition of the fifth drive. Since the fifth drive must be included in the extended minor numbers, it must also have a different major number. Such an entry in the /dev directory might look like this:

br--r----- 1 root backup 79, 40 Sep 26 13:13 /dev/data5

When the kernel accesses this filesystem, it sees the major number 79, like it should. Looking in its driver table, offset 79 is for the new hd driver with the extended minor numbers. The minor number here (40) is added to the offset to get the "absolute" minor number. Here we have the offset (256) plus the minor number (40), or an absolute minor number of 296. In binary, this is represented by: 100101000.

You may have noticed that this is nine digits and not the eight we see with "normal" minor numbers. This makes sense, since you can't represent 296 with eight bits. The interesting thing is that, for the most part, the minor numbering scheme is the same. The low order bits represent the division number, the next represent the partition and the high order bits represent the physical drive. Graphically it looks like this:

1 0 0 X X X X X X
1 0 1 X X X X X X
1 1 0 X X X X X X
1 1 1 X X X X X X
\___/ \___/ \___/
  |     |     |
  |     |   division
  |     |
  |  partition
  |
physical drive

In our example the binary representation was 100101000. The low order bits are 000, which is obviously 0. This is the 0th division on whatever partition. The next three bits (indicating the partition) are 101. This is bits 5 and 3, or 2^5 + 2^3 = 32 + 8 = 40. The high order bits are 100, which just has the ninth bit (2^8) set, or 256. This gives us 256 + 40 = 296.
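
If you don't feel like counting bits by hand, bc will happily do the conversion for you. For example:

echo "obase=2; 256+40" | bc

prints 100101000, the same nine-bit value we just worked out.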

We can also look at the high order bits separately. Taken by themselves, the bits 100 represent the number 4. Since we started counting at 0, this is the fifth hard disk. If we added another disk, these high order bits would be 101, which would be the number 5, or the sixth drive. If we add a seventh and an eighth drive, we would still have enough minor numbers. Once we got to the ninth drive, we would once again run out of minor numbers. No worries, the mkdev hd script catches that as well and creates a new set of extended minor numbers. The call to configure is the same, but mdevice will have a new entry. All three would look like this:

hd   hoc   irobCcGk   hd    1    1   1     1   -1
hd   hoc   iobCcGkM   hd   79   79   1   256   -1
hd   hoc   iobCcGkM   hd   82   82   1   512   -1

There are two things to note here. First, the major numbers are not in order. Remember that major numbers are allocated based on the order the devices were installed. Since there is a gap between 79 and 82, this implies that other devices were added between the time the 5th drive was added and the 9th drive. The next thing to notice is column 8. In the third entry, the offset is now 512. So when we have a device node like this:

br--r----- 1 root backup 82, 40 Sep 26 13:13 /dev/data9

the kernel knows to add 512 to the 40, to give the absolute minor number. In this case, the binary representation of 552 is 1000101000. This has the necessary 10th bit to allow us to get numbers greater than 512. With the high order bits being 1000, this is the number 8, or the ninth drive.

You might think that this 10th digit (a result of the second set of extended minor numbers) will allow you to add eight more drives. While this is correct mathematically, this is not the way the driver works. Before we added the second set, the maximum minor number we could have was 511. Adding 256 to this for the new offset gives us 767. In binary this is: 1011111111. In this case, the two high order bits (10) remain the same; only the two representing the drive number can change. These two bits mean only four new drives can be added. Although this might seem illogical, it does make the extended minor numbers much simpler to calculate.

So, how many drives can you have? Well, theoretically, there can be 255 major numbers. If you used all the remaining ones, times the four drives for each extended minor number set, you could have several hundred drives. The only practical experience I have is that while I was in SCO Support, someone managed to get about two dozen drives on the system. The limitation will probably be the number of slots in your machine. If you have six slots in your machine, and assuming that the video card, serial and parallel ports were built into the motherboard, you could have six host adapters in your machine. This means 42 drives. Even if you have 10 slots, the 70 that's possible is well below the "hundreds" that are theoretically possible.

The Next Step

Now that you have an understanding of what goes into making the kernel and its environment, where should you go? My recommendation is to go through this chapter again, this time sitting in front of a system. With the conceptual information gained by reading this chapter the first time, you can better understand how your system is configured. Look at the files I talked about. Compare the values you find to what is default or expected. Think about what those changes mean and what effects they have on your system.


1 Names changed to protect me from lawsuits


Next: File Systems and Files

Index

Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/