Jim Mohr's SCO Companion


Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/


The Operating System and Its Environment

In this chapter we are going to go into some details about what makes up an SCO UNIX operating system. Here, I am not talking about the product SCO UNIX or either of the bundled products ODT or OpenServer. Here, I am talking strictly about the software that manages and controls your computer.

Since an operating system is of little use without hardware and other software, we are going to discuss how the operating system interacts with other parts of the ODT and OpenServer products. We will also talk about what goes into making the kernel, what components it is made of and what you can do to influence the creation of a new kernel.

Much of this information is far beyond what many system administrators are required to have for their jobs. So why go over it? Because what is required and what the administrator should know are two different things. Many calls I received while in SCO support could have been avoided had the administrator understood the meaning of a message on the system console or the effects of making changes. By going over the details of how the kernel behaves, you will be in a better position to understand what is happening.

The Kernel - The Heartbeat of SCO UNIX

If any single aspect of the SCO product could be called "UNIX," then it would be the kernel. Okay. So what is the kernel? Well, on the hard disk it is represented by the file /unix. (On SCO OpenServer, this is a symbolic link to /stand/unix.) Just as a program like /bin/date is a collection of bytes and isn't very useful until it is loaded in memory and running, the same applies to /unix.

However, once the /unix program is loaded into memory and starts its work, it becomes "the kernel" and has many responsibilities. Perhaps the two most important are process management and file management. However, the kernel is responsible for many other things. One aspect is I/O management, which is essentially the accessing of all the peripheral devices. The kernel is also responsible for security, which means ensuring that only authorized users gain access to the system and, once in, that they only do what they should.

Processes

From the user's perspective, perhaps the most obvious aspect is process management. This is the part of the kernel that ensures that each process gets its turn to run on the CPU. This is also the part that makes sure that individual processes don't go around "trouncing" on other processes by writing to areas of memory that belong to someone else. To do this, the kernel keeps track of many different structures that are maintained both on a per-user basis as well as system wide.

As we talked about in the section on operating system basics, a process is the running instance of a program (a program simply being the bytes on the disk). One of the most powerful aspects of SCO UNIX is its ability not only to keep many processes in memory at once, but to switch between them fast enough to make it appear as if they were all running at the same time.

As a process is running, it works within its context. It is also common to say that the CPU is operating within the context of a specific process. The context of a process is all of the characteristics, settings, values, etc., which that particular program uses as it runs, as well as those that it needs to run. Even the internal state of the CPU and the contents of all its registers are part of the context of the process. When a process has finished its turn on the CPU and another process gets to run, the act of changing from one process to another is called a context switch.


Figure 0-1 Context Switch

We can say that a process' context is defined by its uarea (also called its ublock). The uarea contains information such as the effective and real UID, the effective and real GID, the system call error return value and a pointer to the system call arguments on that process' system stack.

The structure of the uarea is defined by the user structure in <sys/user.h>. There is a special part of the kernel's private memory that holds the uarea of the currently running process. When a context switch occurs, it is the uarea that is switched out. All other parts of the process remain where they are. The uarea of the next process is copied into the exact same place in memory as the uarea for the old process. This way the kernel does not have to make any adjustments and knows exactly where to look for the uarea. It will always be able to access the uarea of the currently running process by accessing the same area in memory.
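To make this a little more concrete, here is a much-simplified sketch of the kind of thing the uarea contains. This is not the real user structure from <sys/user.h>; the field names are invented purely for illustration.

/*
 * Illustrative only -- NOT the real user structure from <sys/user.h>.
 */
struct user_sketch {
    unsigned short u_uid;     /* effective user ID                       */
    unsigned short u_ruid;    /* real user ID                            */
    unsigned short u_gid;     /* effective group ID                      */
    unsigned short u_rgid;    /* real group ID                           */
    int            u_error;   /* error return value of the last
                                 system call                             */
    int           *u_ap;      /* pointer to the system call arguments
                                 (at most six, each four bytes long)
                                 on this process' system stack           */
};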

One of the pieces of information that the uarea contains is the process' Local Descriptor Table (LDT). A descriptor is a 64-bit data structure that is used by the process to gain access to different parts of the system, that is, different parts of memory or different segments. Despite a common misunderstanding, SCO UNIX does use a segmented memory architecture. In older CPUs, segments were a way to get around memory access limitations. By referring to memory addresses as offsets within a given segment, more memory could be addressed than if memory were looked at as a single block. The key difference with SCO UNIX is that each of these segments is 4GB and not the 64K they were originally.

The descriptors are held in descriptor tables. The LDT is used to keep track of a process' segments, also called regions. That is, these are descriptors that are local to the process. The Global Descriptor Table (GDT) keeps track of the kernel's segments. Since there are many processes running, there will be many LDTs. These are part of the process' context. However, there is only one GDT, as there is only one kernel.

Within the uarea (also called the ublock) is a pointer to another key aspect of a process' context: its Task State Segment (TSS). The TSS contains all the registers in the CPU. It is the contents of all the registers that defines the state the CPU is currently running in. In other words, the registers say what a given process is doing at any given moment. Keeping track of these registers is obviously vital to the concept of multi-tasking.

By saving the registers in the TSS, the system can reload them when this process gets its turn again and continue where it left off. Since all of the registers are reloaded to their previous values, the process simply picks up where it left off as if nothing had happened. If you are curious about all that the TSS holds, take a look in <sys/tss.h>.

This brings up two new issues: system calls and stacks. A system call is a programming term for a very low-level function. These are functions that are "internal" to the operating system and are used to access the internals of the operating system, such as in the device drivers that ultimately access the hardware. Compare this to library calls, which are made up of system calls.

A stack is a means of keeping track of where a process has been. Like a stack of plates, objects are pushed onto the stack and popped off the stack. Therefore, things that are pushed onto the stack last are the first things popped off. When calling routines, certain values are pushed onto the stack for safe-keeping. These include the variables to be passed to the function, plus the location the system should return to after completing the function. When returning from that routine, these values are retrieved by popping them off the stack.
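The following toy example shows the last-in, first-out behavior in C. It is only an illustration of the idea of a stack, not how the kernel implements its stacks.

#include <stdio.h>

#define STACK_SIZE 16

static int stack[STACK_SIZE];
static int top = 0;                 /* next free slot */

static void push(int value)
{
    stack[top++] = value;           /* put the value on top */
}

static int pop(void)
{
    return stack[--top];            /* take the top value back off */
}

int main(void)
{
    push(10);                       /* e.g. an argument to a routine */
    push(20);                       /* e.g. the return address       */

    printf("%d\n", pop());          /* prints 20: last in, first out */
    printf("%d\n", pop());          /* prints 10                     */
    return 0;
}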

If you look in <sys/user.h> at the size of the system call argument pointer (u_ap) you will see it is only large enough to hold six arguments, each four bytes long. Therefore, you will never see a system call with more than six arguments.

Part of the uarea is a pointer to that process' entry in the process table. The process table, as its name implies, is a table containing information about all the processes on the system whether that process is currently running or not. Each entry in the process table is defined in <sys/proc.h>. The principle that a process may be in memory, but not actually running is important and we will get into more details about the life of a process shortly.

In ODT 3.0 and earlier, the size of this table was a set value and determined by the kernel parameter NPROC. Although you could change this value, you needed to build a new kernel and reboot for the change to take effect. If all the entries in the process table are filled and someone tries to start a new process, it will fail with the error message:

newproc - Process table overflow ( NPROC = x exceeded)

where x is the defined value of NPROC.

One nice thing is that if all but the last slot are taken up, only a process with the UID of root can take it. This prevents a process from creating more and more processes and stopping the system completely. Thus, the root user has one last chance to stop it. OpenServer changed that with the introduction of dynamically configured parameters. Many of the parameters that had to be configured by hand will now grow as the need for them increases.

Just how is a process created? Well, the first thing that happens is that one process uses the fork() system call. Like a fork in the road, it starts off as a single entity and then splits into two. When one process uses the fork() system call, an exact copy of itself is created in memory and the uareas are essentially identical. The value in each CPU register is the same, so both copies of this process are at the exact same place in their code. Each of the variables also has the exact same value. There are two exceptions: the process ID number and the return value of the fork() system call.


Figure 0-2 Creating a new process

Like users and their UID, each process is referred to by its process ID number, or PID. This is a unique number which is actually the process' slot number in the process table. When a fork() system call is made, the value returned by the fork() to the calling process is the PID of the newly created process. Since the new copy didn't actually make the fork() call, the return value in the copy is 0. This is how a process spawns or forks a child process. The process that called the fork() is the parent process of this new process, which is the child process. Note that I intentionally said the parent process and a child process. A process can fork many child processes, but has only one parent.

Almost always, a program will keep track of that return value and will then change its behavior based on that value. One of the most common things is for the child to issue an exec() system call. Although it takes the fork() system call to create the space that will be utilized by the new process, it is the exec() system call that causes this space to be overwritten with the new program.
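A minimal sketch of the fork()-exec() pair might look like this. (The choice of /bin/date as the program to run is just for illustration.)

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();                 /* both copies continue from here */

    if (pid == 0) {
        /* child: fork() returned 0, so overwrite this copy with date */
        execl("/bin/date", "date", (char *)0);
        perror("execl");                /* only reached if exec fails */
        return 1;
    } else if (pid > 0) {
        /* parent: fork() returned the PID of the new child */
        printf("started child with PID %ld\n", (long)pid);
        wait((int *)0);                 /* wait for the child to exit */
    } else {
        perror("fork");                 /* e.g. process table overflow */
        return 1;
    }
    return 0;
}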

At the beginning of every executable program is an area simply called the "header". This header describes the contents of the file, that is, how the file is to be interpreted. This could be information to tell the system whether the file is a 286 or 386 binary, the size of the text and data segments, or where the symbol table is. The symbol table is basically the translation from variable names that we humans understand to their machine language equivalents.

The header contains the locations of the text and data segments. As we talked about before, a segment is a portion of the program. The portion of the program that contains the executable instructions is called the text segment. The portion containing pre-initialized data is the data segment. Pre-initialized data are variables, structures, arrays, etc. that have their value already set, even before the program is run. The process is given descriptors for each of the segments. These descriptors are referred to as region descriptors, since under SCO UNIX segments are more commonly referred to as regions.

In contrast to other operating systems running on Intel-based CPUs, SCO UNIX has only one region (or segment) each for the text, data and stack. The reason that I didn't mention the stack region until now is that the stack region is created when the process is created. Since the stack is used to keep track of where the process has been and what it has done, there is no need to create it until the process starts.

Another region that I haven't talked about until now is not always used: the shared data region. Shared data is an area of memory that is accessible by more than one process. Do you remember from our discussion on operating system basics when I said that part of the job of the operating system was to keep processes from accessing areas of memory that they weren't supposed to? So, what if they want to? What if they are allowed to? That is where the shared data region comes in.

If one process tells another where the shared memory region is (by giving it a pointer to it), then that process can access it. The way to keep unwanted processes away is simply not to tell them. In this way, each process that is allowed can use the data, and the region only goes away when the last process disappears. How several processes might look in memory is shown in Figure 0-3.


Figure 0-3 Process Regions

If we take a look at Figure 0-3, we see three processes. In all three instances, each process has its own data and stack regions. However, process A and process B share a text region. That is, process A and process B have loaded the same executable off the hard disk. Therefore, they are sharing the same instructions. Note that in reality this is a bit more complicated, since the two processes may not be executing the exact same instructions at any given moment.
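To make the shared data region a little more concrete, here is a minimal sketch using the System V shared memory system calls (shmget(), shmat() and shmdt()). The key value and the size are made up for illustration; any process that is told the key can attach the same region.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = 0x1234;                       /* agreed-upon key (illustrative) */
    int   id  = shmget(key, 4096, IPC_CREAT | 0666);
    char *mem;

    if (id == -1) {
        perror("shmget");
        return 1;
    }

    mem = (char *)shmat(id, (void *)0, 0);    /* attach the shared region */
    if (mem == (char *)-1) {
        perror("shmat");
        return 1;
    }

    strcpy(mem, "hello from a shared data region");
    printf("%s\n", mem);

    shmdt(mem);                               /* detach; other processes that
                                                 know the key can still attach */
    return 0;
}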

Each process has at least a text, data and stack region. In addition, each process is created in the same way. An existing process will (normally) use the fork()-exec() system call pair to create another process. However, this brings up an interesting question, similar to "Who or what created God?” If every process has to be created by another, then who or what created the first process?

When the computer is turned on, it goes through some wild gyrations that we will talk about later. At the end of the boot process the system loads and executes the /unix binary, the kernel itself. One of the last things the kernel does is to "force" the creation of a single process, which then becomes the great-grandparent of all the other processes.

This first, primordial process is the sched process, also referred to as the swapper, and it has a PID of 0. Its function is to free up memory, by hook or by crook. If another process can spare a few pages, it will take those. If not, sched may swap out an entire process to the hard disk, hence the name. Sched is only context switched in when the amount of free memory is less than the running process needs.

The first created process is init, with a PID of 1. All other processes can trace their ancestry back to init. Init's job is basically to read the entries in the file /etc/inittab and execute different programs. One of the things it does is to start the getty program on all the login terminals, which eventually provides every user with their shell.

Another system process is vhand, whose PID is 2. This is the paging daemon or page stealer. If free memory on the system gets below a specific low-water mark (the kernel tunable parameter GPGSLO), then vhand is allowed to run at every clock interrupt. (Whether it runs or not will depend on many things, which we will talk about later.) Until the amount of free memory gets above the high-water mark (GPGSHI), vhand will continue to "steal" pages.

Next is bdflush, with a PID of 3. This is the buffer flushing daemon. Its job is to clean out any "dirty" buffers inside the system's buffer cache. A dirty buffer is one that contains data that has been written to by a program, but hasn't yet been written out to the disk. It is the job of bdflush to write this data out to the hard disk at regular intervals. These intervals are defined by the kernel tunable parameter BDFLUSHR, which has a default of 30 seconds. The kernel tunable parameter NAUTUP specifies how long a buffer must have been dirty before it is "cleaned". (Note that by "cleaning" I don't mean that the data is erased. It is simply written to the disk.)

All processes, including the ones I described above, operate in one of two modes: user and system mode. In the section on the CPU we will talk about privilege levels. An Intel 80386 and later has four privilege levels, 0-3. SCO UNIX uses only the two extremes: 0 and 3. Processes running in user mode are running at privilege level 3 within the CPU. Processes running in system mode are running at privilege level 0. (More on this in a moment.)


Figure 0-4 Process modes

In user mode, a process is executing instructions from within its own text segment, it references its own data segment and uses its own stack. Processes switch from user mode to kernel mode by making system calls. Once in system mode, the instructions executed are those within the kernel's text segment, the kernel's data segment is used, and a system stack is used within the process' uarea.

Although the process is going through a lot of changes when it makes a system call, keep in mind that this is not a context switch. This is still the same process. It's just that it is operating at a higher privilege.

The Life-Cycle of Processes

From the time a process is fork()'ed into existence until the time it has completed its job and disappears from the process table, it goes through many different states. The state a process is in changes many times during its "life". These changes can occur, for example, when the process makes a system call, when it is someone else's turn to run, when an interrupt occurs or when the process asks for a resource that is currently not available.

A commonly used model shows processes operating in one of eight separate states. However, in the file <sys/proc.h> there are only seven separate states listed. To understand the transitions better, the following eight states are used:


  1. executing in user mode

  2. executing in kernel mode

  3. ready to run

  4. sleeping in main memory

  5. ready to run, but swapped out

  6. sleeping, but swapped out

  7. newly created, not ready to run and not sleeping

  8. issued exit system call (zombie)

The states listed here are intended to describe what is happening conceptually and not to indicate what "official" state a process is in. The official states are listed in Table 0.1.

SSLEEP       awaiting an event
SRUN         running
SZOMB        terminated, but not waited for
SSTOP        process stopped by a debugger
SIDL         process being created
SONPROC      process is on the processor
SXBRK        process needs more memory

Table 0.1 Process States

In my list of eight states there was no mention of a process actually being on the processor (SONPROC). Processes that are running in kernel mode or running in user mode are both in the SRUN state. Although there is no 1:1 match-up, hopefully you'll see what each state means as we go through the following description.

Figure 0-5 Process States

A newly created process enters the system in state 7. If the process is simply a copy of the original process (a fork but no exec), it then begins running in whichever state the original process was in (1 or 2). (Why none of the other states? Because the original has to be running in order to fork a new process.) If an exec() is made, then this process will end up in kernel mode (2). It is possible that the fork()-exec() was done in system mode and the process never goes into state 1. However, this is highly unlikely.

While a process is running, an interrupt may be generated (more often than not this is the system clock) and the currently running process is pre-empted, putting it in state 3 (ready to run). It is still ready to run and in main memory; the only difference is that the process just got kicked off the processor.

When the process makes a system call while in user mode (1), it moves into state 2 where it begins running in kernel mode. Assume at this point that the system call made was to read a file on the hard disk. Since the read is not carried out immediately, the process goes to sleep, waiting on the event that the system has read the disk and the data is ready. It is now in state 4. When the data is ready, the process is woken up. This does not mean it runs immediately, but rather it is once again ready to run in main memory (3).

If sched discovers that there is not enough memory for all the processes, it may decide to swap out one or more processes. The first choice is those that are sleeping, since they are not ready to run. Such a process now finds itself in state 6, where it is sleeping, but swapped out. It is also possible that there are no processes that are sleeping, but a lot of processes that are ready to run, so sched needs to swap one of those out instead. Therefore, a process could move from state 3 (ready to run) to state 5 (ready to run, but swapped out). However, as we'll see in a moment, most processes are sleeping.

If a process that was asleep is woken up (perhaps the data is ready), it moves from state 4 (sleeping in main memory) to state 3 (ready to run). However, a process cannot move directly from state 6 (sleeping, but swapped out) to state 3 (ready to run). This requires two transitions. Since it is not effective to swap in processes that are not ready to run, sleeping processes will not be swapped in. This is simply because the system has no way of knowing that this process will be awoken soon, since swapping and waking a process are two separate actions. Instead, the system must first make the process ready to run, so it moves from state 6 (sleeping, but swapped out) to state 5 (ready to run, but swapped out). When the process is swapped back in, it returns to where it was. This can be either user mode (1) or kernel mode (2).

Processes can end their life by either explicitly calling the exit() system call or having it called for them. The exit() system call releases all the data structures that the process was using. The one exception is the slot in the process table, which is the responsibility of the init process. The reason for hanging around is that the slot in the process table is used to hold the exit code of the exiting process. This can be used by the parent process to determine whether the process did what it was supposed to or whether it ran into problems. The process shows that it has terminated by putting itself into state 8, and it becomes a "zombie". Once here, it can never run again, as nothing exists other than the entry in the process table.

This is the reason why you cannot "kill” a zombie process. There is nothing there to kill. In order to kill a process, you need to send it a signal (more on them later). Since there is nothing there to receive or process that signal, trying to kill it makes little sense. The only thing to do is to let the system clean up.

If the exiting process has any children, they are "inherited” by init. One of the values stored in the process structure is the PID of that process' parent process. This value is (logically) referred to as the parent process ID or PPID. When a process is inherited by init, the value of their PPID is changed to 1 (the PID of init).

A process' state change can cause a context switch in several different cases. One is when the process voluntarily goes to sleep. This can happen when the process needs a resource that is not immediately available. A very common example is your login shell. You type in a command, the command is executed and you are back at a shell prompt. Between the time the command finishes and you input your next command, a very long time could pass, at least two or three seconds.

Rather than constantly checking the keyboard for input, the shell puts itself to sleep waiting on an event. That event is an interrupt from the keyboard to say "Hey! I have input for you.” When a process puts itself to sleep, it sleeps on a particular wait channel (WCHAN). When the event occurs that is associated with that wait channel, every process waiting on that wait channel is woken up.

There is probably only one process waiting on input from your keyboard at any given time. However, many processes could be waiting for data from the hard disk. If so, there might be dozens of processes all waiting on the same wait channel. All are awoken when the hard disk is ready. It may be that the hard disk has read only the data for a subset of the processes waiting. Therefore, (if the program is correctly written) the processes check to see if their data is ready for them. If not, they put themselves to sleep on the same wait channel.

When a process puts itself to sleep, it is voluntarily giving up the CPU. It may be that this process had just started its turn when it noticed that it didn't have some resource it needed. Rather than forcing other processes to wait until it got its "fair share" of the CPU, the process is being nice and letting some other process have a turn on the CPU.

Because the process is being so nice and letting others have a turn, the kernel is going to be nice to the process. One of the things the kernel allows is that a process which puts itself to sleep can set the priority at which it will run when it awakens. Normally, the kernel's process scheduling algorithm calculates the priorities of all the processes. However, in exchange for voluntarily giving up the CPU, the process is allowed to choose its own priority.

Process Scheduling

Scheduling processes is not as easy as finding the one that has been waiting the longest. Older operating systems used to do this kind of scheduling; it was referred to as "round-robin". The processes could be thought of as sitting in a circle. The scheduling algorithm could then be thought of as a pointer that moves around the circle, getting to each process in turn.

The problem with this type of scheduling is that you may have ten processes all waiting for more memory. Remember from our discussion above that the sched process is responsible for finding memory for processes that don't have enough. Like other processes, sched needs a turn on the processor before it can do its work. If every process had to wait until sched ran, then those that run right before sched may never run.

This is because sched would run and free up memory. By the time the turn got around to the processes just before sched, all the free memory would already be taken. Those processes have to wait again. Next, sched runs again and frees memory, which is then taken up by the other processes.

Instead, SCO UNIX uses a scheduling method that takes many things into consideration, not just the actual priority of the process itself or how long it has been since it had a turn on the CPU. The priority is a calculated value, based on several factors. The first is what priority the process already has.

Another factor is recent CPU usage. The longer a process runs, the more it has used the CPU. Therefore, to make it easier for faster jobs to get in and out quickly, the longer a process runs, the lower its priority. For example, a process that is sorting a large file needs a longer time to complete than the date command. If the process executing the date command is given a higher priority because it is relatively fast, it leaves the process table quickly, so there are fewer processes to deal with and every process gets a turn more often.

Keep in mind that the system has no way of knowing in advance just how long the date command will run. Therefore, the system can't simply give it a higher priority. However, because the sort process is given a lower priority after it has been running a while, the date command appears to run faster. I say "appears" since the date command needs the same number of CPU cycles to complete. It just gets them sooner.

SCO UNIX also allows you to be nice to your fellow processes. If you feel that your work is not as important as someone else's, you might want to consider being nice to them. This is done with the nice command, the syntax of which is:

nice <nice_value> <command>

For example, if you wanted to run the date command with a lower priority, it could be run like this:

nice -10 date

This has the effect of lowering the starting priority of the date command by increasing its nice value by 10. Note that only root can increase a process' priority, that is, use a negative nice value. The nice command does not change the nice value of processes that are already running, but child processes inherit the nice value of their parent. By default, processes started by users have a nice value of 20. This can be changed within sysadmsh in ODT 3.0 and with the usermod command in OpenServer. Therefore, using the nice command in the example above increases the nice value to 30. OpenServer also provides a tool (renice) which allows you to change the priority of running processes. See the appropriate man-pages for more details.

The formula used to determine a process' priority is:

priority = (recent_cpu_usage / 2) + nice_value + 40

Note that the value 40 is hard coded in the calculation. This is because the highest priority a user process can have is 40. Those of you who have been paying attention might have noticed something odd. If you add the recent CPU usage to the nice value, then add 40, the longer the process is on the CPU the higher the priority.

Well, sort of.

The numeric value calculated for the priority is the opposite of what we normally think of as priority. A better way of thinking about it is like the pull-down numbered tickets you get at the ice cream store. The lower the number, the sooner you get served. So it is for processes as well.

Because of the way the priority is calculated, process scheduling in SCO UNIX has a couple of interesting properties. First, as time passes, the priority of a process running in user mode will decrease. The decrease in priority means that it will be less likely to run the next time there is a context switch. Conversely, if a process hasn't run for a long time, its recent CPU usage is lower. Since it hasn't used the CPU much recently, its priority will increase. Such processes are more likely to run than those that just got their turn (assuming all other factors are the same). Figure 0-6 shows three processes as they run and their priorities. This will give you an idea of how priorities change over time.


Time (secs)     Process A           Process B           Process C
                priority   CPU      priority   CPU      priority   CPU

0 - 1              60     0->80        60       0          60       0
1 - 2              80    40            60     0->80        60       0
2 - 3              70    20            80    40            60     0->80
3 - 4              65    10->80        70    20            80    40
4 - 5              80    40            65    10->80        70    20
5 - 6              70    20            80    40            65    10->80

(The priorities shown are those calculated at the start of each second. The
arrow marks the process that runs during that second; its CPU usage counts up
from the value shown to the maximum of 80. At each clock second the usage is
decayed and the priorities are recalculated.)

Figure 0-6 Changes in Process Priority

This table makes several large assumptions, which are probably not true for a "real" system. The first is that these are the only three processes running on your system. Since you have at least four processes running even in maintenance mode, this is unrealistic. The second assumption is that all processes start with an initial priority of 40 and nothing has changed their nice value of 20. We also assume that there is nothing else on the system that would change the flow of things, such as an interrupt from a hardware device.

Another assumption is that each process gets to run for a full second. The maximum time a process gets the CPU is defined by the MAXSLICE kernel tunable, which by default is one second. The number of times the clock interrupts per second, and therefore the number of times the priority is recalculated, is defined by the HZ system variable. By default this is 100, that is, 100 times a second. However, we are assuming that the priorities are only recalculated once a second instead of 100 times.

If we look at Process A, which is the first to run, we see that between second 0 and second 1 its CPU usage went from 0 to the maximum of 80. When the clock tick occurred (the clock generated an interrupt), the priorities of all the processes were recalculated. Since Process A's CPU usage is at 80, it is first decayed (halved) to 40. The formula then takes half of that (20), adds the nice value of 20 and the hard-coded 40, giving 80 as the new priority for Process A. Since Processes B and C haven't run yet, their CPU time is 0, so we only add their nice value of 20 to the static value of 40 to keep them at 60.

Since Process A is now at priority 80, both Process B and Process C have higher priorities. So now let's say B runs. Between seconds 1 and 2, process B changed just like Process A did between seconds 0 and 1.

When the clock tick occurs, Process B's priority is calculated just like Process A's was. Process B is now at 80. However, Process A has had its priority recalculated as well; it is now at 70.

All this time, Process C was not on the CPU, so its priority hasn't changed from the original 60. Since this is the lowest value, it has the highest priority and now gets a turn on the CPU. When its turn is finished (at the end of second 2) and priorities are recalculated, Process C is now at 80 and Process B is at 70, but Process A has been recalculated to be 65. Since this is the lowest, Process A gets a chance to run again.
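If you want to play with these numbers yourself, the following small program simulates the recalculation just described. It assumes, as the table does, that the running process' usage climbs to the maximum of 80 during its second and that usage is halved ("decayed") at every clock second.

#include <stdio.h>

#define NPROCS 3
#define NICE   20
#define MAXCPU 80

int main(void)
{
    int cpu[NPROCS] = {0, 0, 0};
    int prio[NPROCS];
    int sec, i;

    for (sec = 0; sec < 6; sec++) {
        int run = 0;

        /* recalculate priorities; the lowest value runs next */
        for (i = 0; i < NPROCS; i++) {
            prio[i] = cpu[i] / 2 + NICE + 40;
            if (prio[i] < prio[run])
                run = i;
        }

        printf("second %d: priorities A=%d B=%d C=%d, process %c runs\n",
               sec, prio[0], prio[1], prio[2], 'A' + run);

        cpu[run] = MAXCPU;              /* the running process uses the CPU  */

        for (i = 0; i < NPROCS; i++)    /* decay everyone's usage at the tick */
            cpu[i] /= 2;
    }
    return 0;
}

Running it prints the same sequence of priorities shown in Figure 0-6.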

In reality, things are not that simple. There are dozens of processes competing for the CPU. There are different start-up priorities, different nice values and different demands on the system. These demands include requests for services from peripherals such as hard disks. Responding to these requests can almost instantly change which process is running.

Interestingly enough, sudden changes in who is on the CPU are, in part, due to the priority. However, all this time I have intentionally avoided mentioning the fact that regardless of what a process' priority is, it will not run unless it is in the run queue. This is the state SRUN. A process being in the run queue does not mean that it is running, just that it can run if it has the highest priority.

Remember when I said that user processes can never have a priority value lower than 40? Well, in case you hadn't guessed (or didn't already know), system processes like sched, vhand and init almost exclusively operate with priority values less than 40. Well, if they always have a lower priority value, why don't they always get to run? Simple. They aren't always in the run queue.

Interrupts, Exceptions and Traps

Normally, processes like sched, vhand and init, as well as most user processes are sleeping waiting on some event. When that event happens, these processes are called into action. Remember, it is the responsibility of the sched process to free up memory when a process runs short of it. So, it is not until memory is needed that sched starts up. How is it that sched knows?

In chapter 1, we talked about virtual memory and I mentioned page faults. When a process makes reference to a place in its virtual memory space that does not yet exist in physical memory, a page fault occurs.

Faults belong to a group of system events called exceptions. An exception is simply something that occurs outside of what is normally expected. Faults (exceptions) can occur either before or during the execution of an instruction.

For example, if an instruction needs to be read that is not yet in memory, the exception (page fault) occurs before the instruction starts being executed. On the other hand, if the instruction is supposed to read data from a virtual memory location that isn't in physical memory, the exception occurs during the execution of the instruction. In cases like these, once the missing memory location is loaded into physical memory, the CPU can start the instruction.

Traps are exceptions that occur after an instruction has been executed. For example, attempting to divide by zero generates an exception. However, in this case it doesn't make sense to restart the instruction, since every time we try to run that instruction it will still come up with a Divide-by-Zero exception. Unlike a fault, all memory references have already been read in before we start to execute the instruction, so there is nothing missing to load and retry.

It is also possible for processes to generate exceptions intentionally. These programmed exceptions are called software interrupts.

When any one of these exceptions occurs, the system must react to it. In order to react, the system will usually switch to another process to deal with the exception, which means a context switch. In our discussion of process scheduling, I mentioned that at every clock tick the priority of every process is recalculated. In order to make those calculations, something other than those processes has to run.

In both ODT 3.0 and OpenServer, the system timer (or clock) is programmed to generate a hardware interrupt 100 times a second. (This is defined by the HZ system parameter.) The interrupt is accomplished by sending a signal to a special chip on the motherboard called an interrupt controller. (We go into more detail about these in the chapter on hardware.) The interrupt controller then sends an interrupt to the CPU. When the CPU gets this signal, it knows that a clock tick has occurred and it jumps to a special part of the kernel that handles the clock interrupt. Scheduling priorities are also recalculated within this same section of code.

Because the system might be doing something more important when the clock generates an interrupt, there is a way to turn interrupts off. In other words, there is a way to mask out interrupts. Interrupts that can be masked out are called maskable interrupts. An example of something more important than the clock would be accepting input from the keyboard. This is why clock ticks are lost on systems with a lot of users inputting a lot of data, and as a result the system clock appears to slow down over time.

Sometimes events occur on the system that you need to know about no matter what. Imagine what would happen if memory were bad. If the system were in the middle of writing to the hard disk when it encountered the bad memory, the results could be disastrous. If the system recognizes the bad memory, the hardware generates an interrupt to alert the CPU. If the CPU had been told to ignore all hardware interrupts, it would ignore this one as well. Instead, the hardware has the ability to generate an interrupt that cannot be ignored or masked out. This is called a non-maskable interrupt. Non-maskable interrupts are generically referred to as NMIs.

When an interrupt or an exception occurs, it must be dealt with to ensure the integrity of the system. How the system reacts depends on whether it was an exception or an interrupt. In addition, what is done when the hard disk generates an interrupt is going to be different from what is done when the clock generates one.

Within the kernel is the Interrupt Descriptor Table (IDT). This is a list of descriptors (pointers) that point to the functions that handle the particular interrupt or exception. These functions are called the interrupt or exception handlers. When an interrupt or exception occurs, it has a particular value called an identifier or vector. Table 0.2 contains a list of the defined interrupt vectors. For more information see <sys/trap.h>.


Identifier      Description

0               Divide error
1               Debug exception
2               Nonmaskable interrupt
3               Breakpoint
4               Overflow
5               Bounds check
6               Invalid opcode
7               Co-processor not available
8               Double fault
9               (reserved)
10              Invalid TSS
11              Segment not present
12              Stack exception
13              General protection fault
14              Page fault
15              (reserved)
16              Co-processor error
17              Alignment error (80486)
18-31           (reserved)
32-255          External (HW) interrupts

Table 0.2 Interrupt Vectors

The reserved identifiers are not currently used by the CPU, but are reserved for possible future use. The kernel assigns interrupts that come from one of the interrupt controllers to identifiers 64 through 79. Identifiers 32-63 and 80-255 are not currently used by SCO UNIX. These identifiers are often referred to as vectors, and the Interrupt Descriptor Table (IDT) is often referred to as the interrupt vector table.

What these numbers really are is indices into the IDT. When an interrupt, exception or trap occurs, the system knows which number corresponds to that event. It then uses that number as an index into the IDT, which in turn points to the appropriate area of memory for handling the event. What this looks like graphically is shown in Figure 0-7.
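Conceptually, you can picture this as an array of pointers to handler functions, indexed by the vector number. The following sketch is purely an illustration (the real IDT holds descriptors, not plain C function pointers, and the handler names here are made up):

#include <stdio.h>

#define NVECTORS 256

typedef void (*handler_t)(void);

static void divide_error(void)  { printf("handling divide error\n"); }
static void page_fault(void)    { printf("handling page fault\n");   }
static void unexpected(void)    { printf("unexpected event\n");      }

static handler_t idt[NVECTORS];

int main(void)
{
    int i;

    for (i = 0; i < NVECTORS; i++)     /* default: nothing registered */
        idt[i] = unexpected;

    idt[0]  = divide_error;            /* identifier 0: divide error  */
    idt[14] = page_fault;              /* identifier 14: page fault   */

    idt[14]();                         /* an event with vector 14 occurs,
                                          so its handler is called    */
    return 0;
}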

It is possible for devices to share interrupts. That is, there are multiple devices on the system that are configured to the same interrupt. In fact, there are certain kinds of computers that are designed to allow devices to share interrupts (we'll talk about them in the hardware section). If the interrupt number is an offset into a table of pointers to interrupt routines, how does the kernel know which one to call?

Well, as it turns out, there are two IDTs: one for shared interrupts and one for non-shared interrupts. During a kernel relink (more on that later), the kernel determines whether the interrupt is shared or not. If it is, it places the pointer to that interrupt routine into the shared IDT. When an interrupt is generated, the interrupt routine for each of these devices is called. It is up to each interrupt routine to check whether its associated device really generated an interrupt or not. The order in which they are called is the order in which they were linked.

When an exception happens in user mode, the process passes through something called a trap gate. At this point, the CPU no longer uses the process' user stack, but rather the system stack within that process' uarea (each uarea has a portion set aside for the system stack). At this point, that process is operating in system (kernel) mode. That is, at the highest privilege level, 0.

Before the actual exception can be handled, the system needs to ensure that the process can return to the place in memory where it was when the exception occurred. This is done by a low-level interrupt handler. Part of what it does is to push (copy) all of the general purpose registers onto the process' system stack. This makes them available again when the process goes back to using the user stack.

The low level interrupt handler also determines whether the exception occurred in user mode or system mode. If the process was already in system mode when the exception occurred, there is no need to push the registers onto the process' system stack, as this is the stack that the process is already using.


Figure 0-7 First-Level Interrupt Handler

The kernel treats interrupts very similarly to the way it treats exceptions. All of the general purpose registers are pushed onto the system stack and a common interrupt handler is called. The current interrupt priority is saved and the new priority is loaded. This prevents interrupts at lower priority levels from interrupting the kernel as it is handling this interrupt. Then the real interrupt handler is called.

Since an exception is not fatal, the process will return from whence it came. It is possible that immediately upon return from system mode, a context switch occurs. This might be the result of an exception with a lower priority. Since it could not interrupt the process in kernel mode, it had to wait until it returned to user mode. Since the exception has a higher priority than the process when it is in user mode, a context switch occurs immediately after the process returns to user mode.

An exception occurring while the process is in system mode is not a normal occurrence. Exceptions are the result of software events; even a page fault can be considered a software event. Since the entire kernel is in memory all the time, a page fault should not happen while in kernel mode. When a page fault does happen in kernel mode, the kernel panics. There are special routines built into the kernel to deal with the panic and to help the system shut down as gracefully as possible. Should something else happen that causes another exception while the system is trying to panic, a double panic occurs.

This may sound confusing as I just said that a context switch could occur as the result of another exception. What this means is that the exception occurred in user mode, so there needs to be a jump to system mode. This does not mean that the process continues in system mode until it is finished. It may (depending on what it is doing) be context switched out. If another process has run before the first one gets its turn on the CPU again, that process may generate the exception.

There are a couple of cases where exceptions in system mode do not cause panics. The first is when you are debugging programs. In order to stop the flow of the program, exceptions are raised and you are brought into special routines within the debugger. Since exceptions are expected in such cases, it doesn't make sense to have the kernel panic.

The other case is when page faults occur as data is being copied from kernel memory space into user space. As I mentioned above, the kernel is completely in memory. Therefore, the data will have to be in memory to copy from the kernel space. However, it is possible that the area that the data needs to be copied to is not in physical memory. Therefore a page fault exception occurs, but this should not cause the system to panic.

Unlike exceptions, it is possible that another interrupt occurs while the kernel is handling the first one (and therefore is in system mode). If the second interrupt has a higher priority than the first, a context switch will occur and the new interrupt will be handled. If the second interrupt has the same or lower priority, then the kernel will "put it on hold”. These are not ignored, but rather saved to be dealt with later.

Signals

Signals are a way of sending simple messages to processes. Most of these messages are already defined and can be found in <sys/signal.h>. However, signals can only be processed when the process is in user mode. If a signal has been sent to a process that is in kernel mode, it is dealt with immediately upon returning to user mode.

Many signals have the ability to terminate a process immediately. However, most of these signals can be either ignored or dealt with by the process itself. If not, the kernel will take the default action specified for that signal. You can send signals to processes yourself by means of the kill command, as well as with the Delete key and Ctrl-\. However, you can only send signals to processes that you own, unless you are root, in which case you can send signals to any process.
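For example, a program can deal with a signal itself by installing its own handler. The following minimal sketch catches SIGINT (normally sent by the interrupt key) instead of letting the default action terminate the process:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t caught = 0;

static void on_interrupt(int signo)
{
    caught++;
    signal(SIGINT, on_interrupt);   /* re-arm; classic signal() semantics
                                       reset the handler to the default  */
}

int main(void)
{
    signal(SIGINT, on_interrupt);   /* handle the signal instead of dying */

    while (caught < 3) {
        pause();                    /* sleep until a signal arrives */
        printf("caught SIGINT %d time(s)\n", (int)caught);
    }
    return 0;
}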

It's possible that the process you want to send the signal to is sleeping. If that process is sleeping at an interruptible priority, then the process will awaken to handle the signal. Processes sleeping at a priority of 25 or less are not interruptible. The priority value of 25 is called PZERO.

The kernel keeps track of pending signals in the p_sig entry in each process' process structure. This is a 32-bit value, where each bit represents a single signal. Since it is only one bit per signal, there can only be one signal pending of each type. If there are different kinds of signals pending, the kernel has no way of determining which came in when. It will therefore process the signals starting at the lowest numbered signal and moving up.
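A sketch of this one-bit-per-signal bookkeeping might look like the following. The variable and function names are made up; only the idea of a 32-bit mask that is processed from the lowest numbered signal up comes from the text.

#include <stdio.h>

static unsigned long pending;                 /* like p_sig: one bit per signal */

static void post_signal(int signo)
{
    pending |= 1UL << (signo - 1);            /* mark the signal as pending */
}

int main(void)
{
    int signo;

    post_signal(2);                           /* SIGINT  */
    post_signal(9);                           /* SIGKILL */
    post_signal(2);                           /* a second SIGINT just sets a
                                                 bit that is already set     */

    /* process pending signals from the lowest number up */
    for (signo = 1; signo <= 32; signo++) {
        if (pending & (1UL << (signo - 1))) {
            printf("handling signal %d\n", signo);
            pending &= ~(1UL << (signo - 1));
        }
    }
    return 0;
}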

System Calls

If you are a programmer, you hopefully know what system calls are and have used them many times in your programs. If you are not a programmer, you may not know what they are, but you still use them thousands of times a day. All "low-level" operations on the system are handled by system calls. These include such things as reading from the disk or printing a message on the screen. System calls are the user's bridge between user space and kernel space. This also means that they are the bridge between a user application and the system hardware.

Collections of system calls are often combined into more complex tasks and put into libraries. When using one of the functions defined in a library, you call a library function or make a library call. Even when the library routine is intended to access the hardware, it will make a system call long before the hardware is touched.
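As a small illustration of the difference, here are two ways of printing a message: once through a library call (printf) and once through the write() system call directly. The library version eventually makes the same kind of write() call on the process' behalf.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("via the library call\n");            /* library call */

    write(1, "via the system call\n", 20);       /* system call: file
                                                    descriptor 1 is
                                                    standard output */
    return 0;
}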

Each system call has its own unique identifying number that can be found in <sys.s>. The kernel uses this number as an index into a table of system call entry points. These are pointers to where the system calls reside in memory along with the number of arguments that should be passed to them.

When a process makes a system call, the behavior is similar to that with interrupts and exceptions. Entry into kernel space is made through a call gate. There is a single call gate which serves as a guardian to the sacred area of kernel space. Like exception handling, the general purpose registers and the number of the system call are pushed onto the stack. Next, the system call handler is invoked, which calls the routine within the kernel that will do the actual work.

After entering the call gate, the kernel system call dispatcher "validates" the system call and hands control over to the kernel code that will actually perform the requested function. Although there are hundreds of library calls, each of these will call one or more system calls. In total, there are about 150 system calls, all of which have to pass through this one call gate. This ensures that user code moves up to the higher privilege level at a specific location within the kernel (a specific address). Therefore, uniform controls can be applied to ensure that a process is not doing something it shouldn't.

When the system call is complete, the system call dispatcher returns the result of the system call and any status codes (if applicable). As with interrupts and exceptions, the system checks to see whether a context switch should occur upon the return to user mode. If so, a context switch takes place. This is possible in situations where one process made a system call and an interrupt occurred while the process was in system mode. The kernel then issued a wakeup() to all processes waiting for data from the hard disk.

When the interrupt completes, the kernel may go back to the first one that made the system call. But, then again, there may be another one with a higher priority.

Paging and Swapping

In chapter 1, we talked about how the operating system uses capabilities of the CPU to make it appear as if you have more memory than you really do. This is the concept of virtual memory. In chapter 12, we'll go into details about how this is accomplished, that is, how the operating system and CPU work together to keep up this illusion. However, in order to make this section on the kernel complete, we need to talk about this a little from a software perspective.

One of the basic concepts in the SCO UNIX implementation of virtual memory is the concept of a page. A page is a 4KB area of memory and is the basic unit of memory that both the kernel and the CPU deal with. Although both can access individual bytes (or even bits), the amount of memory that is managed is usually in pages.
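For example, assuming 4KB pages, an address can be split into a page number and an offset within that page. The following is just an illustration of the arithmetic; the kernel and CPU do this with page tables and hardware.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
    unsigned long address = 0x0040A123UL;          /* an arbitrary example  */
    unsigned long page    = address / PAGE_SIZE;   /* which page            */
    unsigned long offset  = address % PAGE_SIZE;   /* where inside the page */

    printf("address 0x%lx -> page %lu, offset 0x%lx\n", address, page, offset);
    return 0;
}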

If you are reading a book, you do not need to have all the pages spread out on a table for you to work effectively. Just the one you are currently using. I remember many times in college when I had the entire table top covered with open books, including my notebook. As I was studying I would read a little from one book, take notes on what I read and if I needed more details on that subject, I would either go to a different page or a completely different book.

Virtual memory in SCO UNIX is very much like that. Just as I only need to have open the pages I was currently working with, a process needs only to have those pages in memory that it is working with. Like me, if the process needs a page that is not currently available (not in physical memory), it needs to go get it (usually from the hard disk).

If another student came along and wanted to use that table, there might be enough space for him or her to spread out books as well. If not, I would have to close some of my books (maybe putting book marks at the pages I was using). If another student came along or the table was fairly small, I might have to put some of the books away. SCO UNIX does that as well. If the text books represent the unchanging text portion of the program and the notebook represents the changing data, things might be a little clearer.

It is the responsibility of both the kernel and the CPU to ensure that I don't end up reading someone else's textbook or writing in someone else's notebook. That is, they ensure that one process does not have access to the memory locations of another process (a discussion of cell replication would look silly in my calculus notebook). The CPU also helps the kernel by recognizing when the process tries to access a page that is not yet in memory. It is then the kernel's job to figure out which process it was, what page it was, and to load the appropriate page.

It is also the kernel's responsibility to ensure that no one process hogs all available memory. Just like the librarian telling me to make some space on the table. If there is only one process running (not very likely) then there may be enough memory to keep the entire process loaded as it runs. More likely is the case where there are dozens of processes in memory and each gets a small part of the total memory. (Note, depending on how much memory you have, it is still possible that the entire program is in memory.)

Processes generally adhere to the principle of locality of reference. This means that over a short period of time, processes will access the same portions of their code over and over again. The kernel could establish a working set of pages for each process. These are the pages that have been accessed within the last n memory references. If n is small, then processes may not have enough pages in memory to do their job. Instead of letting processes work, the kernel is busy spending all of its time reading in the needed pages. By the time the system has finished reading in the needed pages, it is some other process' turn. Now some other process needs more pages, so the kernel needs to read them in. This is called thrashing. Large values of n may lead to cases where there is not enough memory for all the processes to run.

However, SCO UNIX does not use the working set model, but instead uses the concept of a window. When the amount of available (free) memory drops below a certain point (which is configurable), the vhand process is woken up and put on the run queue. Using this window does not mean that thrashing cannot occur on an SCO system. When memory gets so full of user processes that the system spends more time freeing up memory and swapping processes in and out than doing real work, even SCO will thrash.


Figure 0-8 sched and vhand working together

As I mentioned above, the number of free pages is checked at every clock tick. If this gets below the value set by the kernel tunable GPGSLO, vhand, the "page stealer” process, is woken up. When it runs, vhand searches for pages that have not been recently accessed by a process.

If a page has not been referenced within a pre-determined time, it is "freed” by vhand and added to a list of free pages. In order to keep one area from having pages stolen more often than others, vhand remembers where it was and starts with a different area the next time it runs. When vhand has freed up enough pages so there are more than GPGSHI pages available, it puts itself back to sleep. This is the reason why vhand does not always run, even though it has a high priority.

If a page that vhand steals is part of the executable portion of a process (the text), it can easily free the page, since it can always get it back from the hard disk. However, if the page is part of the process' data, this may be the only place that data exists. The simplest solution would be to say that data pages cannot be stolen. This is a problem, though, since you may have a program that needs more data than there is physical memory. Since you cannot keep it all in memory at the same time, you have to come up with a better solution.

The solution is to use a portion of hard disk as a kind of temporary storage for data pages that are not currently needed. This area of the hard disk is called the swap space or swap device and is a separate area used solely for the purpose of holding data pages from programs. Copying pages to the swap space is the responsibility of the system process swapper.

The size and location of the swap device are normally set when the system is first installed. Afterwards, more swap space can be added if needed. (This is done with the swap command.) Occupied and free areas of the swap device are managed by a map, where a zero value says the page on the swap device is free and a non-zero value is the number of processes sharing that page.

If the system is short of memory and pages need to be swapped out, the swapper process needs to determine just which processes can be swapped out. It first looks for processes in either the SSLEEP state (process is sleeping) or the SSTOP state (process is stopped by a debugger). If there is only one such process, the choice is easy; if not, the swapper calculates which one has the lowest priority.

Often there are no processes that fit these criteria. Although this usually happens only on systems that are heavily loaded, the system needs to take into account cases where there are no processes in either SSLEEP or SSTOP. Therefore, the swapper needs to look elsewhere to find a process to swap out. It then considers processes in the SRUN (ready-to-run) or SXBRK (needs more memory) states and tries to find the process with the lowest priority among them.
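Put as code, the selection described in the last two paragraphs looks roughly like the sketch below. Again, this is only an illustration with names and helpers of my own; the real swapper works directly on the process table.

    #include <stddef.h>

    /* Simplified victim selection for swap-out (illustration only). */
    enum pstate { SSLEEP, SSTOP, SRUN, SXBRK };  /* stand-ins for the <sys/proc.h> states */

    struct proc;                                 /* opaque process table entry            */
    /* hypothetical helper: lowest-priority process found in either of two states */
    extern struct proc *lowest_priority_in(enum pstate s1, enum pstate s2);

    struct proc *choose_swap_victim(void)
    {
        struct proc *victim;

        /* First choice: a sleeping or stopped process. */
        victim = lowest_priority_in(SSLEEP, SSTOP);

        /* Fall back to runnable or memory-starved processes if none was found. */
        if (victim == NULL)
            victim = lowest_priority_in(SRUN, SXBRK);

        return victim;
    }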

If the swapper is trying to make room for a process that is already swapped out, then that process must have been on the swap device for at least two seconds. The two second threshold is to keep the system from spending all of its time thrashing and not getting any work done.

If we find a suitable process to swap out, sched locks it in memory. Hmmm. Why lock a process into memory that we are just going to swap out? Remember that sched is just another process. It could be context switched out if an interrupt (or something else) occurs. What happens if sched gets context switched out, but vhand runs before sched gets a chance to run again? Vhand could steal pages from the chosen process. When sched gets switched back in, the pages it wanted to swap out may already be gone. Either sched has to figure out all over again which pages are to be swapped out, or vhand ends up stealing pages that sched just brought in. Instead, sched locks the pages in memory.

Next, space is allocated on the swap device for the process' uarea. If there is no space available, the swapper generates an error indicating swap space is running low and will try to swap out other parts of the process. If it can't, the system will panic with an "out of swap” message. Remember that a panic is when something happens that the kernel does not know how to deal with. Since the kernel has a process that needs to run and it cannot make more memory available by swapping out a process, the kernel doesn't know what to do. Therefore it panics.

All regions of the process are checked. If a region is locked into memory, it will not be swapped. If a region is for private use, such as data or stack, all of the region will be swapped out. If the region is shared, only the unreferenced pages will be swapped. That is, only those pages that have not been referenced within a certain amount of time will be swapped.

Note that swapping may not require a physical write to the swap device. This is due to the fact that once an area is allocated for a process, it remains so allocated until the process terminates. Therefore, it can happen that a page is swapped back in to be read, but never written to. If that page later needs to be swapped out again, there is no need to swap it out as it already exists in the correct state on the hard disk.

Eventually the process that got swapped out will get a turn on the CPU and will need to be swapped back in. Before it can be swapped back in, the swapper needs to ensure that there is enough memory for at least the uarea and a set of structures called page tables. Page tables are an integral part of the virtual memory scheme and point to the actual pages in memory. We talk more about this when we talk about the CPU in the hardware section.

If there isn't enough room, sched looks for the process that has been waiting for memory the longest (SXBRK) and tries to allocate memory for that process. If there are none in this state, sched goes through the same procedures as it does for processes that are already in main memory. It's possible for sched to make memory available to other processes in addition to the original one. Up to ten processes in the SXBRK state can have memory allocated and up to five can be swapped back in during a pass of the swapper.

Often you don't want to swap in certain processes. For example, it doesn't make sense to swap in a process that is sleeping on some event. Since that event hasn't occurred yet, swapping it in means that it will just go right back to sleep. Therefore, only processes in the SRUN state are eligible to be swapped back in; that is, only processes that are runnable. In addition, the process must have a priority value less than 60. If it has a higher priority value (and therefore a lower priority), the odds are that other processes will be run first. Since the system is already swapping, it is more than likely that this process would just get swapped out again.
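As a quick test in code, the eligibility check just described might look like this. The field names are modeled on the traditional proc structure, but the fragment is only an illustration:

    /* Sketch of the swap-in eligibility test (illustration only). */
    enum pstate { SRUN, SSLEEP, SSTOP, SXBRK };  /* stand-ins for <sys/proc.h> states */

    struct proc_info {
        enum pstate p_stat;                      /* current state                      */
        int         p_pri;                       /* priority value; larger = less urgent */
    };

    int eligible_for_swapin(const struct proc_info *p)
    {
        /* runnable and with a priority value below 60 */
        return p->p_stat == SRUN && p->p_pri < 60;
    }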

When a process is being swapped back in, the pages that are not part of the uarea or page tables are left on the swap device. It is not until the process actually needs them that they will be swapped back in. This happens in pretty short order, since anything the process wants to do will cause a page fault and cause pages to be swapped back in.

Keep in mind that accessing the hard disk is hundreds of times slower than accessing memory. Although swapping does allow you to have more programs in memory than the physical RAM would otherwise allow, using it does slow down the system. If possible, it is a good idea to avoid swapping by adding more RAM.

Processes in Action

If you are like me, knowing how things work in theory is not enough. You want to see how things are working on your system. SCO provides several tools for you to watch what is happening. The first is perhaps the only one that the majority of users have ever seen. This is the ps command, which gives you the process status of particular processes. Depending on your security level, normal users can even look at every process on the system. (They must have the mem subsystem privilege.)

Although users can look at processes using the ps command, they cannot look at the insides of the processes themselves. This is because the ps command is simply reading the process table, which contains only the control and data structures necessary to administer and manage the process, not the process itself. Despite this, ps can not only show you a lot about what your system is doing, but also give you insights into how the system works. Because much of what I will talk about is documented in the ps man-page, I suggest in advance that you take a look there for more details.

If you start ps from the command line with no options, the default behavior is to show the processes running on your current terminal, something like this:

  PID TTY        TIME CMD
  608 ttyp0  00:00:02 ksh
 1147 ttyp0  00:00:00 ps


This shows us the process ID (PID), the terminal the process is running on (TTY), the total amount of time the process has had on the CPU (TIME) and the command that was run (CMD). If the process had already issued an exit() but hadn't yet finished by the time ps read the process table, we would probably see <defunct> in this column.
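If you want to see a <defunct> entry with your own eyes, a tiny C program along these lines will produce one. This is just a throwaway demonstration of mine: the child exits immediately, but the parent deliberately never calls wait(), so for the next minute the child shows up in ps as a zombie.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0)                       /* child: exit right away               */
            exit(0);

        if (pid > 0) {                      /* parent: never calls wait()           */
            printf("child %ld should now show as <defunct>\n", (long)pid);
            sleep(60);                      /* run ps from another terminal now     */
        }
        return 0;
    }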

Although this is useful in many circumstances, it doesn't say much about these processes. Let's see what the long output looks like. This is run as ps -l.


 F S  UID  PID PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S    0  608  607  1  73 24 fb11b9e8 140 fb11b9e8 ttyp0  00:00:02 ksh
20 O    0 1172  608 14  42 24 fb11c0a0 184        - ttyp0  00:00:00 ps

Now this output looks a little better. At least there are more columns, so maybe it is more interesting. The PID, TTY, TIME and CMD columns are the same as in the previous output.

The first column (F) contains flags, in octal, that tell us something about the state of the process. For example, a 01 here would indicate a system process that is always in memory, such as vhand or sched. The 20 in both cases here means that the process is in main memory.

The S column shows the "official" state that the process is in. These states are defined in <sys/proc.h> and can be one of the following values:


  • O - Process is currently on the processor (SONPROC)
  • S - Sleeping (SSLEEP)
  • R - Ready to run (SRUN)
  • I - Idle, being created (SIDL)
  • Z - Zombie state (SZOMB)
  • T - Process being traced, used by debuggers (SSTOP)
  • B - Process is waiting for more memory (SXBRK)

Here we see that the ksh process (the first line) is sleeping. Although we can't tell from the output alone, I know that the event it is waiting on is the completion of the ps command. One indication I have is the PID and PPID columns. These are, respectively, the process ID and parent process ID. Notice that the PPID of the ps process is the same as the PID of the ksh process. This is because I started the ps command from the ksh command line, and the ksh had to do a fork()-exec() to start up the ps. This makes ps a child process of the ksh. Since I didn't start the ps in the background, I know the ksh is waiting on the completion of the ps. (More on this in a moment.)
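In skeleton form, the fork()-exec() sequence the shell goes through looks something like the fragment below. This is my own minimal sketch of what any shell does, not ksh source: the child becomes the new program, and the parent waits for it unless the command was put in the background.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void run_command(char *argv[], int background)
    {
        pid_t pid = fork();

        if (pid == 0) {                     /* child: become the requested program  */
            execvp(argv[0], argv);
            perror("exec");                 /* only reached if the exec failed      */
            _exit(1);
        }

        if (pid > 0 && !background)         /* foreground: the shell sleeps (its    */
            waitpid(pid, NULL, 0);          /* WCHAN) until the child has finished  */
    }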

We see that the ps process is on the processor (state O-SONPROC). As a matter of fact, I have never run a ps command where ps was not on the processor. Why? Well, the only way for ps to read the process table is to be running and the only way for a process to be running is to be on the processor.

Since I just happened to be running these processes as root, the UID column, which shows the user ID of the owner of the process, contains 0. The owner is almost always the user that started the process. However, the owner of a process can be changed using the setuid() or seteuid() system call.
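As a quick illustration of the latter, a program started by root can give itself away to another user ID. The UID 200 used here is made up; the point is only the setuid() call.

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        if (setuid(200) == -1) {            /* only root may switch to an arbitrary UID */
            perror("setuid");
            return 1;
        }
        printf("now running as uid %ld\n", (long)getuid());
        return 0;
    }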

The C column is an estimate of recent CPU usage. Using this value, combined with the process' priority (the PRI column) and the nice value (the NI column), sched calculates the scheduling priority of this process. The ADDR column is the virtual address of that process' entry in the process table. The SZ column is the size (in kilobytes) of the swappable portion of the process' data and stack.

The WCHAN column is the Wait CHANnel for the process. This is the event that the process is waiting on. Since ps is currently running, it is not waiting on any event; therefore, there is a dash in this column. The WCHAN that the ksh is waiting on is fb11b9e8. Although I have nothing here to prove it, I know that this event is the completion of the ps.

Although I can't prove this, I can make some inferences. First, let's look at the ps output again; this time, let's start ps in the background. This gives us the output:


 F S  UID  PID PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S    0  608  607  3  75 24 fb11b9e8 132 f01ebf4c ttyp0  00:00:02 ksh
20 O    0 1221  608 20  37 28 fb11cb60 184        - ttyp0  00:00:00 ps

Next, let's make use of the ability of ps to display the status of processes running on a specific terminal. We can then run the ps command from another terminal and look at what's happening on ttyp0. Running the command

ps -lt ttyp0

we get something like this:


 F S  UID  PID PPID  C PRI NI     ADDR  SZ    WCHAN TTY        TIME CMD
20 S    0  608 1295  3  75 24 fb11b9e8 132 f01ebf4c ttyp0  00:00:02 ksh

In the first of these last two examples, ksh did a fork-exec, but since we put the ps in the background, the shell returned to the prompt and didn't wait for the ps to complete. Instead, it was waiting for more input from the keyboard. In the second, ksh did nothing at all: I ran the ps from another terminal and it showed me only the ksh. Looking back at that screen, I see that it is sitting there, waiting for input from the keyboard. Notice that in both cases the WCHAN is the same; both are waiting for the same event, input from the keyboard. In the very first example, however, we did not put the command in the background, so the ksh's WCHAN was the completion of the ps.


Despite its ominous name, another useful tool is crash. Not only can you look at processes, you can also look at many other things, such as file tables, symbol tables, region tables, both the global and local descriptor tables, and even the uarea of a process. Because crash needs read access to both /dev/mem and /unix, it can only be run by root. If a normal user were allowed to read /dev/mem, they could read another user's processes.

In the chapter on system monitoring, we'll talk about crash and all the things it can tell us about our system.
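Just to whet your appetite: you start crash as root and then type subcommands at its prompt. A session might look something like the lines below; the proc subcommand dumps the process table and quit leaves the program. Check the crash(ADM) man-page for the full list of subcommands, since the exact prompt and output may differ from this sketch.

    # crash
    > proc
      (one line per process table slot)
    > quit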

Devices and Device Nodes

In UNIX, nothing works without devices. I mean NOTHING. Getting input from a keyboard or displaying it on your screen both require devices. Accessing data from the hard disk or printing a report also requires devices. In an operating system like DOS, all of the input and output functions are almost entirely hidden from you. Drivers for these devices must exist in order to use them, but they are hidden behind the cloak of the operating system.

Although they access the same physical hardware, device drivers under UNIX are more complex than their DOS cousins. Adding new drivers is easier under DOS, but SCO provides more flexibility in modifying the ones you have. SCO UNIX provides a mechanism to simplify adding these input and output functions: a set of tools and utilities to modify and configure your system, collectively called the Link Kit, or link kit.

The link kit is part of the extended utilities packages; therefore, it is not a required component of your operating system. Although having the link kit on your system is not required for proper operation, you are unable to add devices or change kernel parameters without it.

Because the link kit directly modifies the configuration files and drivers that are combined to create the operating system, a great deal of care must be exercised when making changes. If you are lucky, an incorrectly configured kernel just won't boot rather than trashing your hard disk. If you're not ... Well, is your resumé up to date?

Major and Minor Numbers

To UNIX, everything is a file. To write to the hard disk, you write to a file. To read from the keyboard is to read from a file. To store backups on a tape device is to write to a file. Even to read from memory is to read from a file. If the file you are trying to read from, or write to, is a "normal" file, the process is fairly easy to understand: the file is opened and you read or write data. If, however, the device being accessed is a special device file (also referred to as a device node), a fair bit of work needs to be done before the read or write operation can begin.

One of the key aspects of understanding device files lies in the fact that different devices behave and react differently. There are no keys on a hard disk and no sectors on a keyboard. However, you can read from both. The system, therefore, needs a mechanism whereby it can distinguish between the various types of devices and behave accordingly.

In order to access a device accordingly, the operating system needs to be told what to do. Obviously the manner in which the kernel accesses a hard disk will be different from the way it accesses a terminal. Both can be read from and written to, but that's about where the similarities end. In order to access each of these totally different kinds of devices, the kernel needs to know that they are, in fact, different.

Inside the kernel are functions for each of the devices the kernel is going to access. All the routines for a specific device are jointly referred to as the device driver. Each device on the system has its own device driver. Within each device driver are the functions that are used to access the device. For devices such as a hard disk or terminal, the system needs to be able to (among other things) open the device, write to the device, read from the device and close the device. Therefore, the respective drivers will contain the routines needed to open, write to, read from and close those devices.

In order to determine how to access the device, the kernel needs to be told. Not only does the kernel need to know what kind of device is being accessed, but also any special information, such as the partition number if it's a hard disk or the density if it's a floppy. This is accomplished by the major and minor numbers of that device.

The major number is actually the offset into the kernel's device driver table, which tells the kernel what kind of device it is. That is, whether it is a hard disk or a serial terminal. The minor number tells the kernel special characteristics of the device to be accessed. For example, the second hard disk has a different minor number than the first. The COM1 port has a different minor number than the COM2 port.

Figure 0-9 Process major and minor numbers

It is through this table that the kernel reaches the routines that, in turn, access the physical hardware. Once the kernel has determined what kind of device it is talking to, it determines the specific device, the specific location or other characteristics of the device by means of the minor number.
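Conceptually, the table is just an array of structures full of function pointers, indexed by the major number. The fragment below is a simplified model of that idea; the real kernel keeps separate block and character switch tables (traditionally called bdevsw and cdevsw), and the structure shown here is invented purely for illustration.

    /* Simplified model of a character device switch table (illustration only). */
    struct cdev_ops {
        int (*d_open)(unsigned dev);
        int (*d_close)(unsigned dev);
        int (*d_read)(unsigned dev, char *buf, unsigned count);
        int (*d_write)(unsigned dev, const char *buf, unsigned count);
    };

    extern struct cdev_ops cdevsw[];            /* one entry per major number         */

    int dev_read(unsigned dev, char *buf, unsigned count)
    {
        unsigned maj = (dev >> 8) & 0xff;       /* the major number picks the driver  */

        /* The minor number (dev & 0xff) travels along, so the driver can tell
           which unit, partition, density, etc. it is really dealing with.            */
        return cdevsw[maj].d_read(dev, buf, count);
    }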

In order to find out what functions can be used to access a particular device, you can take a look in the /etc/conf/cf.d/mdevice file, which contains a list of all the devices in the system. Aside from the function list, mdevice also contains the major numbers for that device. For details on the mdevice file, take a look at the mdevice(F) man-page.

So how do we, let alone the kernel, know what the major and minor number of a device are? By doing a long listing of the /dev directory (either with an l or an ls -l), there are two things that tell us that the files in this directory are not normal files. One thing to look at is the first character on each line. If these were regular files, the first character would be a -. In /dev almost every entry starts with either a b or a c. These represent, respectively, block devices and character devices. (The remaining entries are all directories and begin with a d. See the ls(C) man-page for additional details on the format of these entries.)

The second indicator that these are not "normal" files is the fifth field of the listing. For both regular files and directories, this field shows their size. However, device nodes do not have a size. The only place they exist is their inode (and, of course, the corresponding directory entry). There are no data blocks taken up by the device file, therefore it has no size. For device nodes, there are two numbers instead of one for the size. These are, respectively, the major and minor number of the device.
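For example, a couple of entries from a long listing of /dev might look something like this (the names, owners, dates and numbers are made up just to show the format):

    brw-------   1 sysinfo  sysinfo    1, 40  Sep  1 12:00  hd0a
    crw-rw-rw-   1 bin      bin        5,  2  Sep  1 12:00  tty1a

The b and c in the first column mark these as a block and a character device, and where a regular file would show its size you see two numbers instead: the major number (1 and 5 here) followed by the minor number (40 and 2).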

Like file sizes, major and minor numbers are stored in the file's inode. In fact, they are stored in the same field of the inode structure. File sizes can be up to 2,147,483,648 bytes (2 gigabytes, or 2^31 bytes), but major and minor numbers are limited to a single byte each. Therefore, there can be only 256 major numbers and 256 minor numbers per major.
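Since each number fits in a single byte, the two can be packed into one 16-bit device number, with the major number conventionally in the high byte and the minor number in the low byte. A tiny sketch of the arithmetic (the values are, of course, hypothetical):

    #include <stdio.h>

    int main(void)
    {
        unsigned short dev = (1 << 8) | 40;     /* hypothetical major 1, minor 40    */

        printf("major = %d, minor = %d\n",
               (dev >> 8) & 0xff,               /* high byte: major number           */
               dev & 0xff);                     /* low byte:  minor number           */
        return 0;
    }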