Jim Mohr's SCO Companion


Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/

File Systems and Files


Any time you access an SCO system, whether locally, across a network, or through any other means, both files and filesystems are involved. Every program that you run starts out as a file. Most of the time you are also reading or writing a file. Since files (whether programs or data files) reside on filesystems, every time you access the system you are also accessing a filesystem.

Knowing what a file is, how it is represented on the disk, and how the system interprets its contents helps you understand what the system is doing. You can also use this understanding to evaluate system and application behavior and determine whether it is correct.

Disk Layout

In order to be able to access data on your hard disk, there has to be some pre-defined structure. Without structure, it ends up looking like my desk, where there are several piles of papers and I have to look through every pile in order to find what I am looking for. Instead, the layout of a hard disk follows a very consistent pattern. So consistent, that it is even possible for different operating systems to share the hard disk.

Basic to this structure is the concept of a partition. A partition defines a portion of the hard disk to be used by one operating system or another. The partition can be any size, including the entire hard disk. Near the very beginning of the disk is the partition table. The partition table is only 512 bytes, but can still define where each partition begins and how large it is. In addition, the partition table indicates which of the partitions is active. This decides which partition the system should go to when looking for an operating system to boot. The partition table is outside of any partition.

Once the system has determined which partition is active, the CPU knows to go to the very first block of data within that partition and begin executing the instructions there. On an SCO system this is an area called boot0. Although there are only 512 bytes of data in boot0, 1024 bytes are reserved for it. The code within boot0 is sufficient to execute the code in the next block, boot1. Here, 20Kb are reserved, although the actual code is slightly less. The code within boot1 is what reads the /boot program, which will eventually load the kernel.

Immediately after boot1 is the division table. Under SCO, a division is a unit of the hard disk contained within a partition. A division can be any size, including the entire partition. Often, special control structures are created at the beginning of the division that impose an additional structure on that division. This structure makes the division a filesystem. In order to keep track of where each division starts and how big it is, the system uses the division table. The division table has functionality similar to that of a partition table, although there is no such thing as an "active" division. There can be up to seven divisions (and therefore seven filesystems) per partition, and the size of the division table is fixed at 130 bytes, although 1024 bytes are reserved for the table.

Just after the division table is the bad track table. A bad track is a portion of the hard disk that has become unusable. Immediately following the bad track table is an area that is used for alias tracks. These are the tracks that are used when one of the other tracks goes bad. If that occurs, the operating system marks the bad track as such in the bad track table and indicates which of the alias tracks will be used. The size of the area taken up by the alias tracks is determined by how many entries are in the bad track table. (There is one alias track per table entry.) You can see the contents of your bad track table by using the badtrk utility. Once the table and alias tracks have been defined, you cannot increase the number without re-installing.

Just after the bad track table are the divisions. If you have one of the older SCO UNIX filesystems (AFS, EAFS), there are two control structures at the beginning of the filesystem: the superblock and the inode table. The superblock contains information about the type of filesystem, its size, how many data blocks there are, the number of free inodes, the free space available, and where the inode table is.

Many users are not aware that different filesystems reside on different parts of the hard disk, and in many cases on different physical disks. From the user's perspective the entire directory structure is one unit, from the top (/) down to the deepest sub-directory. In order to carry out this deception, the system administrator needs to mount filesystems. This is done by mounting the device node associated with the filesystem (e.g. /dev/u) onto a mountpoint (e.g. /u). This can be done by hand with the mount command, or by having the system do it for you at boot time through entries in /etc/default/filesys. See the mount(ADM) and filesys(F) man-pages for more details.
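As a minimal sketch (the device and directory names are only examples, and the exact filesys(F) fields vary between releases, so check the man-page before copying this):

mount /dev/u /u                       # mount by hand
umount /u                             # and unmount it again

# or have it mounted at boot with an entry in /etc/default/filesys, roughly:
# bdev=/dev/u cdev=/dev/ru mountdir=/u desc="User filesystem" rcmount=yes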

Conceptually, the mountpoint serves as a detour sign for the system. If there is no filesystem mounted on the mountpoint, the system can just drive through and access what's there. If a filesystem is mounted, when the system gets to the mountpoint it sees the detour sign and is immediately diverted in another direction. Just as the road, trees and houses still exist on the other side of the detour sign, any file or directory that exists underneath the mountpoint is still there. You just can't get to it.

Let's look at an example. We have the /dev/u filesystem, which we are mounting on /u. Let's say that when we first installed the system, and before we first mounted the /dev/u filesystem, we created some users with their home directories in /u, for example /u/jimmo. When we do finally mount the /dev/u filesystem onto the /u directory, we no longer see /u/jimmo. It is still there; however, once the system reaches the /u directory it is redirected somewhere else.

This brings up an interesting phenomenon. When you use find to locate a file, it will reach the mountpoint and get redirected. However, ncheck is not file and directory oriented, but rather filesystem oriented. If you used find, you would not see /u/jimmo. However, you would if you used ncheck!
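A minimal sketch of the difference, assuming /dev/u is mounted on /u and that jimmo's home directory was created under /u before the filesystem was mounted (the paths are only illustrative):

find / -name jimmo -print         # diverted at the mountpoint; the hidden /u/jimmo is not listed
ncheck /dev/root | grep jimmo     # works on the root filesystem itself, so the hidden entry still shows up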

When a filesystem is mounted, the kernel reads the filesystem's superblock into an internal copy of the superblock. This way, the kernel doesn't have to keep going back to the hard disk for this information.

Figure 0-1 Boot Hard Disk Layout

The inode is (perhaps) the most important structure. It contains all the information about a file including owner and group, permissions, creation time, and, most importantly, where the data blocks are on the hard disk. The only thing it's missing is the name of the file. That's stored in the directory and not in the inode.

If you have a Desktop Filesystem (DTFS), then there is no inode table. Rather, the inodes are scattered across the disk. How they are accessed is something we'll get into later when we talk about the different filesystems.

After the superblock (and inode table, if there is one) you get to the actual data. Data is stored in a system of files within each filesystem (hence the name). As we talked about before in the section on SCO basics, files are grouped together into directories. This grouping is completely theoretical in the sense that there is nothing physically associating the files. Not only can files in the same directory be spread out across the disk, it is possible that the individual data blocks of a file are scattered as well.

Figure 0-1 shows you where all the structures are on the hard disk.


In most systems, there will be at least two divisions on your root hard disk. On ODT 3.0 systems these divisions will contain your root filesystem and your swap space. Although it takes up a division, just like the root filesystem, your swap space is not a filesystem, because it has none of the control structures (superblock, inode table) that make it a filesystem. Despite this, there must still be an entry in the division table for it. In OpenServer, there is a new filesystem at the beginning of the partition and the root filesystem is moved to a division after the swap space. We'll go into more detail later. (Note that whether you have the extra division depends on what kind of installation you did. We'll cover this in more detail in Chapter 13.)

Up to this point we've talked a great deal about both files and directories, where they reside and what their attributes (characteristics) are. Now it's time to talk about the concepts behind files and directories. We need to talk about how the operating system sees files and directories and how the system manages them.

From our discussion of how a hard disk is divided, we know that files reside within filesystems. Each filesystem has special structures that allow the system to manage the files: the superblock and the inodes. The actual data is stored somewhere in the filesystem in data blocks. Most SCO UNIX filesystems use a block size of 1024 bytes. If you have OpenServer, the new DTFS has a variable block size.

Every SCO UNIX filesystem uses inodes, which, as I mentioned earlier, contain the important information about a file. (In some books, inode is short for information node and in others it is short for index node.) Although the structure of the inodes is different for each filesystem, they hold the same kinds of information. What each contains can be found in <sys/ino.h>, <sys/inode.h> and <sys/fs/*>. Each inode has pointers which tell it where the actual data is located. How this is done is dependent on the filesystem type and we'll get to that in a moment. One piece of information that the inode does not contain is the file name. This is contained only within the directory.

If you are running OpenServer, then there are at least three divisions used. The first one (slot 0 in the division table) is used for the /dev/boot filesystem. This contains the files that are necessary to load and start the operating system. Although this is what is used to start the system, it is not the root filesystem. The root filesystem has been moved to the third division. Once the system has been loaded, the /dev/boot filesystem is mounted onto the /stand directory and is accessible like any other mounted filesystem, except that it is normally mounted read-only. In both ODT 3.0 and OpenServer, the root filesystem normally contains most of the files your operating system uses.

Depending on the size of your primary hard disk and the configuration options you chose during installation, you may have more than just these default filesystems. Common configurations include having separate filesystems for users' home directories or data.

Type       Description
----       -----------
EAFS       Extended Acer Fast Filesystem (default)
AFS        Acer Fast Filesystem
S51K       AT&T UNIX System V 1KB Filesystem
HS         High Sierra CD-ROM Filesystem
ISO9660    ISO 9660 CD-ROM Filesystem
XENIX      XENIX Filesystem
DOS        DOS Filesystem
NFS        Network Filesystem

Table 0.1 Filesystems Supported by ODT 3.0

Type       Description
----       -----------
HTFS       High Throughput Filesystem (default)
DTFS       Desktop Filesystem (compression)
EAFS       Extended Acer Fast Filesystem
AFS        Acer Fast Filesystem
S51K       AT&T UNIX System V 1KB Filesystem
HS         High Sierra CD-ROM Filesystem
ISO9660    ISO 9660 CD-ROM Filesystem
XENIX      XENIX Filesystem
DOS        DOS Filesystem
NFS        Network Filesystem
Rockridge  Rockridge CD-ROM Filesystem
NetWare    SCO Gateway for NetWare Filesystem
LMCFS      LAN Manager Client Filesystem

Table 0.2 Filesystems Supported by OpenServer

Although all of these filesystems are supported, not all are configured into your kernel by default. If you have ODT 3.0, you automatically have support for the three standard UNIX filesystems (EAFS, AFS and S51K) as well as the XENIX filesystem. In OpenServer, the three UNIX filesystems supported in ODT are included, as well as the XENIX filesystem and the two new ones: HTFS and DTFS. In order for the others to be recognized, they must first be configured into the kernel. How this is accomplished depends on what product you are running and which filesystem. Table 0.1 and Table 0.2 show which filesystems are supported.

If you want to use one of the network filesystems such as NFS, SCO Gateway for NetWare, or LAN Manager Client Filesystem, you need to add that product through the Software Manager. This will automatically add support for the appropriate filesystem into the kernel.

If you have ODT 3.0, then you can use sysadmsh to add the driver for each of the other filesystems. On OpenServer, you use the Hardware/Kernel Manager. In both cases, there are mkdev scripts that will do this. In fact, these scripts are called by sysadmsh and the Hardware/Kernel Manager. Table 0.3 shows you which script is run for each filesystem.


Filesystem                       mkdev script
----------                       ------------
DOS                              dos
DTFS                             dtfs
HTFS                             htfs
High Sierra/ISO9660/Rockridge    high-sierra
XENIX                            xenix

Table 0.3 mkdev scripts associated with each filesystem

Files

From the filesystem's standpoint, there is no difference between a directory and a file. Both take up an inode, use data blocks and have certain attributes. It is the commands and programs that we use that impose structure on the directory. For example, /bin/ls imposes structure when we do listings of directories.

Keep in mind that it is the ls command that puts the file names "in order." Within the directory, the file names do not appear in any order. Initially, files appear in the directory in chronological order. As files are created and removed, the slots taken up by older files are replaced by newer ones and even this order disappears.

Other commands, such as /bin/cat or /bin/hd, allow us to see directories as files, without any structure. Note that in OpenServer these commands no longer let you see the raw directory contents, which I see as a loss of functionality.

When you do a long listing of a file (ls -l or l), you can learn a lot about the characteristics of a file or directory. For example, if we do a long listing of /bin/date, we see:

-rwx--x--x 1 bin bin 17236 Dec 14 1991 /bin/date

We can see the type of file we have (the '-' in the first position says it's a regular file), the access permissions (rwx--x--x), how many links it has (1; we'll talk more about these in a moment), the owner and group (bin/bin), the size (17236), the date it was last written to (Dec 14, 1991; perhaps the time as well, if it is a newer file), and the name of the file (/bin/date). For additional details, see the ls(C) man-page.

Unlike operating systems like DOS, most of this information is not stored in the directory. In fact, the only information we see here that is actually stored in the directory is the file's name. If not in the directory, where is this other information kept, and how do you figure out where on the hard disk the data is?

As I mentioned before, this is all stored in the inode. All the inodes on each filesystem are stored at the beginning of that filesystem in the inode table. The inode table is simply a set of these inode structures. If you want, you can see what the structure looks like by taking a peek at <sys/ino.h>.

To access the information in the inode, you need the inode number. Each directory entry consists of an inode number and file name pair. On ODT 3.0 and earlier, the first two bytes of each entry were the inode number. Since two bytes can hold 256*256, or 65,536, different values, and an inode number of 0 marks an empty slot, there can be at most 65,535 inodes per filesystem. The inode number simply points to a particular entry in the inode table. This is the only connection between a filename and its inode, and therefore the only connection between the filename and the data on the hard disk.

Because this is only a pointer and there is no physical connection, there is nothing preventing you from having multiple entries in a directory pointing to the same file. These would have different names but the same inode number, and therefore point to the same physical data on the hard disk. Having multiple file names on your system point to the same data on the hard disk is not a sign of filesystem corruption! This is actually something that is done on purpose.

For example, if you do a long listing of /bin/ls (l /bin/ls), you see:

-r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls

Here the number of links (column 2) is 6. That means there are five other files on the system with the same inode number as /bin/ls. In fact, that's all a link is: a file with the same inode on the same filesystem. (More on that later.) To find out what inode that is, let's add the -i option, which gives us:

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls

From this we see that /bin/ls occupies entry 167 in the inode table. There are three ways of finding out what other files have this inode number:

find / -inum 167 -print

ncheck -i 167 /dev/root (assuming /bin/ls is on the root filesystem)

l -iR / | grep '167'

Since I know they are all in the /bin directory, I'll try the last one. This gives me:

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 l

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lc

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lf

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lr

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 ls

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lx

Interesting. This is the entire family of ls commands. All of these lines look identical, with the exception of the file name. There are six lines, which matches the number of links. Each has an inode of 167, so we know that all six names have the same inode and therefore point to the same location on the hard disk. That means that whenever you execute any one of these commands, the same program is started. The only difference is the behavior, and that is based on which name you actually typed on the command line. Since the program knows what name it was started with, it can change its behavior accordingly.

There is nothing special about the fact that these are all in the same directory. A name must only be unique within a single directory. You can therefore have two files with the same basename in two separate directories, for example /bin/mail and /usr/bin/mail. If you take a look, these not only have the same inode number (and are therefore the same file), there are actually three links, the third being /usr/bin/mailx. So here we have two files in the same directory (/usr/bin/mailx and /usr/bin/mail) as well as two files with the same basename (/bin/mail and /usr/bin/mail), all of which have the same inode and are, therefore, all the same file.
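You can check this for yourself; the actual inode number will be whatever it is on your system, but all three names should report the same one and a link count of 3:

l -i /bin/mail /usr/bin/mail /usr/bin/mailx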

The key issue here is that all three of these files exist on the same filesystem, /dev/root. As I mentioned before, there may be files on other filesystems that have the same inode number. This is the reason why you cannot create a link between files on two different filesystems. With a little manipulation, you might be able to force two files with identical content to have the same inode number on two filesystems. However, these are not links, just two files with the same name and the same content.

The problem is that it may be necessary to create links across filesystems. One reason is that you might want to create a path with a much shorter name that is easier to remember. Or perhaps you have several remote filesystems, accessed through NFS, and you want to create a common structure on multiple machines, all of which point to the same file. Therefore you need a mechanism that allows links across filesystems (even remote filesystems). This is the concept of a soft or symbolic link.

Symbolic links were first introduced to SCO UNIX in release 3.2.4.0. In SCO OpenServer they are (perhaps) the primary means of referring to installed files. For more information on referencing installed files, see the section on Software Storage Objects. Unlike hard links, symbolic links have their own inode and therefore a unique inode number. They normally also take up one data block, whose contents are the path to the file being referred to. Note that if the name is short enough, the symbolic link may be stored directly in the inode. See Table 0.4 for details on filesystem characteristics.


Figure 0-2 Symbolic link

For example, suppose I have a file on the /u filesystem named /u/data/albert.letter. I could create a symbolic link to it as /usr/jimmo/letters/albert.letter (no, it doesn't have to have the same name). The one data block assigned to the symbolic link /usr/jimmo/letters/albert.letter contains the text /u/data/albert.letter. Whenever I access the file /usr/jimmo/letters/albert.letter, the system (the filesystem driver) knows that this is a symbolic link. The system then reads the full path out of the data block and accesses the "real" file. Since the link contains only a path, you could have a filesystem mounted via NFS where the data is stored on a remote machine. Whatever you are using to access that file (e.g. an application or a system utility) cannot tell the difference.

For example, I might have a file in my own bin directory that points to a nifty utility on my friend's machine. I have a filesystem from my friend's machine mounted to my /usr/data directory. I could create a symbolic link like this:

ln -s /usr/data/nifty /usr/jimmo/bin/nifty

I would therefore have a symbolic link, /usr/jimmo/bin/nifty that looked like this:

lrwxrwxrwx 1 root other 15 May 03 00:12 /usr/jimmo/bin/nifty -> /usr/data/nifty

We see two indications that this is a symbolic link. First, the first character of the permissions field is an 'l'. Second, we see both the name of the file that we use (/usr/jimmo/bin/nifty) and a (sort-of) arrow that points to the "real" file (/usr/data/nifty). Note that there is nothing here that tells us that a remote filesystem is mounted onto /usr/data. The translation is done by the filesystem driver when the actual file is accessed.

If you were to use just the ls command, you would not see either the type of file (l) or the ->, so there is no way to know that this is a symbolic link. If you use lf, the file name is followed by an at-sign (@), which tells you that the file is a symbolic link.

Keep in mind that when the system determines that you are trying to access a symbolic link, the system then goes out and tries to access the "real" file and behaves accordingly. Therefore, symbolic links can also point to directories, or any other kinds of files, including other symbolic links.

Be careful when making a symbolic link. When you do, the system does not check to see whether the file you are pointing to exists. It is therefore possible to have a symbolic link point back to itself or point to nothing. Although most system utilities and commands can catch things like this, do not rely on it. Besides, what's the point of having a dog chase its own tail? It is also advisable not to use relative paths in symbolic links, since they may have unexpected results when the links are accessed from elsewhere on your system.

Let's go back to the actual structure of the directory entries for a minute. Remember that directories are simply files that have a structure imposed on them by something. If the command or utility imposes the (correct) structure, then each directory entry takes the form of 2 bytes for the inode number and 14 bytes for the file name itself.

Another change to the system that came in SCO UNIX 3.2.4.0 was the introduction of long filenames. Up to this point, file names were limited to 14 characters. With two bytes for the inode number, 64 of these 16-byte structures fit exactly into a disk block. However, with only 14 bytes for the name, giving files meaningful names was often difficult. I don't know how many times I personally spent time trying to remove vowels and otherwise squish a name together so that it fit in 14 characters. The default filesystem on SCO UNIX 3.2.4.0 changed all that.

One thing I liked about having 16 bytes was that a directory entry fit nicely into the default output of hd. That way you could easily see the internal structure of the directory. I don't know how many times I used hd when talking with customers about filesystem problems. However, the hd included in the initial release of OpenServer won't let you do this. In my opinion, removing that very useful functionality broke hd.

Up to 3.2.4.0, SCO UNIX used the Acer Fast Filesystem (AFS), which had some advantages over the standard UNIX (S51K) filesystem. However, neither can handle symbolic links or long file names. The Extended Acer Fast Filesystem (EAFS) changed that. Since the directory entries of the AFS are 16 bytes long, long file names have to "spill over" into subsequent entries in the directory. Since a file has only one inode, file names beyond 14 characters extend into consecutive entries in the directory. Because they take up multiple slots, all but the last entry has the inode number 0xffff, which indicates that the file name continues in the next slot. Even with long file names, file names on an EAFS are limited to 255 characters.

When a file is removed, the inode number in its directory entry is changed to 0. Do an hd of the directory (if you're running ODT) and you will still see the file name, but the inode is 0. When a new file is created, the file name takes up a slot used by an older, previously removed file if the name can fit. Otherwise it must take a new slot. Since long names need to be in consecutive slots, they may not be able to take up empty slots. If so, new entries may need to be created for the longer file names.
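On ODT 3.0 you can watch this happen by dumping a small test directory with hd (the directory name here is just an example):

cd /tmp/testdir
hd .          # each 16-byte line of output is one entry: a 2-byte inode number, then up to 14 bytes of name
              # entries for removed files still show the old name, but with an inode number of 0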

When you create a file, the system looks in the directory for the first available slot. If this is an EAFS, it is possible that the name you want to create might not fit in the first free slot. Remember that each slot is 16 bytes long: two for the inode number and 14 for the file name. If, for example, slots 16 and 18 are filled and slot 17 is free, a file name that is longer than 14 characters cannot fit there, because the directory entries must be contiguous.

The system must, therefore, either find a run of slots large enough or create new slots at the end of the directory. For example, if slots 14 and 18 were taken but slots 15-17 were free, any file name of up to 42 characters (14*3) would fit. Anything larger would need to go somewhere else.

If you were to count up all the bytes in the inode structure in ino.h, you'd find that each inode is 64 bytes. This means that there are 16 per disk block (16*64=1024). To keep from wasting space, the system always creates filesystems with the number of inodes being a multiple of 16.

Inode 1 is always at the start of the 3rd block of the filesystem (bytes 2048-2111) and is reserved (not used). Inode 2 is always the inode of the root directory of any filesystem. You can see this for the root filesystem by doing ls -id /. (The -d is necessary so you only see the directory and not its contents.)
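A quick check (the spacing of the output may differ slightly on your system):

ls -id /      # prints the inode number of the root directory, something like: 2 /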

The total number of inodes on an AFS or EAFS is defined when the filesystem is created by mkfs(ADM). Normally, you use the divvy command (by hand or through SCOAdmin in OpenServer) to create filesystems. The divvy command then calls mkfs to create the filesystem for you. The number of inodes created is based on an average file size of 4K. If you have a system with many smaller files, such as a mail or news server, you can run out of inodes and still have lots of room on your filesystem.

Therefore, if you have a news or mail server, it is a good idea to use mkfs by hand, with extra inodes, to create the filesystem before you add any files. Remember that the inode table is at the beginning of the filesystem and takes up as much room as it needs for a given number of inodes. If you want to have more inodes, you must have a larger inode table. The only place for the inode table to grow is into your data, so you would end up overwriting data. Besides, running mkfs 'zeroes' out your inode table, so the pointers to the data are lost anyway.
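A hedged sketch of what that might look like. I am assuming the blocks:inodes form of the size argument here, and the numbers are purely illustrative; check the mkfs(ADM) man-page for the exact syntax on your release:

# 200000 1Kb blocks, but 100000 inodes rather than the default of one per 4Kb
mkfs /dev/u 200000:100000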

Among the things that the inode keeps track of are file type and permissions, number of links, owner and group, size of the file, and when it was last modified. The inode is also where you will find the thirteen pointers (or triplets) to the actual data on the hard disk.

Note that these triplets are pointers to the data and not the data itself. Each one of the thirteen pointers is a block address on the hard disk. For the following discussion, please refer to Figure 0-3.

Each of these blocks is 1024 bytes (1K); therefore the maximum file size on an SCO UNIX system would be 13Kb. Wait a minute! That doesn't sound right, does it? In fact, it isn't. If (and that's a big if) all of the triplets pointed to data blocks, then you could only have a file up to 13Kb. However, there are dozens of files in the /bin directory alone that are larger than 13Kb. How's that?

The answer is that only the first ten of these triplets point to actual data. These are referred to as direct data blocks. The 11th triplet points to a block on the hard disk that contains the real pointers to the data. These pointers are 4-byte values, so there are 256 of them in each block, and the blocks they point to are the indirect data blocks. In Figure 0-3, the 11th triplet contains a pointer to block 567. Block 567 contains 256 pointers to indirect data blocks. One of these pointers points to block 33453, which contains the actual data. Block 33453 is an indirect data block.

Since the data blocks pointed to by the 256 pointers in block 567 each contain 1K of data, there is an additional 256K of data. So, with the 10K from the direct data blocks and the 256K from the indirect data blocks, we now have a maximum file size of 266K.

Hmmm. Still not good. Although there aren't that many, there are files on your system larger than 266K. A good example is /unix. So, that brings us to triplet 12. This points not to data blocks, nor to a block of pointers to data blocks, but to blocks that point to blocks that point to data blocks. These are the doubly-indirect data blocks.

In Figure 0-3 the 12th triplet contains a pointer to block 5601. Block 5601 contains pointers to other blocks, one of which is block 5151. However, block 5151 does not contain data, but more pointers. One of these points to block 56732. It is block 56732 that finally contains the data.

We have a block of 256 entries, each of which points to a block that itself contains 256 pointers to 1024-byte data blocks. This gives us 64Mb just for the doubly-indirect data blocks. At this point, the additional size gained by the singly-indirect and direct data blocks is negligible, so let's just say we can access over 64Mb. Now, that's much better. You would be hard pressed to find a system with files larger than 64Mb (unless we are talking about large database applications). However, we're not through yet. We have one triplet left.

So as not to bore too many of you, let's do the math quickly. The last triplet points to a block containing 256 pointers to other blocks, each of which points to 256 other blocks. At this point, we already have 65,536 blocks. Each of these 65,536 blocks contains 256 pointers to the actual data blocks. That is 16,777,216 pointers to data blocks, which gives us a grand total of 17,179,869,184 bytes, or 16Gb, of data (plus the insignificant 64Mb we get from the doubly-indirect data blocks). Oh, as you might have guessed, these are the triply-indirect data blocks.


Figure 0-3 Inodes Pointing to Disk blocks

In Figure 0-3 triplet 13 contains a pointer to block 43. Block 43 contains 256 pointers, one of which points to block 1979. Block 1979 also contains 256 pointers, one of which points to block 988. Block 988 also contains 256 pointers. However, these pointers point to the actual data, for example block 911.

If you are running an ODT 3.0 (or earlier) system, 16Gb is not your actual size limit. That is only the theoretical limit placed on you by the number of triply-indirect data blocks. Since the size of the file has to be kept track of, and it is stored in the inode as a signed long integer (31 usable bits), the actual limit is 2Gb.
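If you want to check the arithmetic above, nothing here is SCO-specific; it is plain sh and bc:

echo "10 * 1024" | bc                    # direct blocks:            10240
echo "256 * 1024" | bc                   # singly indirect:          262144
echo "256 * 256 * 1024" | bc             # doubly indirect:          67108864
echo "256 * 256 * 256 * 1024" | bc       # triply indirect:          17179869184
echo "2 ^ 31 - 1" | bc                   # signed 31-bit size limit: 2147483647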

As I mentioned a moment ago, when a file is removed, all that is done is that the inode number in the directory entry is set to 0. However, the slot remains. In most cases this is not a problem. However, when mail gets backed up, for example, there can be thousands of files in the mail spool directories. Each one of these requires a slot within the directory. As a result, the directory files can grow to amazing sizes. I have seen directories where the size of the directory file was over 300,000 bytes. This equates to about 20,000 files.
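A quick way to spot this on your own system (the spool directory is only an example path):

l -d /usr/spool/mail      # the size shown is the size of the directory file itself
                          # anything much over 10240 bytes has spilled past the ten direct blocks;
                          # divide the size by 16 for a rough count of name slots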

This brings up a couple of interesting issues. Remember that there are 10 direct data blocks holding 10Kb, then one singly-indirect block covering another 256K, for a total of 266Kb before the doubly-indirect blocks are needed. If the directory file is exceptionally large, and the file you are looking for happens to be at the very end of it, the system must first read all 10 direct data blocks, then read the 11th pointer to find the singly-indirect block, then read all of the data blocks it points to, then read the 12th pointer in the inode to find the block containing the pointers to the pointer blocks, then read those pointer blocks, and finally read the data blocks holding the remainder of the directory file. Since a copy of the inode is read into memory, at least there is no need to keep going back to the disk for the inode itself.

On the other hand, each of the blocks containing the indirect pointers has to be read, then each of the blocks they point to has to be read to check whether your file is there. Then you need to read the blocks that point to the blocks where the rest of your directory is. Only then do you find out that you mistyped the file name and have to do it all over again.

Since the system can usually get them all in one read, it is best to keep the number of files in a directory at 638 or less. Why 638? Each block can hold 64 entries, and there are 10 direct data blocks, so the direct data blocks can hold 640 entries. Each directory always contains the entries . and .., so you can only have 638 additional entries.

The next interesting thing is what happens when you run fsck on your system. If the filesystem is clean, there won't be a problem. What happens if you have a system crash and your filesystem becomes corrupted? If, during the check, fsck finds files that are pointed to by inodes but for which it finds no reference in any directory, it will place them in the /lost+found directory. When each filesystem is created, the system automagically creates 62 files in there and then removes them. This leaves 62 empty directory slots. 62 files plus . and .. gives you 64 total entries, times 16 bytes = 1024 bytes, or one data block.

The reason for pre-allocating slots in lost+found is that you don't want the system to be writing anything new to a filesystem that you are trying to clean. It is safe enough to fill in existing directory entries, but you don't want the system to be creating new directory blocks while trying to clean the filesystem. That is what would happen if you had more than 62 "lost" files.

If you have a trashed filesystem and there are more than 62 lost files, the extras really do become lost. The system cannot handle the additional files and has to remove them. Therefore, I think it is a good idea to create additional entries and then remove them whenever you create a new filesystem. This way you are prepared for the worst. A script to do this would be:

cd /lost+found
for i in a b c d e f g h i j
do
    for j in a b c d e f g h i j
    do
        for k in a b c d e f g h i j
        do
            touch $i$j$k
        done
    done
done
rm *

This script creates 1000 files and then removes them. This takes up about 16K for the directory file; however, it allows 1000 files to become "lost", which may be a job-saver in the future. Make sure that the rm is done after all the files are created; otherwise you end up creating a file, removing it, then filling the slot with some other file. The result is that you have fewer free slots than you expected.

If you look in /usr/lib/mkdev/fs (which is what is actually run when you run mkdev fs), you see that the system does something like this for you every time you add a filesystem. Just after you see the message:

Reserving slots in lost+found directory ...

the mkdev fs script does something very similar. The key difference is that mkdev fs only creates 62 entries. If you wanted to create 1000 entries every time you ran mkdev fs, you could change that part of mkdev fs to look like the above script.

Something that I always found interesting is that /bin/cp, /bin/ln and /bin/mv are all the same binary. That is, they are all links to each other. When you link a file, all that needs to be done is to create a new directory entry, fill it in with the correct inode number, and then increase the link count in the inode. Copying a file also creates a new directory entry, but it must also write the new data blocks to the disk.

When you move a file, something interesting happens. First, the system creates a link to the original file under the new name. It then removes the old file name by unlinking it, which simply clears the directory entry by setting the inode number to 0. However, once the system creates the link, for a brief instant there are two names for the file on your system.

When you remove a file, things get a little more complicated. The system needs not only to remove the directory entry referencing the file name, but also to decrease the link count in the inode. If the link count at this point is greater than 0, the system knows that there is another file name on the system pointing to the same data blocks. However, if the link count reaches 0, then there are no more directory entries pointing to that file. The system must then free those data blocks and make them available for other files.
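A minimal sketch that makes the link count visible; run it in a scratch directory:

cd /tmp
touch original                  # a new file with a link count of 1
ln original second.name         # a hard link; the link count goes to 2
l -i original second.name       # both names show the same inode number
rm original                     # the data survives; second.name still refers to it
rm second.name                  # the link count reaches 0 and the data blocks are freed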

Some of you might have realized that special device files (device nodes) do not take up any data space on the hard disk. The only place they "exist" is in the directory entry and the inode table. You may also have noticed that in the inode structure there is no entry for the major and minor numbers. However, if you do a long listing of a device node, you will see the major and minor numbers. Where are they kept?

Well, since there are no data blocks to point to, the 39 bytes used for the data block pointers are unused. This is exactly where the major and minor numbers are stored: the first byte of the array is the major number and the second byte is the minor number. This is one reason why major and minor numbers cannot be larger than 255.
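You can see this in a long listing of any device node; the major and minor numbers appear, separated by a comma, where the size of an ordinary file would be (the node below is only an example):

l /dev/tty01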

As with many aspects of the system, the kernel's role in administering and managing filesystems is wide-reaching and varied. Among its tasks is the organization of disk space within the filesystem. This function differs depending on what type of filesystem you are trying to access. For example, if you are copying files to a DOS FAT filesystem, the kernel has to be aware that there are different cluster sizes depending on the size of the partition. (A cluster is a physical grouping of data blocks.)

If you have an AFS (Acer Fast Filesystem) or EAFS (Extended AFS), the kernel attempts to keep data in logically contiguous blocks called clusters (on most modern hard disks, this also means physically contiguous). By default, a cluster is 16Kb, but this can be changed when the filesystem is created with mkfs.

When reading data off the disk, the system can read clusters rather than single blocks. Since files are normally read sequentially, the efficiency of each read is increased. This is because the system can read larger chunks of data and doesn't have to go looking for them. Therefore, the kernel can issue fewer (but larger) disk requests. If you have a hard disk controller that does "track caching" (storing previously read tracks), you improve your read efficiency even more.

However, the number of files may eventually grow to the point where storing data in 16K chunks is no longer practical. If there are no more free areas that are at least 16Kb, the system would have to begin moving things around to make a 16Kb area available. This would waste more time than would be gained by maintaining the 16Kb cluster.

Therefore, these chunks need to be split up. As the filesystem gets fuller, the amount by which the chunks are split up (called fragmentation) increases, so the system ends up having to move to different places on the disk to find the data. Because of this, the kernel ends up sending multiple requests, slowing down disk reads even further. (It would always be possible to move data blocks from other files to make room; however, this takes time and is therefore not practical.)

Figure 0-4 Disk fragmentation

The kernel is also responsible for the security of the files themselves. Because SCO UNIX is a multi-user system, it is important to ensure that users have access only to the files they should have access to. This access is on a per-file basis, in the form of the permissions set on each file. Based on several discussions we've had so far, we know that these permissions tell us who can read, write or execute our files. It is the kernel that makes this determination. The kernel also imposes the rule that only the owner or the all-powerful root may change the permissions or ownership of a file.

Allocation of disk blocks depends on the organization of what is called the freelist. When a file is opened for the first time, its inode is read into the kernel's generic inode table. This is a "generic" table in that it is valid for all filesystems. Therefore, on subsequent reads and writes this information is already available and the kernel does not have to make an additional disk read to get the inode information. Remember, it is the inode that contains the pointers to the actual data blocks. If this information were not kept in the kernel, it would need to be read from the hard disk every time the file was accessed.

Keep in mind that if you have a process that is reading or writing to the disk, it is the kernel that does the actual disk access, through the filesystem and hard disk drivers. Every time the kernel does a read of a block device, it first checks the buffer cache to see if the data already exists there. If so, the kernel has saved itself a disk read. Obviously, if it's not there, the kernel must read it from the hard disk.

At first this seems like a waste of time; checking one place and then checking another. Every single read checks the buffer cache first, so in many cases this is wasted effort. True. However, the buffer cache is in RAM, which can be several hundred times faster to access than the hard disk. As a result of the principle of locality, your process (and the kernel as well) will probably be accessing the same data over and over again. Therefore, the buffer cache is actually a great time saver, since the proportion of times it finds something in the cache (the hit ratio) is so high.

When writing a file (or parts of a file), the data is first written to the buffer cache. If it remains unchanged for a specific period of time (defined by the BDFLUSHR kernel parameter), the data is then written to the disk. This also saves time, because if data is written to the disk and then changed before it is read again, you've wasted a disk write. However, if it stays in the buffer cache forever (or until the file is closed, the process terminates, etc.), you run the risk of losing data if the system crashes. Therefore, BDFLUSHR is set to a reasonable default of 30 seconds.

As I mentioned a moment ago, when a file is first opened, its inode is read into the kernel's generic inode table (assuming it is not already there). This table is the same no matter what kind of filesystem you have (S51K, AFS, etc.). The structure of this table is defined in <sys/inode.h>. Its size is configurable in ODT 3.0 with the NINODE kernel parameter.

The entries in the generic inode table are linked into hash queues. A hash queue is basically a set of linked lists. Which list a particular inode goes into depends on its value. This speeds things up, since the kernel does not have to search the entire inode table but can immediately jump to a relatively short hash queue. The more hash queues there are (defined by the NHINODE kernel parameter), the faster lookups are, since each queue has fewer entries. However, the more queues there are, the more memory is required and the less room there is for other things. Therefore, you need to weigh one against the other.

Since there is normally no pattern to which files are removed from the inode table and when, the free slots end up spread randomly throughout the table. Free entries in the generic table are therefore linked onto a freelist so that new inodes may be allocated quickly.

One advantage that SCO UNIX provides is the ability to access different kinds of filesystems. Because of this, the kernel must also keep track of filesystem-specific information, such as that contained in the on-disk inode table. This information is kept in a kernel-internal table for each filesystem type. The System V dependent inode data structure is defined in <sys/fs/s5inode.h> and is used by S51K, AFS and EAFS. Other inode tables exist for High Sierra and DOS. Each time a file is opened, an entry is allocated in both the generic and the System V dependent inode table (unless it is already in memory). The information contained in these inode tables differs depending on what kind of filesystem you are dealing with.

When a process wants to access a file, it does so using a system call such as open() or write(). When first writing the code for a program, the system calls that programmers use are the same no matter what the filesystem type. When the process is running and makes one of these system calls, the kernel maps that call to operations appropriate for that type of filesystem. This is necessary since the way a file is accessed under DOS, for example, is different than under EAFS. The mapping information is maintained in a table, one per filesystem type, constructed during a kernel relink from information in /etc/conf/mfsys.d and /etc/conf/sfsys.d. The kernel then accesses the correct entry in the table by using the filesystem type as an index into the fstypesw[] array.

Another table used by the kernel to keep track of open files is the file table, defined in <sys/file.h>. This allows many processes to share the same inode information. Because it is often the case that multiple processes have the same file open, this saves the kernel time by not having to look up the inode information for each process individually. Once a file is open and in the file table, the kernel does not have to re-read the inode from disk.

Figure 0-5 Translation from file descriptors to data

By the time the kernel actually has the inode of a file that you are working with, it has gone through three different reference points. First, there is the uarea of your process, which has the translation from your process's file descriptors to entries in the file table. Next, the file table has the references that point the kernel to the appropriate slot in its generic inode table. Last, the generic inode table has the pointers to the filesystem-specific inode table. At first, this may seem like a lot of work. However, keep in mind that this is all in RAM. Without this mechanism, the kernel would have to go back and forth to the disk all the time.

The open() system call is implemented internally as a call to the namei() function, which does the name-to-inode conversion. Namei() sets up both the generic inode table entry and the filesystem dependent inode table entry, and returns a pointer to the generic table entry. Namei() then calls another routine, falloc(), which sets up an entry in the file table to point to the inode in the generic table.

The kernel then calls the ufalloc() routine, which sets up a file pointer in the process's uarea to point to the file table entry set up by falloc(). Finally, the return value of open() is an index into that file pointer array, known as the file descriptor.

The function of namei() is a bit more complicated than just converting a filename to an inode number. That seems like a simple thing to say, but in practice there is a lot more to it. Namei() converts filenames to inodes (not just to inode numbers). Obviously it must first get the inode number, but that is a relatively easy chore, since the number is contained within the directory entry for the file.

In order to find out which inode table to read, namei() needs to know on which filesystem a file resides. Simply reading the inode number from the directory entry is not enough. As we talked about before, two completely different files can have the same inode number provided they are on different filesystems. Therefore, even though namei() has the inode number, it still does not know which inode table to read.

In order to find the filesystem, namei() needs to have a complete pathname to the file. A UNIX pathname consists of zero or more directories, separated by '/' and terminated by the filename. The total path length cannot be more than 1024 characters. Assuming no directory name is mentioned when the file is opened (or only a relative path is given), namei() has to backtrack a little to get back up to the top of the directory tree.

If it is not already in memory, the inode corresponding to the first directory in the pathname is read into memory. The directory file is then read and its inode/filename pairs are searched for the next component of the path. The next directory is read in and the process continues until the actual file is reached. We now have the inode of the file.

With relative paths, or no paths at all, we have to backtrack. That is, in order to find the root directory of the filesystem we are on, we have to find the parent directory of our file, then its parent, and so on until we reach the root.

Looking at this, we see that the pathname-to-inode conversion is time consuming; each time a new directory is read, there must be a read of the hard disk. To speed things up, SCO UNIX caches directory entries. The size of the cache is set by the s5cachent kernel tunable parameter, and the entries are defined in <sys/fs/s5inode.h>. Whenever the kernel searches for a component of a file name, it checks the appropriate hash queue. In ODT 3.0 the s5cachent structures can't hold names of more than 14 characters, so for the long file names possible with EAFS the kernel must go directly to the disk. Cache hits and misses are recorded and can be monitored with sar.
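I believe the -n option to sar is the one that reports these name cache statistics, but check the sar(ADM) man-page on your release:

sar -n 5 5        # sample the namei cache hit/miss counters five times, five seconds apart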

In an S51K (traditional UNIX) filesystem, the superblock contains a list of both free blocks and free inodes. There is room for 50 blocks in the free block list and 100 inodes in the free inode list. The structure of the superblock is found in <sys/fs/s5filsys.h>.

When creating a new file, the system examines the array of free inode numbers in the superblock and the next free inode number is assigned. Since this list only has 100 entries, they will all eventually get used up. If the number of free inodes in the list drops to zero, the list is filled in with another 100 from the disk. If there are ever fewer than 100 free inodes, the unused entries are set to 0. In S51K filesystems, the list of free data blocks, the freelist, is ordered randomly. As disk blocks are freed, they are just appended to the end of the freelist. During allocation of data blocks, no account is taken of the physical location of the blocks. This means that there is no pattern to where the files reside on the disk, which can quickly lead to fragmentation. That is, data blocks from one file can be scattered all over the disk.

In the AFS and EAFS, the freelist is held as a bitmap, where adjacent bits in the map correspond to logically contiguous blocks. Therefore the system can quickly search for sets of bits representing free blocks and then allocate files in contiguous blocks. A group of logically contiguous blocks (usually physically contiguous as well) is known as a cluster.

When the filesystem is first created, the bitmap is created by mkfs. There is 1 bit for every data block in the filesystem, so the bitmap is a linear array that says whether a particular block contains valid data or not. Note that this bitmap occupies disk blocks itself. Actually, there is more than one bitmap: there are several, spaced at intervals of approximately 8192 blocks throughout the filesystem. Since a block contains 1024 bytes, it contains 8192 bits and can therefore map 8192 blocks. There is also an indirect freelist block, which holds a list of the disk block numbers that actually contain the bitmaps.

When a file is created, the entire cluster is reserved for the file. Although this does tend to waste a little space, it reduces fragmentation and therefore increases speed. When the kernel reads a block, it reads the whole cluster the block belongs to, as well as the next one. This is called read-ahead.

When a disk block is needed for a new file, the system searches the bitmap for the first free block. If we later need more data blocks for an existing file, the system begins its search starting from the block number that was last allocated for that file. This helps to ensure that new blocks are close to existing ones. Note that when a cluster is allocated, not all of its disk blocks may be free (some may already be allocated to another file).

The bitmapped freelist of the AFS and EAFS has some performance advantages. First, files are typically located in contiguous disk blocks, which can be allocated quickly from the freelist using i80386 bit-manipulation instructions. This means that free areas of the disk can be found in just a few instruction cycles, which speeds up access.

Figure 0-6 The AFS freelist

In addition, the freelist is held in memory. The advantage is that this keeps the system from having to make an additional disk access every time it wants to write new blocks to the hard disk. When the kernel issues an I/O request to read a single disk block, the AFS maps the request so that the entire cluster containing that block, plus the following cluster, is read from disk.

At the beginning of each filesystem is a filesystem-specific structure called the superblock. You can find out about the structure of the superblock by looking in <sys/fs/*>. The System V superblock is located in the 2nd half of the first block of the filesystem (bytes 512-1023). Since the structure is less than 512 bytes, it contains padding to fill it out to 512 bytes. When a filesystem is first mounted, its superblock is read into memory, so updates to the superblock don't have to constantly write to the disk.

In order for the structures on the disk to remain consistent with the copies in memory, superblocks and inodes are updated by sync, which is started at regular intervals by init. The frequency of the sync is defined by SLEEPTIME in /etc/default/boot, with a default of 60 seconds.
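You can check this on your own system; the value shown in the comment is just the default mentioned above:

grep SLEEPTIME /etc/default/boot      # typically reports SLEEPTIME=60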

New Filesystems

New Concepts

There are several concepts new to OpenServer. The first is intent logging. When this functionality is enabled, filesystem transactions are recorded in a log and then committed to disk. If the system goes down before a transaction is completed, the log is replayed to complete pending transactions. This scheme increases reliability and recovery speed, since the system need only read the log to bring the filesystem to a consistent state. By using this scheme, the time spent checking the filesystem (and repairing it if necessary) can be reduced to just a few seconds, rather than the several minutes previously required, regardless of the filesystem size. There is, however, a small performance penalty, since the system has to spend some time writing to the log.

As changes are being made to any of the control structures (inodes, superblock), the changes are written to a log. Once complete, the transaction is marked as complete. If the system should go down before the log entry is written, it is as if the transaction was never started. If the log entry is complete but the transaction hasn't finished, the transaction can either be completed or ignored, depending on what fsck considers possible. Obviously, if the system goes down after the transaction is complete, nothing needs to be done.

The location of the log file is stored in the superblock. As a real file it does reside somewhere on the filesystem; however, it is invisible to normal user-level utilities and only becomes visible when logging is disabled.


Intent logging does bring with it one common misconception: it does not increase the reliability of your data. Only changes to the control structures are logged; data is not. The purpose here is to reduce the time it takes to make the system operational again should it go down.

Another new concept is checkpointing. When enabled, the filesystem is marked as "clean" at regular intervals. That is, the pending writes are completed, inodes are updated and, if necessary, the in-core copy of the superblock is written to disk. At this point the filesystem is considered clean. Should the system go down improperly at this point, there is no need to clean the filesystem (using fsck), as it is already clean. However, the data is still cached in the buffer cache, so if it is needed again soon, it is available.

If the system goes down, the contents of the buffer cache are lost, but since they were already written to disk, no data is actually lost. Obviously, anything not written between the last checkpoint and the time the system goes down is lost, but checkpointing does decrease the amount lost as well as speed up the recovery process when the system is rebooted. Again, there is no such thing as a free lunch, and checkpointing does mean a small performance loss. Checkpointing is turned on by default on High Throughput Filesystem (HTFS), EAFS, AFS, and S51K filesystems.

For the best reliability and speed of recovery, it's a good idea to have both logging and checkpointing enabled. Although they both cause slight performance degradation, the benefits outweigh the performance hit. In most cases, the performance loss is not noticed, while the time required to bring the system back up is a lot shorter.


The idea of sync-on-close for the Desktop Filesystem (DTFS) is another way of increasing reliability. Whenever a file is closed, it is immediately written to disk, rather than waiting for the system to write it as it normally would (potentially 30 seconds later). If the system should go down improperly, you have a better chance of not losing data. Because you are no longer writing data to the hard disk in large chunks, sync-on-close also degrades performance.

Because I regularly suffer from digitalus enormus (fat fingers), I am often typing in things that I later regret. On a few occasions, I have entered rm commands with wild cards (*, for example) only to find that I had an extra space before the asterisk. As a result, I end up with a nice clean directory. Since I am not that stupid, I built an alias so that every time I used rm it would prompt me to confirm the removal (rm -i). My brother, on the other hand, created an alias where rm copies the files into a TRASH directory, which he needs to clean out regularly. Both of these solutions can help you recover from accidentally erasing files.

OpenServer has added something that means you no longer have to create aliases or take other steps to keep yourself from erasing things you shouldn't. This is the idea of file versioning. Not only does file versioning protect you from digitalus enormus, but it will also make automatic copies of files for you.

In order for versioning to be used, it must first be configured in the kernel. There are several kernel tunable parameters involved, so to change them you either run the program /etc/conf/cf.d/configure or click on the "Tune Parameters..." button in the Hardware/Kernel Manager. (The Hardware/Kernel Manager calls configure.) Next, select option 10 (Filesystem configuration). Here you will need to set the MAXVDEPTH parameter, which sets the maximum number of versions maintained, and the MINVTIME parameter, which sets the minimum time (in seconds) between changes before a file is versioned. Setting MAXVDEPTH to 0 disables versioning. If MINVTIME is set to 0 and MAXVDEPTH to a non-zero value, then versioning will happen no matter how short the time between versions. Versioning is only available for the DTFS and HTFS.
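From the command line, this simply means starting configure directly and working through its interactive menus:

/etc/conf/cf.d/configure

As with any change to kernel tunable parameters, you need to relink the kernel and reboot before the new values take effect.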

You can also set versioning for a filesystem by using the maxvdepth and minvtime options when mounting. These can be included in /etc/default/filesys (which defines the default behavior when mounting filesystems), or you can specify them on the command line when mounting the filesystem by hand. In addition to that, versioning can be set on a per-directory basis. This is done by using the undelete command. For example,


undelete -s /usr/jimmo/letters


This command line turns on versioning for all the files in the directory /usr/jimmo/letters as well as any child directories. This includes existing files and directories as well as ones created later. Note that even though the filesystem was not mounted with either the minvtime or maxvdepth options, you can still turn on versioning for individual directories, as long as it is configured in the kernel. Also, using the -v option to undelete, you can turn on versioning for single files.

When enabled, versioning is performed without any interaction from the user. If you delete or overwrite a file, you usually don't see anything. You can make the existing versions visible to you by setting the SHOWVERSIONS environment variable to 1 and then exporting it.
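In the Bourne shell, that is simply:

SHOWVERSIONS=1; export SHOWVERSIONS

With the variable exported, the older versions of your files show up in directory listings as well.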

The means of storing versions is quite simple. The names are appended with a semi-colon followed by the version number of the file, as in:

letter;12

This would be the 12th version of the file letter since versioning was enabled on the filesystem. Keep in mind that this does not mean that there are 12 versions. The number of available versions is defined by the MAXVDEPTH kernel parameter or mount option. If it is higher than 12, there just might be 12 versions. However, if it is set to a lower value, you will see at most MAXVDEPTH versions. Also keep in mind that you are not just maintaining a list of changes, but rather complete copies of each file.

For example, let's assume I mounted a filesystem with the option -o maxvdepth=10. The system will then save, at most, 10 versions. After I edit and save a file for a while, the version number might be up to 12. However, I will not be able to see or have access to versions lower than 3, since they have been removed from the system.
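Pulling the mount options together, a by-hand mount of such a filesystem might look something like the following. (The device name and mount point here are made up, and the exact way the options are combined may differ slightly on your release, so check the mount(ADM) man-page.)

mount -f HTFS -o maxvdepth=10,minvtime=60 /dev/u /u

The same options can, of course, be put in /etc/default/filesys if you want them applied every time the filesystem is mounted.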


Different file versions can not only be accessed when making copies or changes to existing files, but also when you remove them. Assume you have the three latest versions of a letter (letter;10, letter;11 and letter;12) as well as the current version, letter. If you remove letter, the three previous versions still exist. These can be seen by using the -l (list) option to undelete, either by specifying the file explicitly, as in:

undelete -l letter

or, if you leave off the file name, you will see all versions of all files. To undelete a versioned file or make the previous version the current one, simply leave off the options. If you repeatedly use undelete with just the file name, you can back up and make ever older versions the current one. Or, to make things easier, simply copy the older version to the current one, as in:

cp letter\;8 letter

This will make version 8 the current one. (NOTE: The \ is necessary to remove the special meaning of the semi-colon.)

With the first shipping version of OpenServer, there are some "issues" with versioning in that it does not behave as expected. One of the first things I noticed was that changing the kernel parameters MAXVDEPTH and MINVTIME does not turn on versioning. Instead, it allows versioning to be turned on. Without these parameters, you can't get versioning to work at all; even when they are set, you still need to use undelete -s on the directory.

There is more to it than that. However, I don't want to repeat too much information that's in the manuals. Therefore, take a look at the undelete(C) man-page.


There are other changes that have been made to the system. A couple of new filesystem types have been introduced, and new features have been added to the old filesystems. Table 0.4 contains an overview of some of the more significant aspects of the filesystems.


Filesystem Type:              Xenix   S51K    AFS     EAFS    HTFS    DTFS

Driver                        xx      ht      ht      ht      ht      dt
Max. fs size                  2Gb     2Gb     2Gb     2Gb     2Gb     2Gb
Max. file size                2Gb     2Gb     2Gb     2Gb     2Gb     2Gb
Max. inodes                   2^16    2^16    2^16    2^16    2^27    2^31
Clustering                    no      no      yes     yes     yes     yes
Long filenames                no      no      no      yes     yes     yes
Symbolic links                no      no      no      yes     yes     yes
Bootable                      yes     yes     yes     yes     no      no

New functionality in OpenServer 5

Symbolic links in inode       no      no      no      no      yes     yes
Intent logging                no      no      yes     yes     yes     no
Fast filesys. check           no      no      yes     yes     yes     no
Lazy block list evaluation    no      yes     yes     yes     yes     no
Temporary fs                  no      no      yes     yes     yes     no
Checkpointing                 no      no      yes     yes     yes     yes
Versioning                    no      no      no      no      yes     yes

Table 0.4 Filesystem Characteristics

High Throughput Filesystem

New to OpenServer is the introduction of a new filesystem device driver: ht. This new driver can handle not only filesystems with 16-bit inodes, like S51K, AFS and EAFS, but also the new HTFS, which can handle 32-bit inodes. Although (as of this writing) you cannot boot from an HTFS, it does provide some important performance and functionality gains.

One area that was changed is the total amount of information that can be stored on a single HTFS, as well as the total number of inodes that can be used. Table 0.4 contains a comparison of the various filesystem types and just how much data they can access.

Another new feature of the ht driver is lazy block evaluation. Previously, when a process was started with the exec() system call, the system would build a full list of the blocks that made up that program. This delayed the actual start-up of the process, but saved time as the program ran. Since a program spends most of its time executing the same instructions, much of the program is not used. That is, many of the blocks end up never being referenced. What lazy block evaluation does is build this list of blocks only as they are needed. This speeds up the start-up of the process, at the cost of small delays when a previously unreferenced block is first accessed.

Another gain is through "transaction based" processing of the filesystem. As transactions occur on the filesystem, they are gathered together in what is called an intent log, which we talked about earlier. If the system stops improperly, the system can use the intent log to determine how to proceed. Since you only need to check the log in order to clean the filesystem, this is quicker and also more reliable.

Another mechanism used to increase throughput is to disable checkpointing. This way, the filesystem will spend all of its time processing requests rather than updating the filesystem structures. Although this increases throughput, you obviously have the disadvantage of potentially losing data.

When dealing with aspects of the system like the print spooler or the mail system, where jobs are batch processed, the data is transient, so at any given moment it is less likely that anything of lasting value is sitting unwritten. Therefore you do not need the extra overhead of checkpointing.

This is done by treating the filesystem as "temporary". Such filesystems are mounted with the -o tmp option. Although this functionality is new to OpenServer, you can configure both AFS and EAFS filesystems as temporary, not just the HTFS. Keep in mind that certain applications like vi provide their own recovery mechanism by saving the data at regular intervals. If the files are written by vi, but not yet written to disk, a system crash could lose the last update.
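For example, mounting a spool filesystem as temporary by hand might look something like this (the device name and mount point here are invented for the example):

mount -f HTFS -o tmp /dev/spool /usr/spool

The same option can be placed in /etc/default/filesys if the filesystem should always be mounted that way.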

When I described the directory structure, I mentioned that each inode was represented by two bytes. This allows for only 64K worth of inodes. Since the HTFS can access 2^27 inodes and the DTFS can access 2^31, some other format is needed in the directories. With the two new filesystems, the key word is "extensible." This means that the structure can be extended as the requirements change. This allows much more efficient handling of long file names, as compared to the EAFS. In most cases, the filesystem driver is capable of making the translation for applications that don't understand the new concepts. However, if an application reads or writes the directory directly, you may need a newer version of the application.

The two new filesystems, HTFS and DTFS, can save space by storing symbolic links in the inode. If the path of the symbolic link is 108 characters or less, the DTFS will store the path within the inode and not in a disk block on the disk. For the HTFS, this limit is 52 characters. First, this saves space since no data blocks are needed; it also saves time, since once you have read the inode from the inode table, you have the path and do not need to access the disk again.


There are two issues to keep in mind. If you use relative paths instead of absolute paths, then you may end up with a shorter path that fits into the inode. This saves time when accessing the link. On the other hand, think back to our discussion on symbolic links. The behavior of each shell when crossing the links is different. If you fail to take this into account, you may end up somewhere other than where you expect.
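For example (the file names here are made up), a link created with a relative target can easily come in under the 52- or 108-character limit, while the same file referenced by its absolute path might not:

ln -s ../letters/proposal.txt shortlink
ln -s /usr/jimmo/data/correspondence/letters/1995/proposal_to_publisher.txt longlink

Both links point at the same file, but only the relative target is short enough to be sure of fitting within the inode on an HTFS.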


Desktop Filesystem

One of the problems that the advances in SCO OpenServer brought with them is the increased amount of hard disk space required to install it. On large servers with several gigabytes of space, this is less of an issue. However, on smaller desktop workstations it can become a significant problem.

Operating systems have been dealing with this issue for years. MS-DOS provides a potential solution in the form of its DoubleSpace disk compression program. Realizing the need for such space savings, SCO OpenServer provides a solution in the form of the new DTFS. Among the issues that need to be addressed are not only the saving of space, but also the reliability of the data and avoiding the performance degradation that occurs when compressed files need to be uncompressed. On fast CPUs with fast hard disks, the performance hit caused by the compression is hardly noticeable.

The first issue (saving space) is addressed by the DTFS in a couple of ways. The first is that files are compressed before they are written to the hard disk. This can save anywhere from just a few percent in the case of binary programs to over 50% for text files. What you get will depend on the data you are storing.

The second way space is saved is in how inodes are stored on the disk. With "traditional" filesystems such as S51K or EAFS, inodes are pre-allocated. That is, when the filesystem is first created, a table for the inodes is allocated at the beginning of the filesystem. This table is a fixed size, no matter how many or how few inodes are actually used. Inodes on a DTFS are allocated as needed. Therefore, there are only as many inodes as there are files.

As a result you never have any empty slots in the inode table. (Actually there is no inode table in the form we discussed for other filesystems. We'll get to this in a moment.) In order to distinguish these inodes from others, inodes on the DTFS are referred to as dtnodes.

Figure 0-7 The DTNODE map

The DTFS has many of the same features as the EAFS filesystem, such as file names up to 255 characters and symbolic links. In addition, the DTFS also has multiple compression algorithms, greater reliability (through the integrated kernel update daemon, which attempts to keep the filesystem in a stable state), and a dynamic block allocation algorithm that can automatically switch between best-fit and first-fit. Best-fit is where the system looks for an appropriately sized spot on the hard disk for the file, and first-fit is where the system takes the first spot that is large enough (even if it is much larger than necessary).

As one might expect, the disk layout is different from other filesystems. The first block (block 0) was historically the "boot block" and has been retained for compatibility purposes. The second block (block 1) is the super block and like other filesystems it contains global information about the filesystem.

Following the superblock is the block bitmap. There is one bit for each 512-byte data block in the filesystem, so the size of the bitmap will vary depending on the size of the filesystem. If the bit is on (1), the block is free; otherwise, the block is allocated.

The block bitmap is followed by the dtnode bitmap. Its size is the same as the block bitmap, since there is also one bit for each block. The difference is that these bits determine whether the corresponding block contains data or dtnodes. A 1 indicates the block contains dtnodes and a 0 indicates data. Following these two bitmaps are the actual data and dtnode blocks. Since the dtnodes are scattered throughout the filesystem, there is no inode table.

Unlike the inodes of other filesystems, dtnodes are not pre-allocated when the filesystem is created. Instead, they are allocated at the same time as the corresponding file. This has the potential for saving a great deal of space since every dtnode points to data in contrast to other filesystems where inodes may go unused and therefore the space they occupy is wasted.

The translation from dtnode number to disk block is straightforward: a dtnode has the same number as the block it resides in. For example, if block 1256 contained a dtnode, then that dtnode's number would be 1256. This means that since not all blocks contain dtnodes, not all dtnode numbers are used. The one exception to this is the root of the filesystem, whose dtnode number is stored in the superblock. Each dtnode is accessed through the dtnode map.

The contents of the superblock are defined in the header files located in <sys/fs/>. If you take a quick look at them, you see several important pieces of information. One of the most important is the size of the filesystem. Many of the other parameters included in this structure can be calculated from this value. These include the root dtnode number, the start of the bitmaps, the start of the data blocks, as well as the number of free blocks. Although these values can be calculated, storing them in the superblock as well saves time.

As I mentioned earlier, the block size of the DTFS varies in increments of 512 bytes between 512 and 4096. The reason for the range is that empirical studies have shown that filesystem throughput increases as the block size increases. However, in an effort to save space (a primary consideration in the DTFS), smaller block sizes were also allowed.

Before being written to the disk, regular files are compressed using one of two algorithms (one being "no compression"). Because of this compression, it is no longer possible to directly calculate a physical block on the hard disk based on the offset in the file. For example, let's assume a file that begins at block 142 of the filesystem. On a non-compressed filesystem, we could easily find byte 712, since block 0 of the file contains bytes 0-511 and block 1 contains bytes 512-1023. Therefore, byte 712 is in block 143 of the filesystem.
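If you want to check the arithmetic yourself, the uncompressed case is just an integer division added to the file's starting block, which you can do with expr:

expr 142 + 712 / 512

This prints 143, since 712 / 512 truncates to 1 whole block.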

However, if we have a compressed filesystem, there is no immediate way of knowing whether the compression was sufficient to place byte 712 into block 142, or whether it is still in block 143. We could start at the beginning of the file and calculate how much uncompressed data is in each block. Although this would eventually give us the correct block, the amount of time spent doing the calculations would more than eliminate the advantages gained by the compression.

In order to solve this problem, the structure on the hard disk is maintained in a structure called a B-tree. Without turning this book into a discussion on programming techniques, it is sufficient to say that the basic principle of a B-tree forces it to be balanced, therefore the depth of one leaf node is at most one level away from the depth of any other leaf node.

Conceptually, the B-tree works like this: let's assume a block 'a' is the root node. The block offset of every data block that is on the left-hand branch of 'a' is smaller than the block offset in 'a'. Also, the block offset of every data block that is on the right-hand branch of 'a' is larger than the block offset in 'a'. This then applies to all subsequent blocks, where the left-hand branch is smaller and the right-hand branch is larger.

In order to find a particular offset in the file you start at the top of the tree and work down. If the block offset is less than the root, you go down the left hand branch. Likewise, if the block offset is greater you go down the right hand branch. Although you still have to traverse the tree, the amount you have to search is far less than a pure linear search. Each node has a pointer to both the previous and the next nodes. This allows traversal of the tree in both directions.

Regular files are the only ones that are compressed. Although supported, symbolic links and device nodes are left as they are, since you don't save any space. If a symbolic link is smaller than 192 bytes, the name is actually stored within the dtnode. The size of blocks containing directories is fixed at 512 since directories are typically small. Long names are allowed on the DTFS up to a maximum of 255 characters (plus the terminating NULL). One interesting aspect is the layout of the directory structure. This is substantially different than on the (E)AFS. Among other things there are entries for the size of the filename and size of the directory entry itself.

The DTFS has several built-in features that provide certain protections. The first is a technique called "shadow paging." When data is about to be modified, extra blocks are allocated that "shadow" the blocks that are going to be changed. The changes are then made to this shadow. Once the change is complete, the changed blocks "replace" the previous blocks in the tree and the old blocks are then freed up. This is also how the dtnode blocks are modified, except that the shadow is contained within the same physical block.

If something should happen before the new, changed block replaces the old one, then it is as if the change was never started. This is because the file has no knowledge of anything ever happening to it. This is unlike an EAFS, AFS or other "traditional" UNIX filesystem, where changes are made to blocks that are already a part of the file; if the system should go down in the middle of writing, then the data is, at best, inconsistent or, at worst, trashed. Obviously, in both cases, once the changes are complete, the file remains unaffected no matter what happens afterwards.

Figure 0-8 Updating blocks on a DTFS

Also unique to the DTFS is the way the dtnodes are updated. If you look at the structure and count up the number of bytes, you find that the amount of data each dtnode takes up is less than half the size of the block (512 bytes). The other half is used as the shadow for that dtnode. When the dtnode gets updated, the other half is written to first; only after the information is safely written does the new half become "active". Here again, if the system crashes before the transaction is complete, it appears as if nothing was ever started. By comparing the timestamps, we can tell which half is active.

Other than saving space, there is another reason for splitting the block in half. Remember that the dtnode points to the nodes that are both above it and below it in the tree. Assume we didn't shadow the dtnode. When one dtnode gets updated, it would get replaced by a new node. Now the nodes above it and below it need to be modified to point to this new node. In order to update them, we have to copy them to new blocks as well. Now, the nodes pointing to these blocks need to get updated. This "ripples" in both directions until the entire tree is updated. Quite a waste of time.

Another technique used to increase reliability is the update daemon (htepi_daemon). Once per second, the update daemon checks every writeable filesystem. If the update daemon writes out all the data to that filesystem before another process writes to it, the update daemon can write out the superblock as well and then mark the filesystem as clean. If the system were to crash before another process made a write to that filesystem, then it would still be clean and therefore no fsck would be necessary.

Built into the dtnode is also a pointer to the parent directory of that dtnode. This has a very significant advantage when the system crashes and the directory entry for a particular file gets trashed. In traditional SCO filesystems, if this happened, there would be files without names, and when fsck ran they would be placed in lost+found. Now, since each file knows who its parent is, the directory structure can easily be rebuilt. That's why there is no longer a lost+found directory.


High-Performance Pipe System

In the section on operating system basics I introduced the concept of a pipe. We all (hopefully) know it through the many commands we have seen in this book. For example, if I want to see the long listing of some directory a screen at a time, I can issue the command:

ls -l | more

As I mentioned, there are actually data blocks taken up on the hard disk to store the data while the system is waiting for the receiving side to read it. For all intents and purposes, this is a real file. It contains data (usually) and it has an inode. The only difference is that, unless it is a named pipe, it has no entry in any directory and therefore no file name. When the system goes down by accident and cannot close the pipes, fsck will report them as unreferenced files. This is very disconcerting to many users, as they see a long list of unreferenced files when fsck runs after a crash.

This represents only one of the problems with traditional pipes. The other is the fact that these pipes exist on the hard disk. When the first process writes to the pipe, there is a disk access. When the second process reads it, there is another disk access. Since disk access is a bottleneck on most systems, this can be a problem. (NOTE: This ignores the existence of the buffer cache. However, if sufficient time passes between the write and the subsequent read, then the buffer cache will no longer contain the data and two disk accesses are necessary.)

SCO OpenServer has done something to correct that. This is the High Performance Pipe System (HPPS). The primary difference between the HPPS and conventional pipes is that HPPS pipes no longer exist on the hard disk. Instead, they are maintained solely within the kernel as buffers. This corrects the two previously discussed disadvantages of conventional pipes. First, when the system goes down, the pipes simply disappear. Second, since there is no disk interaction, there is never any performance slow-down as a result.

Like traditional pipes, when an HPPS pipe is created, an inode is created with it. This inode contains the information necessary to administer that pipe.

Virtual Disks

One of the major additions to OpenServer is the idea of "virtual disks". These can come in many forms and sizes, each providing its own special benefits and advantages. To the running program (whether it is an application or system command), these disks appear like any other. As a user, the only difference you see is perhaps in the performance improvements that some of these virtual disks can yield.

There are several different kinds of virtual disks which can be used, depending on your needs. For example, you may be running a database that requires more contiguous space than you have on any one drive. Pieces of different drives can be configured into a single, larger drive. If you need a quicker way of recovering from a hard disk crash, you can mirror your disks, where one disk is an exact copy of the other. This also increases performance, since you can read from either disk. Performance can also be increased by striping your disks. This is where portions of the logical disk are spread across multiple physical disks. Data can be written to and read from the disks in parallel, thereby increasing performance. Some of these can even be combined.

Underlying many of the virtual disk types is the concept of RAID. RAID is an acronym for Redundant Array of Inexpensive Disks. Originally, the idea was that you would get better performance and reliability from several less expensive drives linked together than from a single, more expensive drive. The key change in the entire concept is that hard disk prices have dropped so dramatically that RAID is no longer concerned with inexpensive drives. So much so, that the I in RAID is often interpreted as meaning "Intelligent" rather than "Inexpensive."

In the original paper that defined RAID, there were five levels. Since that paper was written, the concept has been expanded and revised. In some cases, characteristics of the original levels are combined to form new levels.

Two concepts are key to understanding RAID: redundancy and parity. The concept of parity is no different from that used in serial communication, except that the parity in a RAID system can be used not only to detect errors, but to correct them. This is because more than just a single bit is used per byte of data. The parity information is stored on a drive separate from the data. When an error is detected, the information from the good drives, plus the parity information, is used to correct the error. It is also possible to have an entire drive fail completely and still be able to continue working. Usually the drive can be replaced and the information on it rebuilt even while the system is running. Redundancy is the idea that all information is duplicated. If you have a system where one disk is an exact copy of another, one disk is redundant for the other.

In some cases, drives can be replaced even while the system is running. This is the concept of a hot spare, which is configured from the Virtual Disk Manager. Some hardware vendors even provide the ability to physically remove the drive from the system without having to shut the system down. This is called a hot swap. All the control for the hard disks is done by the hard disk controller, and the operating system sees only a single hard disk. In either case, data is recreated on the spare as the system is running.

Keep in mind that SCO does not directly support hot swapping. This must be supported by the hardware in order to ensure the integrity and safety of your data.

SCO's implementation of RAID is purely software. This makes sense, since SCO is a software company. Other companies provide hardware solutions. In many cases, hardware implementations of RAID present a single, logical drive to the operating system. In other words, the operating system is not even aware of the RAIDness of the drives it is running on.

In Figure 0-9 we see how the different layers of a virtual drive are related. When an application (vi, a shell, cpio) accesses a file, it makes a system call. Depending on whether you are accessing the file through a raw device or through the filesystem code, the application uses the character or block code within the device driver. The device driver it accesses at this point is for the virtual disk. The virtual disk then accesses the device driver for the physical hard disk.

What device the virtual device driver accesses depends on how it is configured. (Which we will get to shortly.) The interesting thing is that the virtual disk driver accesses the device in the same way, regardless of what type of disk it is. Accessing the physical disk is the problem of the physical disk driver and not the virtual disk driver. It is therefore possible to have virtual disks composed of different types of disks, as well as disks on different controllers.

The simplest virtual disk is called (what else) a simple disk. With this, you can define all your non-root filesystem space as a single virtual disk. This can be done to existing filesystems, and not only provides more efficient storage, but using virtual disks instead of conventional filesystems also makes it easier to change to the more complex virtual disks. This is because you cannot add existing filesystems to virtual disks directly; they must first be converted to simple disks.

A concatenated disk is created when two or more disk pieces are combined. In this way, you can create logical disks that are larger than any single disk. Disks that are concatenated together do not need to be the same size. The total available space is simply the sum of all the concatenated pieces. New pieces cannot be added to a concatenated disk once the filesystem is created. Remember that the filesystem sees this as a logical drive. Division and inode tables are based on the size of the drive when it is added to the system, so adding a new piece would require you to recreate the filesystem.

A striped array is also referred to as RAID 0 or RAID Level 0. Here, portions of the data are written to and read from multiple disks in parallel, which greatly increases the speed at which data can be accessed. With two disks, for example, half of the data is being read or written by each hard disk, which cuts the access time almost in half. The amount of data that is written to a single disk is referred to as the stripe width. For example, if single blocks are written to each disk, then the stripe width would be a block.

Figure 0-9 Virtual Disk Layers

This type of virtual disk provides increased performance, since data is being read from multiple disks simultaneously. Since there is no parity to update when data is written, this is faster than a system using parity. However, the drawback is that there is no redundancy. If one disk goes out, then data is probably lost. Such a system is more suited for organizations where speed is more important than reliability.

Keep in mind that data is written to all the physical drives each time data is written to the logical disk. Therefore, the pieces must all be the same size. For example, you could not have one piece that was 500Mb and a second piece that was only 400Mb. (Where would the other 100Mb be written?) Here again, the total amount of space available is the sum of all the pieces.

Disk mirroring (also referred to as RAID 1) is where data from the first drive is duplicated on the second drive. When data is written to the primary drive, it is automatically written to the secondary drive as well. Although this slows things down a bit when data is written, when data is read it can be read from either disk, thus increasing performance. Mirrored systems are best employed where there is a large database application


Figure 0-10 Striped Array With No Parity (RAID 0)

and availability of the data (transaction speed and reliability) is more important than storage efficiency. Another consideration is the speed of the system. Since it takes longer than normal to write data, mirrored systems are better suited to database applications where queries are more common than updates.

As of this writing, OpenServer does not provide for mirroring of the /dev/stand filesystem. Therefore, you will need to copy this information somewhere else. One solution would be for you to create a copy of the /dev/stand filesystem on the mirror drive yourself. I have been told by people at SCO that an Extended Functionality Supplement (EFS) is planned to allow you to mirror /dev/stand and boot from it, as well.

The term used for RAID 4 is a block-interleaved undistributed parity array. Like RAID 0, RAID 4 is also based on striping, but redundancy is built in, with parity information written to a separate drive. The term "undistributed" is used since a single drive is used to store the parity information. If one drive fails (or even a portion of the drive), the missing data can be recreated using the information on the parity disk. It is possible to continue working even with one drive inoperable, since the parity drive is used on-the-fly to recreate the data. Even data written while the drive is out is still valid, since the parity information is updated as well. This is not intended as a means of running your system indefinitely with a drive missing, but rather it gives you the chance to stop your system gracefully.


Figure 0-11 Striped Array with Undistributed Parity (RAID 4)

RAID 5 takes this one step further and distributes the parity information across all drives. For example, the parity drive for block 1 might be drive 5, while the parity drive for block 2 is drive 4. With RAID 4, the single parity drive was accessed on every single data write, which decreased overall performance. Since data and parity are interspersed on a RAID 5 system, no single drive is overburdened. In both cases, the parity information is generated during the write, and should a drive go out, the missing data can be recreated. Here again, you can recreate the data while the system is running if a hot spare is used.

Figure 0-12 Striped Array With Distributed Parity (RAID 5)

As I mentioned before, some of these characteristics can be combined. For example, it is not uncommon to have striped arrays mirrored as well. This provides the speed of a striped array with the redundancy of a mirrored array, without the expense necessary to implement RAID 5. Such a system would probably be referred to as RAID 10 (RAID 1 plus RAID 0). All of these are configured and administered using the Virtual Disk Manager, which then calls the dkconfig utility. It is advised that, at first, you use the Virtual Disk Manager, since it is easier to use. However, once you get the hang of things, there is nothing wrong with using dkconfig directly.

The information for each virtual disk is kept in the /etc/dktab file and is used by dkconfig to administer virtual disks. Each entry is made up of two lines. The first is the virtual disk declaration line. This is followed by one or more virtual piece definition lines.

Here we have an example of an entry in a dktab file that would be used to create a 1 GB array. (This is RAID 5)


/dev/dsk/vdisk1 array 5 16

/dev/dsk/1s1 100 492000

/dev/dsk/2s1 100 492000

/dev/dsk/3s1 100 492000

/dev/dsk/4s1 100 492000

/dev/dsk/5s1 100 492000

The first line is the virtual disk declaration line, and it varies in the number of fields depending on what type of virtual disk it is. In each case, the first entry is the device name for the virtual device, followed by its type. For example, if you have a simple virtual disk, there is only the device name followed by the type (simple). Here, we are creating a disk array, so we have array in the type field.

A simple disk consists of just a single piece. The other types, such as mirror, concatenated, etc., require a third field to indicate how many pieces (simple disks) go into making up the virtual disk. Since we are creating a disk out of five pieces, this value is 5.

If you use striped disks or disk arrays, then the fourth field defines the size of the cluster in 512-byte blocks. We are using a value of 16, therefore we have an 8K cluster size. If you have mirrored disks, then the fourth field is the "catch up" block size and is used when the system is being restored.

The virtual piece definition line describes a piece of the virtual disk. (In this case we have five pieces.) It consists of three fields. The first is the device node of the physical device. Note that in our case, each of the pieces is on a separate physical drive. (We know this because of the device names 1s1-5s1.)

The second field is the offset from the beginning of the physical device at which the disk piece starts. Be sure you leave enough room so that you start beyond the division and bad track tables. Here we are using a value of 100, and since the units are disk blocks (512 bytes), we are starting 50K from the beginning of the partition, which is plenty of room.

The third field is the length of the disk piece. Here you need to be sure that you do not go beyond the end of the physical device. In our case we are specifying 492000, which is also in disk blocks. Therefore, each of the physical pieces is just under 250Mb. Since one piece's worth of space goes to parity on a RAID 5 array, the actual amount of storage we get is the sum of four of the pieces: just under 1000Mb, or 1Gb.

To change this array to RAID 4, where a single drive is used solely for parity, we could add a fourth field to one of the virtual piece definition lines. For example, if we wanted to turn drive three into the parity drive, we would change its line to look like this:


/dev/dsk/3s1 100 492000 parity

Okay, so you've decided that you need to increase performance or reliability (or both) and have decided to implement a virtual disk scheme. Well, which one? Before you decide, there are several things you need to consider. The System Administrators Guide contains a checklist of things to consider when deciding which is best for you.

Things to Consider

If you create an emergency boot/root floppy on a system with virtual disks, there are a couple of things to remember. First, once you create a virtual disk, you should create a new boot/root floppy set. This is especially important if the virtual disk you are adding is a mirror of the root disk. If you do not, and later need to boot from the floppy, then any changes made to the root filesystem will not be made to the mirror. The drives will then be inconsistent.

In order to boot correctly, you need to change the default boot string. Normally, the default boot string points to hd(40) for the root filesystem. Instead, you need to change it to reflect the fact that the root filesystem is mirrored. For example, you could use the string:

fd(60)unix.z root=vdisk(1) swap=none dump=none

This tells the system to use virtual disk 1 as the root filesystem.

Note also that the device names are probably different from one machine to another. Therefore, it may not be possible to use the boot/root floppy set from one machine on another.

It's also possible to "nest" virtual drives. For example, you could have several drives that you make into a striped array. This striped array is seen as a single drive, which you can then include in another virtual disk. For example, you could mirror that striped array.

Be careful with this, however. It is not recommended that you nest virtual disks with redundancy (mirrored, RAID 5) inside other virtual disks. This can cause the virtual disk driver to hang, preventing access to all virtual drives.

Accessing DOS Files

Even on a standard SCO UNIX system, without all the bells and whistles of TCP/IP, X-Windows, and SCO Merge, there are several tools that you can use to access DOS filesystems. These can be found on the doscmd(C) man-page. Although these tools have some obvious limitations due to the differences in the two operating systems, they provide the mechanism to exchange data between the two systems.

Copying files between DOS and UNIX systems presents a unique set of problems. One of the most commonly misunderstood aspects of this is using wildcards to copy files from DOS to UNIX. This we can do using the doscp command, for example:

doscp a:* .

One might think that this command copies all the files from the a: drive (which is assumed to be DOS formatted) into the current directory. The first problem is the way that DOS interprets wildcards. Using a single asterisk would only match files without an extension. For example, it would match LETTER, but not LETTER.TXT. So, if we expand the wildcard to include the possibility of extensions, we get:

doscp a:*.* .

This should copy everything from the floppy into the current directory. Unfortunately, that's not the way it works either. Instead of the message:

doscp: /dev/install:* not found

You get the slight variation:

doscp: /dev/install:*.* not found

Remember from our discussion of shell basics that it is the shell that does wildcard expansion. Since nothing in the current directory matches, the pattern is passed to doscp unchanged, and doscp can't find anything by that name, so we get this error. The solution to the problem was a little shell script that does a listing of the DOS device. Before we go on, we need to side-step a little. There are two ways to get a directory listing off a DOS disk. The first is with the dosdir command. This gives you output that appears just as if you had run the dir command under native DOS. In order to use this output, we would have to parse each line to get the file name. Not an easy thing. The other is dosls, which gives a listing that looks like the output of the UNIX ls command. Here you have a single column of file names with nothing else. Much easier to parse. The problem is that the file names come off in capital letters. Although this is not a major problem, I like to keep my file names as consistent as possible. Therefore, I want to convert them to lower case.

Skipping the normal things I put into scripts like usage messages and argument checking, the script could look like this:

DIR=$1

dosls $DIR | while read file
do
    echo "$file"
    doscp "$DIR/$file" `echo $file | tr "[A-Z]" "[a-z]"`    # note the back-ticks
done

The script takes a single argument, which is assigned to the DIR variable. We then do a dosls of that directory, which is piped to the read. If we think back to the section on shell programming, we know that this construct reads input from the previous command (in this case dosls) until the output ends. Next, we have a do-done loop that is executed once for each line. In the loop, we echo the name of the file (I like seeing what's going on) and then do the doscp.

The doscp line is a little complex. The first part ($DIR/$file) is the source file. The second part, as you would guess, is the destination file. However, the syntax here gets a little tricky. Remember that the back-ticks mean "the output of the command". Here, that command is echo piped through tr. Note that we are echoing the file name through tr and not the contents of the file. The name is then translated in such a way that all capital letters are converted to lower case. See the tr(C) man-page for more details.
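If this loop were saved in a script called, say, getdos (the name is made up for the example) and made executable, using it would look something like this:

chmod +x getdos
./getdos a:

Each file on the floppy is then copied into the current directory with its name converted to lower case.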

To go the other way (UNIX to DOS), we don't have that problem. Wild cards are expanded correctly, so we end up with the right files. In addition, we don't need to worry about the names, since they are converted for us. The problem lies in names that do not fit into the DOS 8.3 standard. If a name is longer than eight characters or the extension is longer than three characters, it is simply truncated. For example, the name letter_to_jim.txt ends up as letter_t.txt, or letter.to.jim becomes letter.to.

One thing to keep in mind here is that copying files like this is only really useful for text files and data files. You could use it to copy executables to your SCO system if you were running SCO Merge, for example. However, this process does not convert a DOS executable into a form that native SCO can understand, nor vice-versa.

Be careful when copying files because of the conversions that are made. With UNIX text files, each line ends with a new-line (NL) character, whereas DOS ends each line with a carriage return-new line (CR-NL) pair. You can ensure that when copying files from DOS to UNIX the CR-NL pair is converted to simply a NL by using the -m option to doscp. This also ensures that the NL is converted to a CR-NL pair when copying the other way. If you want to ensure that no conversion is made, use the -r option.

You can also make the conversion using either the xtod or dtox commands. The xtod command converts UNIX files to DOS format and the dtox converts DOS format files to UNIX. In both cases, the command takes a single argument and outputs to stdout. Therefore, to actually "copy" to a file you need to re-direct stdout.
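Since both commands write to stdout, a typical invocation just redirects the output into a new file (the file names here are made up):

dtox salesfig.txt > salesfig.unix
xtod report.txt > report.dos

The original file is left untouched; only the copy has its line endings converted.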

An alternative to doscp is to mount the DOS disk. Afterwards you can use standard UNIX commands like cp to copy files. Although this isn't the best idea for floppies, it is wonderful for DOS hard disks. In fact, I have it configured so that all of my DOS file systems are mounted automatically via /etc/default/filesys. To be able to do this, you have to add the support for it in the kernel. Fortunately, it is simply a switch that is turned on or off via the mkdev dos script. Since it makes changes to the kernel, you need to relink and reboot.

Once you have run mkdev dos, you can mount a DOS filesystem by hand or, as I said, through /etc/default/filesys. For example, if we wanted to mount the first DOS partition on the first drive, you have two choices of devices: /dev/hd0d or /dev/dsk/0sC. I prefer the latter, since I have several DOS partitions and some do not have an equivalent in the first form. Therefore, by using /dev/dsk/0sC, I am consistent in the names I use. If we wanted to mount it onto /usr/dos/c_drive, the command would be:

mount -f DOS /dev/dsk/0sC /usr/dos/c_drive

The only issue with this is that in ODT, the file names were all capitalized. In OpenServer, there is the lower option, which is used to show all the file names in lower case. Therefore, the command would look like this:

mount -f DOS -o lower /dev/dsk/0sC /usr/dos/c_drive

Although you can use the mkdev fs script to add a DOS filesystem, it displays a couple of annoying messages. Since I think it is just as easy to edit /etc/default/filesys, I do so. There is also the issue that certain options are not possible through the mkdev fs script or the Filesystem Manager. Therefore, I simply copy an existing entry and end up with something like this:

bdev=/dev/dsk/0sC cdev=/dev/rdsk/0sC \

mountdir=/usr/dos/c_drive mount=yes fstyp=DOS,lower \

fsck=no fsckflags= rcmount=yes \

rcfsck=no mountflags=

The key point is the fstyp entry. Since we can specify mount options here, I specified the lower option so that all filenames would come out in lower case. Each time I go into multi-user mode, this filesystem is mounted for me. For more details on the options here, check out the mount(ADM) man-page or the section on filesystems. (Note: The lower option is only available in OpenServer.)

Keep in mind that if the DOS filesystem that you are mounting contains a compressed volume, you will not see the files within the compressed volume. This applies to both ODT and OpenServer.

Another of the DOS commands that I use often is dosformat. Although there are a few options (-v to prompt for a volume name, -q for quiet mode, -f to run in non-interactive mode), I have never used them. The one thing I need to point out is that you format a UNIX floppy with the raw device (e.g. /dev/rfd0), but with dosformat, you format the block device (e.g. /dev/fd0).
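So, to give the floppy in the first drive a DOS format, the command is simply:

dosformat /dev/fd0

The raw device (/dev/rfd0) is the one you would use when giving the same floppy a UNIX format.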

The remaining commands, which I use only on occasion, are:

dosrm - removes files from a DOS filesystem

dosmkdir - makes a directory on a DOS filesystem

dosrmdir - removes directories from a DOS filesystem


Go Look

As with the kernel components, I suggest you go poke around the system a little. Take a look at the files on your system that we talked about to see what filesystems you have, where they are mounted and anything else you can find out about your system. Look for different kinds of files. If you find hard links, try to find out what other files are linked to them. If you find a symbolic link, take a look at the file pointed to by that symbolic link.

In every case, look at the file permissions. Think about why they are set that way and what influence this has on their behavior. Also think about the kind of file it is and who can access it. If you aren't sure what kind of file it is, you can use the file command to find out for you.



Next: Starting and Stopping the System

Index

Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/

