Jim Mohr's SCO Companion

Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of the author.

Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/

File Systems and Files

Any time you access an SCO system, whether locally, across a network or through any other means, both files and filesystems are involved. Every program that you run starts out as a file. Most of the time you are also reading or writing a file. Since files (whether programs or data files) reside on filesystems, every time you access the system you are also accessing a filesystem.

Knowing what a file is, how it is represented on the disk and how the system interprets its contents helps you understand what the system is doing. You can also use this understanding to evaluate both system and application behavior and determine whether it is proper.

Disk Layout

In order to be able to access data on your hard disk, there has to be some pre-defined structure. Without structure, it ends up looking like my desk, where there are several piles of papers and I have to look through every pile in order to find what I am looking for. Instead, the layout of a hard disk follows a very consistent pattern, so consistent that it is even possible for different operating systems to share the hard disk.

Basic to this structure is the concept of a partition. A partition defines a portion of the hard disk to be used by one operating system or another. The partition can be any size, including the entire hard disk. Near the very beginning of the disk is the partition table. The partition table is only 512 bytes, but can still define where each partition begins and how large it is. In addition, the partition table indicates which of the partitions is active. This decides which partition the system should go to when looking for an operating system to boot. The partition table is outside of any partition.
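To make the idea of a partition table more concrete, here is a minimal Python sketch that pulls the entries out of a 512-byte first sector. It assumes the conventional PC layout (four 16-byte entries starting at byte 446, a boot flag of 0x80 marking the active partition, and a two-byte signature at the end of the sector), none of which is spelled out in the text above. The function names and the demo values are hypothetical, and the sketch works on a sector built in memory rather than on a real disk.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # assumed: conventional PC layout, not stated in the text
ENTRY_SIZE = 16
ACTIVE_FLAG = 0x80   # assumed: boot flag value that marks the active partition


def parse_partition_table(sector):
    """Return one dict per non-empty partition entry in a 512-byte sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing boot signature")
    partitions = []
    for slot in range(4):
        start = TABLE_OFFSET + slot * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # boot flag, 3 skipped bytes, type byte, 3 skipped bytes,
        # first sector of the partition, number of sectors
        flag, ptype, first, count = struct.unpack("<B3xB3xII", raw)
        if ptype == 0:
            continue  # empty slot
        partitions.append({
            "slot": slot,
            "active": flag == ACTIVE_FLAG,
            "type": ptype,
            "start_sector": first,
            "sectors": count,
        })
    return partitions


def make_demo_sector():
    """Build a synthetic sector with one active partition (made-up values)."""
    sector = bytearray(SECTOR_SIZE)
    entry = struct.pack("<B3xB3xII", ACTIVE_FLAG, 0x63, 63, 1000000)
    sector[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = entry
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


if __name__ == "__main__":
    for part in parse_partition_table(make_demo_sector()):
        print(part)
```

The point is only that each entry records where a partition begins, how large it is, and whether it is the active one, which is exactly the information described above.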

Once the system has determined which partition is active, the CPU knows to go to the very first block of data within that partition and begin executing the instructions there. On an SCO system this is an area called boot0. Although there are only 512 bytes of data in boot0, 1024 bytes are reserved for it. The code within boot0 is sufficient to execute the code in the next block, boot1. Here, 20Kb are reserved, although the actual code is slightly less. The code within boot1 is what reads the /boot program, which will eventually load the kernel.

Immediately after boot1 is the division table. Under SCO, a division is a unit of the hard disk contained within a partition. A division can be any size, including the entire partition. Often, special control structures are created at the beginning of the division that impose an additional structure on that division. This structure makes the division a filesystem. In order to keep track of where each division starts and how big it is, the system uses the division table. The division table has functionality similar to that of a partition table, although there is no such thing as an "active" division. There can be up to seven divisions (and therefore seven filesystems) per partition. The size of the division table is fixed at 130 bytes, although 1024 bytes are reserved for the table.

Just after the division table is the bad track table. A bad track is a portion of the hard disk that has become unusable. Immediately following the bad track table is an area that is used for alias tracks. These are the tracks that are used when one of the other tracks goes bad. If that occurs, the operating system marks the bad track as such in the bad track table and indicates which of the alias tracks will be used. The size of the area taken up by the alias tracks is determined by how many entries are in the bad track table. (There is one alias track per table entry.) You can see the contents of your bad track table by using the badtrk utility. Once the table and alias tracks have been defined, you cannot increase the number without re-installing.

After the bad track table and its alias tracks come the divisions. If you have one of the older SCO UNIX filesystems (AFS, EAFS), there are two control structures at the beginning of the filesystem: the superblock and the inode table. The superblock contains information about the type of filesystem, its size, how many data blocks there are, the number of free inodes, the free space available and where the inode table is.

Many users are not aware of the fact that different filesystems reside on different parts of the hard disk and in many cases on different physical disks. From the user's perspective the entire directory structure is one unit from the top (/) down to the deepest sub-directory. In order to carry out this deception, the system administrator needs to mount filesystems. This is done by mounting the device node associated with the filesystem (e.g. /dev/u) onto a mountpoint (e.g. /u). This can be done either by hand, with the mount command, or by having the system do it for you when booting. The latter is done with entries in /etc/default/filesys. See the mount(ADM) and the filesys(F) man-pages for more details.

Conceptually, the mountpoint serves as a detour sign for the system. If there is no filesystem mounted on the mountpoint, the system can just drive through and access what's there. If a filesystem is mounted, when the system gets to the mountpoint it sees the detour sign and is immediately diverted in another direction. Just as the road, trees and houses still exist on the other side of the detour sign, any file or directory that exists underneath the mountpoint is still there. You just can't get to it.

Let's look at an example. We have the /dev/u filesystem, which we are mounting on /u. Let's say that when we first installed the system, and before we first mounted the /dev/u filesystem, we created some users with their home directories in /u, for example /u/jimmo. When we do finally mount the /dev/u filesystem onto the /u directory, we no longer see /u/jimmo. It is still there; however, once the system reaches the /u directory it is redirected somewhere else.

This brings up an interesting phenomenon. When you use find to locate a file, it will reach the mountpoint and get redirected. However, ncheck is not file and directory oriented, but rather filesystem oriented. If you used find you would not see /u/jimmo. However, you would if you used ncheck!

When a filesystem is mounted, the kernel reads the filesystem's superblock into an internal copy of the superblock. This way, the kernel doesn't have to keep going back to the hard disk for this information.

Figure 0-1 Boot Hard Disk Layout

The inode is (perhaps) the most important structure. It contains all the information about a file, including owner and group, permissions, timestamps, and most importantly, where the data blocks are on the hard disk. The only thing it's missing is the name of the file. That's stored in the directory and not in the inode.

If you have a Desktop Filesystem (DTFS), then there is no inode table. Rather, the inodes are scattered across the disk. How they are accessed, we'll get into later when we talk about the different filesystems.

After the superblock (and inode table, if there is one) you get to the actual data. Data is stored in a system of files within each filesystem (hence the name). As we talked about before in the section on SCO basics, files are grouped together into directories. This grouping is completely theoretical in the sense that there is nothing physically associating the files. Not only can files in the same directory be spread out across the disk, it is possible that the individual data blocks of a file are scattered as well.

Figure 0-1 shows you where all the structures are on the hard disk.


In most systems, there will be at least two divisions on your root hard disk. On ODT 3.0 systems these divisions will contain your root filesystem and your swap space. Although it takes up a division, just like the root filesystem, your swap space is not a filesystem. This is because it has none of the control structures (superblock, inode table) that make it a filesystem. Despite this, there must still be an entry in the division table for it. In OpenServer, there is a new filesystem at the beginning of the partition and the root filesystem is moved to a place after the swap space. We'll go into more details later. (NOTE: Whether you have the extra division will depend on what kind of installation you did. We'll cover this in more detail in chapter 13.)

Up to this point we've talked a great deal about both files and directories, where they reside and what their attributes (characteristics) are. Now it's time to talk about the concepts of files and directories. We need to talk about how the operating system sees files and directories and how the system manages them.

From our discussion of how a hard disk is divided, we know that files reside within filesystems. Each filesystem has special structures that allow the system to manage the files. These are the superblock and inodes. The actual data is stored somewhere in the filesystem in data blocks. Most SCO UNIX filesystems use a block size of 1024 bytes. If you have OpenServer, the new DTFS has a variable block size.

Every SCO UNIX filesystem uses inodes, which, as I mentioned earlier, contain the important information about a file. (In some books, inode is short for information node and in others it is short for index node.) Although the structure of the inodes is different for each filesystem, they hold the same kinds of information. What each contains can be found in <sys/ino.h>, <sys/inode.h> and <sys/fs/*>. Each inode has pointers which tell it where the actual data is located. How this is done is dependent on the filesystem type and we'll get to that in a moment. One piece of information that the inode does not contain is the file name. This is contained only within the directory.

If you are running OpenServer, then there are at least three divisions used. The first one (slot 0 in the division table) is used for the /dev/boot filesystem. This contains the files that are necessary to load and start the operating system. Although this is what is used to start the system, this is not the root filesystem. The root filesystem has been moved to the third division. Once the system has been loaded, the /dev/boot filesystem is mounted onto the /stand directory and is accessible like any other mounted filesystem, except for the fact that it is normally mounted read-only. In both ODT 3.0 and OpenServer, the root filesystem normally contains most of the files your operating system uses.

Depending on the size of your primary hard disk and the configuration options you chose during installation, you may have more than just these default filesystems. Common configurations include having separate filesystems for users' home directories or data.

Type       Description
EAFS       Extended Acer Fast Filesystem (default)
AFS        Acer Fast Filesystem
S51K       AT&T UNIX System V 1KB filesystem
HS         High Sierra CD-ROM Filesystem
ISO9660    ISO 9660 CD-ROM Filesystem
XENIX      XENIX Filesystem
DOS        DOS Filesystem
NFS        Network Filesystem

Table 0.1 Filesystems Supported by ODT 3.0

Type       Description
HTFS       High Throughput Filesystem (default)
DTFS       Desktop Filesystem (Compression)
EAFS       Extended Acer Fast Filesystem
AFS        Acer Fast Filesystem
S51K       AT&T UNIX System V 1KB filesystem
HS         High Sierra CD-ROM Filesystem
ISO9660    ISO 9660 CD-ROM Filesystem
XENIX      XENIX Filesystem
DOS        DOS Filesystem
NFS        Network Filesystem
Rockridge  Rockridge CD-ROM Filesystem
NetWare    SCO Gateway for NetWare Filesystem
LMCFS      LAN Manager Client Filesystems

Table 0.2 Filesystems Supported by OpenServer

Although all of these filesystems are supported, not all are configured into your kernel by default. If you have ODT 3.0, you automatically have support for the three standard UNIX filesystems (EAFS, AFS and S51K) as well as the XENIX filesystem. In OpenServer, the three UNIX filesystems supported in ODT are included, as well as the XENIX filesystem and the two new ones: HTFS and DTFS. In order to be recognized, the others must first be configured into the kernel. How this is accomplished depends on what product you are running and which filesystem you want. Table 0.1 and Table 0.2 show what filesystems are supported.

If you want to use one of the network filesystems, such as NFS, SCO Gateway for NetWare or LAN Manager Client Filesystem, you need to add that product through the Software Manager. This will automatically add support into the kernel for the appropriate filesystem.

If you have ODT 3.0, then you can use sysadmsh to add the driver for each of the other filesystems. On OpenServer, you use the Hardware/Kernel Manager. In both cases, there are mkdev scripts that will do this. In fact, these scripts are called by sysadmsh and the Hardware/Kernel Manager. Table 0.3 shows you which script is run for each filesystem.


Filesystem                       mkdev script
DOS                              dos
DTFS                             dtfs
HTFS                             htfs
High Sierra/ISO9660/Rockridge    high-sierra
XENIX                            xenix

Table 0.3 mkdev script associated with each filesystem

Files

From the filesystem's standpoint, there is no difference between a directory and a file. Both take up an inode, use data blocks and have certain attributes. It is the commands and programs that we use that impose structure on the directory. For example, /bin/ls imposes structure when we do listings of directories.

Keep in mind that it is the ls command that puts the file names "in order." Within the directory, the file names do not appear in any order. Initially, files appear in the directory in chronological order. As files are created and removed, the slots taken up by older files are replaced by newer ones and even this order disappears.

Other commands, such as /bin/cat or /bin/hd, allow us to see the directories as files, without any structure. Note that in OpenServer these commands don't let you see the structure, which I see as a loss in functionality.

When you do a long listing of a file, (ls -l or l) you can learn a lot about the characteristics of a file and directory. For example, if we do a long listing of /bin/date, we see:

-rwx--x--x 1 bin bin 17236 Dec 14 1991 /bin/date

We can see the type of file we have (the '-' in the first position says it's a regular file), the access permissions (rwx--x--x), how many links it has (1; we'll talk more about these in a moment), the owner and group (bin/bin), the size (17236), the date it was last written to (Dec 14, 1991, and maybe the time as well, if it is a newer file), and the name of the file (/bin/date). For additional details on this, see the ls(C) man-page.
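The fields just described can be picked apart mechanically. The following is a minimal Python sketch, assuming the nine-column layout shown in the listing above. The function name parse_long_listing and the FILE_TYPES table are my own and are not part of any SCO tool. The last line uses Python's standard stat.filemode to show how a permission string like this one is produced from the numeric mode.

```python
import stat

# First character of the mode field and what it conventionally means.
FILE_TYPES = {
    "-": "regular file",
    "d": "directory",
    "l": "symbolic link",
    "c": "character device",
    "b": "block device",
    "p": "named pipe",
}


def parse_long_listing(line):
    """Split one line of long-listing output into named fields."""
    # The name comes last and may contain spaces, so limit the splits.
    mode, links, owner, group, size, month, day, year_or_time, name = \
        line.split(None, 8)
    return {
        "type": FILE_TYPES.get(mode[0], "unknown"),
        "permissions": mode[1:],
        "links": int(links),
        "owner": owner,
        "group": group,
        "size": int(size),
        "written": " ".join((month, day, year_or_time)),
        "name": name,
    }


entry = parse_long_listing("-rwx--x--x 1 bin bin 17236 Dec 14 1991 /bin/date")

# A regular file with permissions 711 renders as the same string.
regular_711 = stat.filemode(stat.S_IFREG | 0o711)
```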

Unlike operating systems like DOS, most of this information is not stored in the directory. In fact, the only information we see here that is actually stored in the directory is the file's name. If not in the directory, where is this other information kept, and how do you figure out where on the hard disk the data is?

As I mentioned before, this is all stored in the inode. All the inodes on each filesystem are stored at the beginning of that filesystem in the inode table. The inode table is simply a set of these inode structures. If you want, you can see what the structure looks like by taking a peek at <sys/ino.h>.

To access the information in the inode, you need the inode number. Each directory entry consists of an inode number and file name pair. On ODT 3.0 and earlier, the first two bytes of each entry were the inode number. Since a byte can hold 256 values, two bytes can hold 256*256, or 65536, different values, so the highest possible inode number was 65535. The inode number simply points to a particular entry in the inode table. This is the only connection there is between a filename and its inode, and therefore the only connection between the filename and the data on the hard disk.

Because this is only a pointer and there is no physical connection, there is nothing preventing you from having multiple entries in a directory pointing to the same file. These would have different names, but have the same inode number and therefore point to the same physical data on the hard disk. Having multiple file names on your system point to the same data on the hard disk, is not a sign of filesystem corruption! This is actually something that is done on purpose.

For example, if you do a long listing of /bin/ls (l /bin/ls) you see:

-r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls

Here the number of links (column 2) is 6. That means there are five other files on the system with the same inode number as /bin/ls. In fact that's all a link is: a file with the same inode on the same filesystem. (More on that later.) To find out what inode that is, let's add the -i option to give us:

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls

From this we see that /bin/ls occupies entry 167 in the inode table. There are three ways of finding out what other files have this inode number:

find / -inum 167 -print

ncheck -i 167 /dev/root (assuming /bin/ls is on the root filesystem)

l -iR / | grep '167'

Since I know they are all in the /bin directory, I'll try the last one. This gives me:

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 l

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lc

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lf

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lr

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 ls

167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lx

Interesting. This is the entire family of ls commands. All of these lines look identical, with the exception of the file name. There are six lines, which matches the number of links. Each has an inode of 167, so we know that all six names have the same inode and therefore point to the same location on the hard disk. That means that whenever you execute any one of these commands, the same program is started. The only difference is the behavior, and that is based on which name you actually start on the command line. Since the program knows what name it was started with, the program can change its behavior accordingly.
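How a single program can behave differently depending on the name it was started with is easy to sketch. The Python fragment below is only an illustration of the technique, not the way the SCO ls family is actually written. The PERSONALITIES table is hypothetical and lists only the two behaviors this text itself mentions (l as a long listing, lf as a listing that marks file types).

```python
import os
import sys

# Hypothetical table: which behavior goes with which name.
PERSONALITIES = {
    "l": "long listing",                # the text equates `l` with `ls -l`
    "lf": "listing with type markers",  # the text says lf marks symlinks with @
}


def personality(argv0):
    """Choose a behavior from the name the program was started with."""
    name = os.path.basename(argv0)
    return PERSONALITIES.get(name, "default listing")


if __name__ == "__main__":
    # sys.argv[0] holds the name used on the command line, so one file
    # with several links can tell which link was used to start it.
    print(personality(sys.argv[0]))
```

If you saved this under one name and made hard links to it under the others, every name would start the same code, and only the lookup on the invoked name would differ.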

There is nothing special about the fact that these are all in the same directory. A name must only be unique within a single directory. You can therefore have two files with the same basename in two separate directories, for example /bin/mail and /usr/bin/mail. If you take a look, these not only have the same inode number (and are therefore the same file), there are actually three links, the third being /usr/bin/mailx. So, here we have two files in the same directory (/usr/bin/mailx and /usr/bin/mail) as well as two files with the same basename (/bin/mail and /usr/bin/mail), all of which have the same inode and are, therefore, all the same file.
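You can reproduce this situation yourself without touching any system files. The following Python sketch builds a throwaway directory tree, creates one file and gives it two more names with os.link, then checks the inode number and link count. The directory and file names only mimic the mail example and are otherwise arbitrary. It assumes a UNIX-like system whose temporary directory sits on a filesystem that supports hard links.

```python
import os
import tempfile


def link_demo():
    """Create one file with three names and report inode and link count."""
    with tempfile.TemporaryDirectory() as top:
        bin_dir = os.path.join(top, "bin")
        usr_bin = os.path.join(top, "usr_bin")
        os.mkdir(bin_dir)
        os.mkdir(usr_bin)

        original = os.path.join(bin_dir, "mail")
        with open(original, "w") as handle:
            handle.write("pretend this is a program\n")

        # Two more directory entries for the same inode.
        same_basename = os.path.join(usr_bin, "mail")
        third_name = os.path.join(usr_bin, "mailx")
        os.link(original, same_basename)
        os.link(original, third_name)

        stats = [os.stat(p) for p in (original, same_basename, third_name)]
        return {
            "inodes": {s.st_ino for s in stats},  # one element if all match
            "links": stats[0].st_nlink,
        }


if __name__ == "__main__":
    print(link_demo())
```

All three names report the same inode number, and the link count is 3, just as the long listing would show in column 2.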

The key issue here is that all three of these files exist on the same filesystem, /dev/root. As I mentioned before, there may be files on other filesystems that have the same inode number. This is the reason why you cannot create a link between files on two different filesystems. With a little manipulation, you might be able to force two files with identical content to have the same inode number on two filesystems. However, these are not links (just two files with the same name and same content).

The problem is that it may be necessary to create links across filesystems. One reason is that you might want to create a path with a much shorter name that is easier to remember. Or perhaps you have several remote filesystems, accessed through NFS, and you want to create a common structure on multiple machines, all of which point to the same file. Therefore you need a mechanism that allows links across filesystems (even remote filesystems). This is the concept of a soft or symbolic link.

Symbolic links were first introduced to SCO UNIX in release 3.2.4.0. In SCO OpenServer they are (perhaps) the primary means of referring to installed files. For more information on referencing installed files, see the section on Software Storage Objects. Unlike hard links, symbolic links take up data blocks on the hard disk and therefore have a unique inode number. However, they only need one data block, as the contents of that block are the path to the file you are referring to. Note that if the name is short enough, the symbolic link may be stored directly in the inode. See Table 0.4 for details on filesystem characteristics.


Figure 0-2 Symbolic link

For example, if I had a file on a /u filesystem named /u/data/albert.letter, I could create a symbolic link to it as /usr/jimmo/letters/albert.letter (no, it doesn't have to have the same name). The one data block assigned to the symbolic link /usr/jimmo/letters/albert.letter contains /u/data/albert.letter. Whenever I access the file /usr/jimmo/letters/albert.letter, the system (the filesystem driver) knows that this is a symbolic link. The system then reads the full path out of the data block and accesses the "real" file. Since the data block contains only a path, you could have a filesystem mounted via NFS where the data is stored on a remote machine. Whatever you are using to access that file (e.g. an application, a system utility) cannot tell the difference.

For example, I might have a file in my own bin directory that points to a nifty utility on my friend's machine. I have a filesystem from my friend's machine mounted on my /usr/data directory. I could create a symbolic link like this:

ln -s /usr/data/nifty /usr/jimmo/bin/nifty

I would therefore have a symbolic link, /usr/jimmo/bin/nifty that looked like this:

lrwxrwxrwx 1 root other 15 May 03 00:12 /usr/jimmo/bin/nifty -> /usr/data/nifty

There are two signs that this is a symbolic link. First, the first character of the permissions field is an 'l'. Second, the name of the file looks different than we are used to: we see the name of the file that we use (/usr/jimmo/bin/nifty) followed by a (sort-of) arrow that points to the "real" file (/usr/data/nifty). Note that there is nothing here that tells us that a remote filesystem is mounted onto /usr/data. The conversion is accomplished by the filesystem driver when the actual file is accessed.

If you were to use just the ls command, then you would not see either the type of file (l) or the ->, so there is no way to know that this is a symbolic link. If you use lf, then the file is followed by an at-sign (@), which tells you that the file is a symbolic link.

Keep in mind that when the system determines that you are trying to access a symbolic link, the system then goes out and tries to access the "real" file and behaves accordingly. Therefore, symbolic links can also point to directories, or any other kinds of files, including other symbolic links.

Be careful when making a symbolic link. When you do, the system does not check to see that the source file exists. It is therefore possible to have a symbolic link point back to itself or point to nothing. Although most system utilities and commands can catch things like this, do not rely on it. Besides, what's the point of having a dog chasing its own tail? It is also advisable not to use relative paths in symbolic links. This may have unexpected results when accessing the links from elsewhere on your system.
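Both points, that a symbolic link stores nothing but a path and that nobody checks whether the path leads anywhere, can be seen in a few lines. This is a minimal Python sketch that works in a temporary directory on any UNIX-like system. The file names are made up for the illustration.

```python
import os
import tempfile


def symlink_demo():
    """Create a good link and a dangling one, then inspect both."""
    with tempfile.TemporaryDirectory() as top:
        target = os.path.join(top, "albert.letter")
        with open(target, "w") as handle:
            handle.write("Dear Albert\n")

        link = os.path.join(top, "letter.link")
        os.symlink(target, link)

        # Nothing stops us from pointing a link at a file that isn't there.
        dangling = os.path.join(top, "dangling")
        os.symlink(os.path.join(top, "no-such-file"), dangling)

        return {
            "target": target,
            "stored_path": os.readlink(link),       # the link's contents
            "link_size": os.lstat(link).st_size,    # typically the path length
            "path_length": len(os.fsencode(target)),
            "own_inode": os.lstat(link).st_ino != os.stat(target).st_ino,
            "dangling_is_link": os.path.islink(dangling),
            "dangling_resolves": os.path.exists(dangling),
        }


if __name__ == "__main__":
    for key, value in symlink_demo().items():
        print(key, value)
```

On most filesystems the size reported for the link equals the length of the stored path, which is why the long listing of the nifty link earlier shows a size of 15. The dangling link exists as a link, but following it fails.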

Let's go back to the actual structure of the directory entries for a minute. Remember that directories are simply files that have a structure imposed on them by something. If the command or utility imposes the (correct) structure, then each directory entry takes the form of 2 bytes for the inode and 14 bytes for the file name itself.
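Imposing that structure on the raw bytes of a directory takes very little code. The Python sketch below unpacks 16-byte entries from a block of bytes built in memory. It assumes the inode number is stored low byte first (as you would expect on Intel hardware) and that short names are padded with NUL bytes. Both are assumptions on my part, and the helper names are hypothetical.

```python
import struct

ENTRY_SIZE = 16   # 2 bytes of inode number + 14 bytes of name
ENTRY_FORMAT = "<H14s"  # assumed little-endian inode number


def make_entry(inode, name):
    """Pack one directory entry; struct pads the name with NUL bytes."""
    return struct.pack(ENTRY_FORMAT, inode, name.encode("ascii"))


def parse_directory(raw):
    """Return (inode, name) pairs for every 16-byte slot in raw."""
    entries = []
    for offset in range(0, len(raw) - ENTRY_SIZE + 1, ENTRY_SIZE):
        inode, name = struct.unpack_from(ENTRY_FORMAT, raw, offset)
        entries.append((inode, name.rstrip(b"\0").decode("ascii", "replace")))
    return entries


# A made-up three-entry directory.
raw_directory = make_entry(2, ".") + make_entry(2, "..") + make_entry(167, "ls")

if __name__ == "__main__":
    for inode, name in parse_directory(raw_directory):
        print(inode, name)
```

This is essentially what a program like ls has to do with the bytes it reads from a directory before it can sort and print the names.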

Another change to the system that came in SCO UNIX 3.2.4.0 was the introduction of long filenames. Up to this point, file names were limited to 14 characters. With two bytes for the inode, 64 of these 16-byte structures fit exactly into a disk block. However, having only 14 bytes for the name often made giving files meaningful names difficult. I don't know how many times I personally spent time trying to remove vowels and otherwise squish the name together so that it fit in 14 characters. The default filesystem on SCO UNIX 3.2.4.0 changed all that.

One thing I liked about having 16 bytes was that a directory entry fit nicely into the default output of hd. That way you could easily see the internal structure of the directory. I don't know how many times I used hd when talking with customers with filesystem problems. However, the hd included in the initial release of OpenServer won't let you do this. In my opinion, removing that very useful functionality broke hd.

Up to 3.2.4.0, SCO UNIX used the Acer Fast Filesystem (AFS), which had some advantages over the standard UNIX (S51K) filesystem. However, neither can handle symbolic links or long file names. The Extended Acer Fast Filesystem (EAFS) changed that. Since the directory entries of the AFS were 16 bytes long, long file names have to "spill over" into subsequent entries in the directory. Since a file only has one inode, file names longer than 14 characters extend into consecutive entries in the directory. Since they take up multiple slots, all but the last entry have an inode number of 0xffff. This indicates that the file name continues in the next slot. Even with long file names, file names on an EAFS are limited to 255 characters.

When files are removed, the inode entry in the directory is changed to 0. Do an hd of the directory (if you're running ODT) and you still see the file name, but the inode is 0. When a new file is created, the file name takes up a slot used by an older, previously removed file if the name can fit. Otherwise it must take a new slot. Since long names need to be in consecutive slots, they may not be able to take up empty slots. If so, new entries may need to be created for longer file names.

When you create a file, the system looks in the directory for the first available slot. If this is an EAFS, then it is possible that the file you want to create might not fit in the first slot. Remember that each slot is 16 bytes long. Two for the inode number and 14 for the file name. If, for example, slots 16 and 18 are filled, and slot 17 is free, a file name that is longer than 14 characters cannot fit there. This is because the directory entries must be contiguous.

The system must therefore either find a run of free slots large enough or create new slots at the end of the directory. For example, if slots 14 and 18 were taken, but slots 15-17 were free, any file name less than 42 characters (14*3) would fit. Anything larger would need to go somewhere else.
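The search for a place to put a long name can be modelled in a few lines of Python. This is only a toy model of the rule described above (14 name bytes per slot, slots must be contiguous, otherwise append at the end) and is not taken from the EAFS code. In particular, how a name of exactly 42 characters is treated is an assumption here, so the examples stay clear of that boundary.

```python
NAME_BYTES_PER_SLOT = 14


def slots_needed(name):
    """How many 16-byte slots a name occupies (at least one)."""
    # Integer ceiling division; exact-multiple handling is an assumption.
    return max(1, -(-len(name) // NAME_BYTES_PER_SLOT))


def find_first_fit(free, needed):
    """Index of the first run of `needed` contiguous free slots.

    `free` is a list of booleans, True meaning the slot is free.
    If no run is long enough, return len(free): append at the end.
    """
    run_start = 0
    run_length = 0
    for index, is_free in enumerate(free):
        if is_free:
            if run_length == 0:
                run_start = index
            run_length += 1
            if run_length == needed:
                return run_start
        else:
            run_length = 0
    return len(free)


# Slots 0-18 exist; only slot 17 is free (16 and 18 are taken).
one_gap = [False] * 19
one_gap[17] = True

# Slots 15-17 are free (14 and 18 are taken).
three_gap = [False] * 19
for slot in (15, 16, 17):
    three_gap[slot] = True
```

With one_gap, a 15-character name needs two slots and ends up appended at index 19, while a short name drops into slot 17. With three_gap, a 41-character name fits at slot 15.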

If you were to count up all the bytes in the inode structure in ino.h, you'd find that each inode is 64 bytes. This means that there are 16 per 1024-byte disk block (16*64=1024). In order to keep from wasting space, the system will always create filesystems with the number of inodes being a multiple of 16.

Inode 1 is always at the start of the 3rd block of the filesystem (bytes 2048-2111) and is reserved (not used). Inode 2 is always the inode of the root directory of any filesystem. You can see this for the root filesystem by doing ls -id /. (The -d is necessary so you only see the directory and not the contents.)
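Given those two facts (64 bytes per inode, inode 1 starting at byte 2048), you can work out where any inode sits. Here is a small Python sketch of the arithmetic. It assumes the inodes simply follow one another in the table, which is what the layout described here implies, and the function name is my own.

```python
BLOCK_SIZE = 1024
INODE_SIZE = 64
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 16
INODE_TABLE_START = 2 * BLOCK_SIZE            # start of the 3rd block


def inode_byte_range(number):
    """First and last byte offset of an inode within the filesystem."""
    if number < 1:
        raise ValueError("inode numbers start at 1")
    first = INODE_TABLE_START + (number - 1) * INODE_SIZE
    return first, first + INODE_SIZE - 1


if __name__ == "__main__":
    for number in (1, 2, 167):
        print(number, inode_byte_range(number))
```

Inode 1 comes out at bytes 2048-2111 as stated above, and the root directory's inode 2 follows directly at 2112-2175.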

The total number of inodes on an AFS or EAFS is defined when the filesystem is created by mkfs(ADM). Normally, you use the divvy command (by hand or through SCOAdmin in OpenServer) to create filesystems. The divvy command will then call mkfs to create the filesystem for you. The number of inodes created is based on an average file size of 4K. If you have a system that has many smaller files, such as a mail or news server, you could run out of inodes and still have lots of room on your system.
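To get a feel for the numbers, here is a rough Python estimate of how many inodes you get from the 4K rule. It is a back-of-the-envelope sketch, not the actual mkfs calculation: rounding up to the next multiple of 16 is my assumption (the text only says the count is a multiple of 16), and the function name is hypothetical.

```python
def default_inode_count(fs_bytes, avg_file_size=4096, inodes_per_block=16):
    """Estimate the inode count for a filesystem of fs_bytes bytes."""
    estimate = fs_bytes // avg_file_size
    # Round up so the inode table fills whole blocks (assumed direction).
    blocks = (estimate + inodes_per_block - 1) // inodes_per_block
    return blocks * inodes_per_block


HUNDRED_MB = 100 * 1024 * 1024

if __name__ == "__main__":
    print(default_inode_count(HUNDRED_MB))
    print(default_inode_count(HUNDRED_MB, avg_file_size=1024))
```

For a 100Mb filesystem the 4K rule gives 25600 inodes. If your mail or news files average closer to 1K, you would need about four times as many to fill the same space.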

Therefore, if you have a news or mail server, it is a good idea to use mkfs by hand to create the filesystem before you add any files. Remember that the inode table is at the beginning of the filesystem and takes up as much room as it needs for a given number of inodes. If you later want more inodes, you must have a larger inode table. The only place for the inode table to grow is into your data, so you would end up overwriting data. Besides, running mkfs "zeroes" out your inode table, so the pointers to the data are lost anyway.

Among other things that the inode keeps track of are file types and permissions, number of links, owner and group, size of the file and when it was last modified. In the inode is where you will find thirteen pointers (or triplets) to the actual data on the hard disk.

Note that these triplets are pointers to the data and not the data itself. Each one of the thirteen pointers to the data is a block address on the hard disk. For the following discussion, please refer to Figure 0-3.

Each of these blocks is 1024 bytes (1k), therefore the maximum file size on an SCO UNIX system is 13Kb. Wait a minute! That doesn't sound right, does it? In fact it isn't. If (and that's a big if) all of the triplets pointed to data blocks, then you could only have a file up to 13Kb. However, there are dozens of files in the /bin directory alone that are larger than 13Kb. How's that?

The answer is that only the first ten of these triplets point to actual data. These are referred to as direct data blocks. The 11th triplet points to a block on the hard disk which actually contains the real pointers to the data. The blocks those pointers lead to are the indirect data blocks. The pointers are 4-byte values, so there are 256 of them in each block. In Figure 0-3, the 11th triplet contains a pointer to block 567. Block 567 contains 256 pointers to indirect data blocks. One of these pointers points to block 33453, which contains the actual data. Block 33453 is an indirect data block.

Since the data blocks pointed to by the 256 pointers in block 567 each contain 1K of data, there is an additional 256K of data. So, with the 10K for the direct data blocks and the 256K for the indirect data blocks, we now have a maximum file size of 266K.

Hmmm. Still not good. Although there aren't that many, there are files on your system larger than 266K. A good example is /unix. So, that brings us to triplet 12. This points not to data blocks, not to a block of pointers to data blocks, but to blocks that point to blocks that point to data blocks. These are the doubly-indirect data blocks.

In Figure 0-3 the 12th triplet contains a pointer to block 5601. Block 5601 contains pointers to other blocks, one of which is block 5151. However, block 5151 does not contain data, but more pointers. One of these points to block 56732. It is block 56732 that finally contains the data.

We have a block of 256 entries that each point to a block, each of which contains 256 pointers to 1024-byte data blocks. This gives us 64Mb, just for the doubly-indirect data blocks. At this point, the additional size gained by the single-indirect and direct data blocks is negligible. Therefore, let's just say we can access over 64Mb. Now, that's much better. You would be hard pressed to find a system with files larger than 64Mb (unless we are talking about large database applications). However, we're not through yet. We have one triplet left.

So as not to bore too many of you, let's do the math quickly. The last triplet points to a block containing 256 pointers to other blocks, each of which points to 256 other blocks. At this point, we already have 65536 blocks. Each of these 65536 blocks contains 256 pointers to the actual data blocks. Here we have 16777216 pointers to data blocks, which gives us a grand total of 17179869184 bytes, or 16Gb, of data (plus the insignificant 64Mb we get from the doubly indirect data blocks). Oh, as you might have guessed, these are the triply indirect data blocks.


Figure 0-3 Inodes Pointing to Disk blocks

In Figure 0-3 triplet 13 contains a pointer to block 43. That block contains 256 pointers, one of which points to block 1979. Block 1979 also contains 256 pointers, one of which points to block 988. Block 988 also contains 256 pointers. However, these pointers point to the actual data, for example block 911.

If you are running an ODT 3.0 (or earlier) system, 16Gb is not your actual size limit. This is the theoretical limit placed on you by the number of triply indirect data blocks. Since you need to keep track of the size of the file, and this is stored in the inode table as a signed long integer (31 bits for the value itself), the actual limit is 2Gb.
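If you want to check the math yourself, a few lines of shell will do it. This is only a sketch: the variable names are mine, and it assumes a shell whose $(( )) arithmetic is at least 64 bits wide (most current shells, probably not the ones shipped with ODT), since the triply indirect total does not fit in 32 bits.

```shell
#!/bin/sh
# Sketch: how much data each level of block pointers in the inode can address.
# Assumes 1024-byte blocks and 256 pointers per indirect block, as in the text.
BLOCK=1024      # bytes per data block
PTRS=256        # pointers held in one indirect block

direct=$(( 10 * BLOCK ))                   # triplets 1 through 10
single=$(( PTRS * BLOCK ))                 # triplet 11
double=$(( PTRS * PTRS * BLOCK ))          # triplet 12
triple=$(( PTRS * PTRS * PTRS * BLOCK ))   # triplet 13
size_limit=2147483647                      # largest value a signed 32-bit size can hold

echo "direct blocks:      $direct bytes"
echo "singly indirect:    $single bytes"
echo "doubly indirect:    $double bytes"
echo "triply indirect:    $triple bytes"
echo "size field allows:  $size_limit bytes"
```

The last line is the one that matters on ODT 3.0: the pointers could address far more than the size field can describe.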

As I mentioned a moment ago, when a file is removed all that is done is that the inode number in its directory entry is set to 0. However, the slot remains. In most cases this is not a problem. However, when mail gets backed up, for example, there can be thousands of files in the mail spool directories. Each one of these requires a slot within the directory. As a result, the directory files can grow to amazing sizes. I have seen directories where the size of the directory file was over 300,000 bytes. This equates to about 20,000 files.

This brings up a couple of interesting issues. Remember that there are 10 direct data blocks for 10Kb, then one singly-indirect block for 256Kb, for a total of 266Kb for both the direct and singly-indirect data blocks. If the directory file is exceptionally large and the file you are looking for happens to be at the very end of it, the system must first read all 10 direct data blocks, then read the block of pointers referenced by the 11th entry in the inode, then read all of the data blocks it points to. It then follows the 12th entry in the inode to the block that points to the blocks of pointers, reads those blocks of pointers, and finally reads the actual data blocks for the remainder of the directory file. Since a copy of the inode is kept in memory, there is no need to go back out to the disk for the inode itself.

On the other hand, remember that the singly-indirect block holds 256 pointers. That block has to be read, then each of the blocks it points to has to be read to check whether your file is there. Then you need to read the blocks that point to the blocks that point to where the rest of your directory is. Only then do you find out that you mis-typed your file name and you have to do it all over again.

Since the system can usually get them all in one read, it is best to keep the number of files in a directory at 638 or fewer. 638? Sure. Each block can hold 64 entries. There are 10 direct data blocks, so together they can hold 640 entries. Each directory always contains the entries . and .., therefore you can only have 638 additional entries.
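The same numbers fall out of a little shell arithmetic. Treat this as a sketch with names of my own choosing; it assumes the 1024-byte blocks and 16-byte directory entries used throughout this section.

```shell
#!/bin/sh
# Sketch: directory slots per block and the 638-file rule of thumb.
BLOCK=1024      # bytes per data block
ENTRY=16        # bytes per directory entry
DIRECT=10       # direct data blocks referenced from the inode

per_block=$(( BLOCK / ENTRY ))            # slots in one block
direct_slots=$(( per_block * DIRECT ))    # slots reachable without indirection
usable=$(( direct_slots - 2 ))            # minus the . and .. entries

big_dir=300000                            # the oversized directory file mentioned earlier
big_slots=$(( big_dir / ENTRY ))          # slots in a directory file of that size

echo "slots per block:           $per_block"
echo "slots in direct blocks:    $direct_slots"
echo "files before indirection:  $usable"
echo "slots in a $big_dir byte directory: $big_slots"
```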

The next interesting thing is what happens when you run fsck on your system. If the filesystem is clean, there won't be a problem. What happens if you have a system crash and your filesystem becomes corrupted? If during the check, fsck finds files that are pointed to by inodes, but does not find any reference to them in a directory, it will place them in the /lost+found directory. When each filesystem is created, the system automagically creates 62 files in there and then removes them. This leaves 62 empty directory slots. These 62 slots plus . and .. give you 64 total entries, times 16 bytes = 1024 bytes or one data block.

The reason for reserving slots in the lost+found directory is that you don't want the system to be writing anything new to a filesystem that you are trying to clean. It is safe enough to be filling in existing directory entries, but you don't want the system to be allocating anything new while trying to clean the filesystem. This is what would happen if you had more than 62 "lost" files.

If you have a trashed filesystem and there are more than 62 lost files, they really become lost. The system cannot handle the additional files and has to remove them. Therefore, I think it is a good idea to create additional entries and then remove them whenever creating a new filesystem. This way you are prepared for the worst. A script to do this would be:

cd /lost+found
for i in a b c d e f g h i j
do
    for j in a b c d e f g h i j
    do
        for k in a b c d e f g h i j
        do
            touch $i$j$k
        done
    done
done
rm *

This script creates 1000 files and then removes them. This takes up about 16K for the directory file; however, it allows 1000 files to become "lost", which may be a job-saver in the future. Make sure that the rm is done after all the files are created, otherwise you end up creating a file, removing it, then filling the slot with some other file. The result is that you have fewer slots than you expected.

If you look in /usr/lib/mkdev/fs (what is actually run when you run mkdev fs), you see that the system does something like this for you every time you add a filesystem. Just after you see the message:

Reserving slots in lost+found directory ...

the mkdev fs script does something very similar. The key difference is that mkdev fs only creates 62 entries. If you wanted to create 1000 entries every time you ran mkdev fs, you could change that part of mkdev fs to look like the above script.

Something that I always found interesting was that /bin/cp, /bin/ln and /bin/mv are all the same binary. That is, they are all links to each other. When you link a file, all that needs to get done is to create a new directory entry, fill it in with the correct inode and then increase the link count in the inode table. Copying a file also creates a new directory entry, but it must also write the new data blocks to the disk.

When you move a file, something interesting happens. First, the system creates a link to the original file. It then removes the old file name by unlinking it. This simply clears the directory entry by setting the inode to 0. However, once the system creates the link, for a brief instant there are two names for the same file on your system.

When you remove a file, things get a little more complicated. We need to not only remove the directory entry referencing the file name, but we also need to decrease the link count in the inode table. If the link count at this point is greater than 0, then the system knows that there is another file on the system pointing to the same data blocks. However, if the link count reaches 0, we then know that there are no more directory entries pointing to that same file. The system must then free those data blocks and make them available for other files.
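You can watch this happen from the shell. The following is a sketch rather than anything SCO-specific: the helper functions and the scratch directory are my own invention, and it relies only on ls -i printing the inode number first and ls -l printing the link count in the second column.

```shell
#!/bin/sh
# Sketch: observe inode numbers and link counts while linking, moving and removing.
inode_of() { ls -i "$1" | awk '{ print $1 }'; }   # inode number of a file
links_of() { ls -l "$1" | awk '{ print $2 }'; }   # link count of a file

demo=${TMPDIR:-/tmp}/linkdemo.$$                  # hypothetical scratch directory
mkdir "$demo" || exit 1
echo "some data" > "$demo/letter"

ino_before=$(inode_of "$demo/letter")
ln "$demo/letter" "$demo/copy"                    # new directory entry, same inode
ino_link=$(inode_of "$demo/copy")
links_after_ln=$(links_of "$demo/letter")

mv "$demo/letter" "$demo/memo"                    # new name created, old name unlinked
ino_after_mv=$(inode_of "$demo/memo")

rm "$demo/copy"                                   # link count drops, data blocks stay
links_after_rm=$(links_of "$demo/memo")

echo "inode before: $ino_before, of link: $ino_link, after mv: $ino_after_mv"
echo "links after ln: $links_after_ln, after rm: $links_after_rm"
rm -r "$demo"
```

All three inode numbers should match, since neither ln nor mv within one filesystem touches the data blocks. Only the final rm -r brings the link count to 0 and frees them.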

Some of you might have realized that special device files (device nodes) do not take up any space on the hard disk. The only place they "exist" is in the directory entry and inode table. You may also have noticed that in the inode structure there is no entry for the major and minor number. However, if you do a long listing of a device node, you will see the major and minor number. Where is this kept?

Well, since you don't have any data blocks to point to, then the 39 bytes used for the data block pointers are unused. This is exactly where the major and minor number are stored. The first byte of the array is the major number and the second byte is the minor number. This is one reason why major and minor numbers cannot be larger than 255.

As with many aspects of the system, the kernel's role in administering and managing filesystems is wide reaching and varied. Among its tasks is the organization of disk space within the filesystem. This function is different, depending on what type of filesystem you are trying to access. For example, if you are copying files to a DOS FAT filesystem, then the kernel has to be aware that there are different cluster sizes depending on the size of the partition. (A cluster is a physical grouping of data blocks.)

If you have an AFS (Acer Fast File System) or EAFS (Extended AFS), then the kernel attempts to keep data in logically contiguous blocks called clusters (on most modern hard disks, this also means physically contiguous). By default, a cluster is 16Kb, but this can be changed when the filesystem is created by using mkfs.

When reading data off the disk, the system can read clusters rather than single blocks. Since files are normally read sequentially, the efficiency of each read is increased. This is because the system can read larger chunks of data and doesn't have to go looking for them. Therefore, the kernel can issue fewer (but larger) disk requests. If you have a hard disk controller that does "track caching" (storing previously read tracks), you improve your read efficiency even more.

However, the number of files may eventually grow to the point where storing data in 16Kb chunks is no longer practical. If there are no more free areas that are at least 16Kb, the system would have to begin moving things around to make a 16Kb block available. This would waste more time than would be gained by maintaining the 16Kb cluster.

Therefore, these chunks will need to be split up. As the filesystem gets fuller, the amount the chunks are split up (called fragmentation) increases. As a result, the system ends up having to move to different places on the disk to find the data. Because of this the kernel ends up sending multiple requests, slowing down the disk reads even further. (Making room for a full cluster is always possible, since you can move data blocks from other files. However, this takes time and is therefore not practical.)

Figure 0-4 Disk fragmentation

The kernel is also responsible for the security of the files themselves. Because SCO UNIX is a multi-user system, it is important to ensure that users only have access to the files that they should have access to. This access is on a per file basis in the form of the permissions set on each file. Based on several discussions we've had so far, we know that these permissions tell us who can read, write or execute our files. It is the kernel that makes this determination. The kernel also imposes the rule that only the owner or the all-powerful root may change the permissions or ownership of a file.

Allocation of disk blocks is dependent upon organization of what is called the freelist. When a file is opened for the first time, its inode is read into the kernel generic inode table. This is a "generic" table as it is valid for all filesystems. Therefore, on subsequent reads and writes this information is already available and the kernel does not have to make an additional disk read to get the inode information. Remember, it is the inode that contains the pointers to the actual data blocks. If this information were not kept in the kernel, every time the file was accessed this information would need to be read from the hard disk.

Keep in mind that if you have a process that is reading or writing to the disk, it is the kernel that does the actual disk access. This is done through the filesystem and hard disk drivers. Every time the kernel does a read of a block device, the kernel first checks the buffer cache to see if the data already exists there. If so, then the kernel has saved itself a disk read. Obviously if it's not there, the kernel must read it from the hard disk.

At first this seems like a waste of time. I mean, checking one place and then checking another. Every single read checks the buffer cache first. So, in many cases, this is wasted time. True. However, the buffer cache is in RAM. This can be several hundred times faster than accessing the hard disk. As a result of the principle of locality, your process (and the kernel as well) will probably be accessing the same data over and over again. Therefore, the existence of the buffer cache is actually a great time saver, since the number of times it finds something in the cache (the hit ratio) is so high.

When writing a file (or parts of a file), the data is first written to the buffer cache. If it remains unchanged for a specific period of time (defined by the BDFLUSHR kernel parameter), the data is then written to the disk. This also saves time because if data is written to the disk, then changed before it is read again, you've wasted a disk write. However, if it stays in the buffer cache forever (or until the file is closed, the process terminates, etc.) then you run the risk of losing data if the system crashes. Therefore, BDFLUSHR is set to a reasonable default of 30 seconds.

As I mentioned a moment ago, when a file is first opened, its inode is read into the kernel's generic inode table (assuming it is not already there). This table is the same no matter what kind of filesystem you have (S51K, AFS, etc.). The structure of this table is defined in <sys/inode.h>. The size of this table is configurable in ODT 3.0 with the kernel parameter NINODE.

The entries in the generic inode table are linked into hash queues. A hash queue is basically a set of linked lists. Which list a particular inode will go into depends on its value. This speeds things up, since the kernel does not have to search the entire inode table, but can immediately jump to the relatively smaller hash queue. The more hash queues there are (defined by the NHINODE kernel parameter) the faster things are read, since each queue has fewer entries. However, the more queues there are, the more space in memory is required and the less room there is for other things. Therefore, you need to weigh one against the other.
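To get a feel for how hash queues spread the load, here is a small sketch. The modulo hash and the function name queue_for are assumptions made purely for illustration; the text does not say which function the kernel actually uses, and the value of NHINODE is just an example.

```shell
#!/bin/sh
# Sketch: spread inode numbers over NHINODE hash queues.
# The modulo hash is an assumption for illustration, not the kernel's real function.
NHINODE=128                       # number of hash queues (example value)

queue_for() {                     # print the queue an inode number would land in
    echo $(( $1 % NHINODE ))
}

for inum in 2 17 130 1979
do
    echo "inode $inum -> hash queue $(queue_for "$inum")"
done
```

With more queues, fewer inodes share each list, so the search within a queue is shorter. That is the trade against memory described above.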

Since there is normally no pattern as to which files are removed from the inode table and when, the free slots in the table are spread throughout the table randomly. Free entries in the generic table are linked onto the freelist so new inodes may be allocated quickly.

One advantage that SCO UNIX provides is the ability to access different kinds of filesystems. Because of this, the kernel must also keep track of filesystem specific information, such as that contained in the inode table. This information is also kept in a kernel internal table, based on the filesystem. The System V dependent inode data structure is defined in <sys/fs/s5inode.h>, and is used by S51K, AFS and EAFS. Other inode tables exist for High-Sierra and DOS. Each time a file is opened an entry is allocated in both the generic and the System V dependent inode table (unless already in memory). The information contained in these inode tables is going to be different, depending on what kind of filesystem you are dealing with.

When a process wants to access a file, it does so using a system call such as open() or write(). When first writing the code for a program, the system calls that programmers normally use are the same no matter what the filesystem type. When the process is running and makes one of these system calls, the kernel maps that system call to operations appropriate for that type of filesystem. This is necessary since the way a file is accessed under DOS, for example, is different than under EAFS. The mapping information is maintained in a table, one per filesystem, and is constructed during a relink from information in /etc/conf/mfsys.d and /etc/conf/sfsys.d. The kernel then accesses the correct entry in the table by using the filesystem type as an index into the fstypesw[ ] array.

Another table used by the kernel to keep track of open files is the file table. This allows many processes to share the same inode information, and is defined in <sys/file.h>. Because it is often the case that multiple processes have the same file open, this saves the kernel time, by not having to look up inode information for each process individually. Once a file is open and is in the file table, the kernel does not have to re-read the inode table.

Figure 0-5 Translation from file descriptions to data

By the time the kernel actually has the inode number of a file that you are working with, it has gone through three different reference points. First, there is the uarea of your process that has the translation from your personal file descriptors to the entry in the file table. Next, the file table has the references that point the kernel to the appropriate slot in its generic inode table. Last, the generic inode table has the pointers to the filesystem specific inode table. At first, this may seem like a lot of work. However, keep in mind that this is all in RAM. Without this mechanism, the kernel would have to go back and forth to the disk all the time.

The open() system call is implemented internally as a call to the namei() function. This is the name-to-inode conversion. Namei() sets up both the generic inode table entry and filesystem dependent inode table entry. It returns a pointer to the generic table entry. Namei() then calls another function, the routine falloc(), which sets up an entry in the file table to point to the inode in the generic table.

The kernel then calls the ufalloc() routine, which sets up a file pointer in the process's uarea to point to the file table entry set up by falloc(). Finally, the return value of open() is the index into the file pointer array, known as the file descriptor.

The function of namei() is a bit more complicated than just converting a filename to an inode number. This seems like a simple thing to say, but in practice, there is a lot more to it. Namei() converts the filenames to inodes (not to inode numbers). Obviously it must first get the inode number, but that is a relatively easy chore, since it is contained within the directory entry of the file.

In order to find out what inode table to read, namei() needs to know on which filesystem a file resides. Simply reading the inode number from the directory entry is not enough. As we talked about before, two completely different files can have the same inode number provided they are on different filesystems. Therefore, even though namei() has the inode number, it still does not know which inode table to read.

In order to find the filesystem, namei() needs to have a complete pathname to the file. A UNIX pathname consists of zero or more directories, separated with '/' and terminated by the filename. The total path length cannot be more than 1024 characters. Assuming there is no directory name mentioned when the file is opened (or only a relative path), namei() has to backtrack a little to get back up to the top of the directory tree.

If not already in memory, the inode corresponding to the first directory in the pathname is read into memory. The directory file is read into memory and the inode/filename pairs are searched for the next directory component. The next directory is read in and the process continues until the actual file is reached. We now have the inode of the file.

With relative paths or no paths at all, we have to backtrack. That is, in order to find the root directory of the filesystem we are on, we have to find the parent directory of our file, then its parent and so on until we reach the root.

Looking at this, we see the pathname to inode conversion is time consuming. Each time a new directory is read, there must be a read of the hard disk. In order to speed things up, SCO UNIX caches the directories. The size of the cache is set by the s5cachent kernel tunable parameter and the entries are defined in <sys/fs/s5inode.h>. Whenever the kernel searches for a component of the file name, it checks the correct hash queue. In ODT 3.0 the s5cachent structures can't hold more than 14 characters. Therefore, for the long file names possible with EAFS, the kernel must go directly to the disk. Cache hits and misses are recorded and can be retrieved and monitored with sar.

In a S51K (Traditional UNIX) filesystem, the superblock contains a list of both free blocks and free inodes. There is room for 50 blocks in the free block list and 100 inodes in the free inode list. The structure of the superblock is found in <sys/fs/s5filsys.h>.

When creating a new file, the system examines the array of free inode numbers in the superblock and the next free inode number is assigned. Since this list only has 100 entries, they will all eventually get used up. If the number of free inodes in the list drops to zero, the list is filled in with another 100 from the disk. If there are ever fewer than 100 free inodes, then the unused entries are set to 0. In S51K filesystems, the list of free data blocks, the freelist, is ordered randomly. As disk blocks are freed, they are just appended to the end of the freelist. During allocation of data blocks, no account is taken of the physical location of the data blocks. This means that there is no pattern to where the files reside on the disk, which can quickly lead to fragmentation. That is, data blocks from one file can be scattered all over the disk.

In AFS and EAFS the freelist is held as a bitmap, where adjacent bits in the map correspond to logically contiguous blocks. Therefore the system can quickly search for sets of bits representing free blocks and then allocate files in contiguous blocks. Logically contiguous blocks (usually physically contiguous blocks) are known as a cluster.

When the filesystem is first created, the bitmap is created by mkfs. There is 1 bit for every data block on the filesystem, so the bitmap is a linear array which says whether a particular block contains valid data or not. Note that this bitmap also occupies disk blocks itself. Actually there is more than one bitmap. There are several which are spaced at intervals of approximately 8192 blocks throughout the filesystem. Since a block contains 1024 bytes, it contains 8192 bits and can therefore map 8192 blocks. There is also an indirect freelist block, which holds a list of the disk block numbers which actually contain the bitmaps.
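The bitmap arithmetic is easy to reproduce. In this sketch the filesystem size is a made-up example and the variable names are mine. It also ignores the blocks the bitmaps themselves occupy, so take the count as approximate.

```shell
#!/bin/sh
# Sketch: how many blocks one bitmap block maps, and how many bitmaps are needed.
BLOCK=1024                          # bytes per block
per_bitmap=$(( BLOCK * 8 ))         # one bit per block, 8 bits per byte
fs_blocks=1000000                   # hypothetical filesystem size in blocks

# Round up: a partly used bitmap still takes a whole block.
bitmaps=$(( (fs_blocks + per_bitmap - 1) / per_bitmap ))

echo "blocks mapped by one bitmap block: $per_bitmap"
echo "bitmap blocks for $fs_blocks blocks: $bitmaps"
```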

When a file is created, the entire cluster is reserved for the file. Although this does tend to waste a little space, it reduces fragmentation and therefore increases speed. When the kernel reads a block, it reads the whole cluster the block belongs to as well as the next. This is called read ahead.

When a disk block is needed for a new file, the system searches the bitmap for the first free block. If we later need more data blocks for an existing file, the system begins its search starting from the block number that was last allocated for that file. This helps to ensure new blocks are close to existing ones. Note that when a cluster is allocated, not all of the disk blocks may be free (some may already be allocated to another file).

The bitmapped freelist of the AFS and EAFS has some performance advantages. First, files are typically located in contiguous disk blocks. These can be allocated quickly from the freelist using i80386 bit manipulation instructions. This means that free areas of the disk can be found in just a few instruction cycles, which speeds up access.

Figure 0-6 The AFS freelist

In addition, the freelist is held in memory. The advantage is that this keeps the system from having to make an additional disk access every time it wants to write new blocks to the hard disk. When the kernel issues an I/O request to read a single disk block, the AFS maps the request so that the entire cluster containing the disk block and the following cluster are read from disk.

At the beginning of each filesystem is a filesystem specific structure called the superblock. You can find out about the structure of the superblock by looking in <sys/fs/*>. The Sys V superblock is located in the second half of the first block of the filesystem (bytes 512-1023). Since the structure is less than 512 bytes, it contains padding to fill out to 512 bytes. When a filesystem is first mounted, its superblock is read into memory so updates to the superblock don't have to constantly write to the disk.

In order for the structures on the disk to remain compatible with the copies in memory, superblocks and inodes are updated by sync which is started at regular intervals by init. The frequency of the sync is defined by SLEEPTIME in /etc/default/boot, with a default of 60 seconds.

New Filesystems

New Concepts

There are several concepts new to OpenServer. The first is intent logging. When this functionality is enabled, filesystem transactions are recorded in a log and then committed to disk. If the system goes down before the transaction is completed, the log is replayed to complete pending transactions. This scheme increases reliability and recovery speed since the system need only read the log to be able to bring the system to the correct state. By using this scheme, the time spent checking the filesystem (and repairing it if necessary) can be reduced to just a few seconds, not the several minutes that was required previously, regardless of the filesystem size. There is, however, a small performance penalty since the system has to spend some time writing to the logs.

As changes are being made to any of the control structures (inodes, superblock), the changes are written to a log. Once complete, the transaction is marked as complete. However, if the system should go down before the log is written, it is as if the transaction was never started. If the log is complete, but the transaction hasn't finished, the transaction can either be completed or ignored, depending on what fsck considers possible. Obviously if the system goes down after the transaction is complete, then nothing needs to be done.

The location of the log file is stored in the superblock. As a real file it does reside somewhere on the file system, however it is invisible to normal user-level utilities and only becomes visible when logging is disabled.


Intent logging does bring up one misconception: it does not make the data itself any safer. Only changes to the control structures are logged. Data is not. The purpose here is to reduce the time it takes to make the system operational again should it go down.

Another new concept is checkpointing. When enabled, the filesystem is marked as "clean" at regular intervals. That is, the pending writes are completed, inodes are updated and, if necessary, the in-core copy of the superblock is written to disk. At this point the filesystem is considered clean. Should the system go down improperly at this point, there is no need to clean the filesystem (using fsck) as it is already clean. However, the data is still cached in the buffer cache, so if it is needed again soon, it is available.

If the system goes down, the contents of the buffer cache are lost, but since they were already written to disk, no data is actually lost. Obviously, anything not written between the last checkpoint and the time the system goes down is lost, but checkpointing does decrease the amount lost as well as speed up the recovery process when the system is rebooted. Again, there is no such thing as a free lunch and checkpointing does mean a small performance loss. Checkpointing is turned on by default on High Throughput Filesystem (HTFS), EAFS, AFS, and S51K filesystems.

For the best reliability and speed of recovery, it's a good idea to have both logging and checkpointing enabled. Although they both cause slight performance degradation, the benefits outweigh the performance hit. In most cases, the performance loss is not noticed, only the time required to bring the system back up is a lot quicker.


The idea of sync-on-close for the Desktop Filesystem (DTFS) is another way of increasing reliability. Whenever a file is closed, it is immediately written to disk, rather than waiting for the system to write it as it normally would (potentially 30 seconds later). If the system should go down improperly, you have a better chance of not losing data. Because you are not writing data to the hard disk in large chunks, sync-on-close also degrades performance.

Because I regularly suffer from digitalus enormus (fat fingers), I am often typing in things that I later regret. On a few occasions, I have entered rm commands with wild cards (*, for example) only to find that I had an extra space before the asterisk. As a result, I end up with a nice clean directory. Since I am not that stupid, I built an alias so that every time I used rm it would prompt me to confirm the removal (rm -i). My brother, on the other hand, created an alias where rm copies the files into a TRASH directory, which he needs to clean out regularly. Both of these solutions can help you recover from accidentally erasing files.

OpenServer has added something whereby you no longer have to create aliases or other things necessary to keep you from erasing things you shouldn't. This is the idea of file versioning. Not only does file versioning protect you from digitalus enormus, but it will also make automatic copies of files for you.

In order for versioning to be used, it must first be configured in the kernel. There are several kernel tunable parameters that are involved. To change them you either run the program /etc/conf/cf.d/configure or click on the "Tune Parameters..." button in the Hardware/Kernel Manager. (The Hardware/Kernel Manager calls configure.) Next select option 10 (Filesystem configuration). Here you will need to set the MAXVDEPTH parameter, which sets the maximum number of versions maintained, and the MINVTIME parameter, which sets the minimum time (in seconds) between changes before a file is versioned. Setting MAXVDEPTH to 0 disables versioning. If MINVTIME is set to 0, and MAXVDEPTH to a non-zero value, then versioning will happen no matter how short the time between versions. Versioning is only available for the DTFS and HTFS.

You can also set versioning for a filesystem by using the maxvdepth and minvtime options when mounting. These can be included in /etc/default/filesys (which defines the default behavior when mounting filesystems), or you can specify them on the command line when mounting the filesystem by hand. In addition to that, versioning can be set on a per-directory basis. This is done by using the undelete command. For example,


undelete -s /usr/jimmo/letters


This command line turns on versioning for all the files in the directory /usr/jimmo/letters as well as any child directories. This includes existing files and directories as well as ones created later. Note that even though the filesystem was not mounted with either the minvtime or maxvdepth options, you can still turn on versioning for individual directories, as long as it is configured in the kernel. Also, using the -v option to undelete you can turn on versioning for single files.

When enabled, versioning is performed without the interaction of the users. If you delete or overwrite a file, you usually don't see anything. You can make the existing versions visible to you by setting the SHOWVERSIONS environment variable to 1, and then exporting it.
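In Bourne-shell syntax, setting and exporting the variable looks like the sketch below. The last line is only there to show that child processes inherit it. On a system without versioning the variable is harmless but has no effect.

```shell
# Set the variable, then export it so that commands started from this shell see it.
SHOWVERSIONS=1
export SHOWVERSIONS

# Any child process now inherits the setting.
sh -c 'echo "SHOWVERSIONS is $SHOWVERSIONS"'
```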

The means of storing versions is quite simple. The names are appended with a semi-colon followed by the version of the file as in:

letter;12

This would be the 12th version of the file letter since versioning was enabled on the filesystem. Keep in mind that this does not mean that there are 12 versions. The number of available versions is defined by the MAXVDEPTH kernel parameter or mount option. If it is 12 or higher, there just might be 12 versions. However, if it is set to a lower value you will see at most MAXVDEPTH versions. Also keep in mind that you are not just maintaining a list of changes, but rather complete copies of each file.

For example, let's assume I mounted a filesystem with the option -o maxvdepth=10. The system will then save, at most, 10 versions. After I edit and save a file for a while, the version number might be up to 12. However, I will not be able to see or have access to versions lower than 3, since they are removed from the system.
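The window of surviving versions is simple subtraction. The sketch below just redoes the example's arithmetic. The variable names are mine, and it assumes version numbers start at 1.

```shell
#!/bin/sh
# Sketch: which version numbers remain once at most maxvdepth versions are kept.
maxvdepth=10      # mount option from the example
newest=12         # highest version number reached so far

oldest=$(( newest - maxvdepth + 1 ))
if [ "$oldest" -lt 1 ]
then
    oldest=1      # fewer than maxvdepth versions exist so far
fi

echo "versions kept: $oldest through $newest"
```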


Different file versions can be accessed not only when making copies of or changes to an existing file, but also when you remove it. Assume you have the three latest versions of a letter (letter;10, letter;11 and letter;12) as well as the current version letter. If you remove letter, the three previous versions still exist. These can be seen by using the -l (list) option to undelete, either by specifying the file explicitly as in:

undelete -l letter

or if you leave off the file name, you will see all versions of all files. To undelete a versioned file or make the previous version the current one, simply leave off the options. If you repeatedly use undelete with just the file name, you can back up and make ever older versions the current one. Or, to make things easier, simply copy the older version to the current one, as in:

cp letter\;8 letter

This will make version 8 the current one. (NOTE: The \ is necessary to remove the special meaning of the semi-colon.)

With the first shipping version of OpenServer, there are some "issues" with versioning in that it does not behave as expected. One of the first things I noticed was that changing the kernel parameters MAXVDEPTH and MINVTIME does not turn on versioning. Instead, they allow versioning to be turned on. Without them, you can't get versioning to work at all. When versioning is enabled, you still need to use undelete -s on the directory.

There is more to it than that. However, I don't want to repeat too much information that's in the manuals. Therefore, take a look at the undelete(C) man-page.


There are other changes that have been made to the system. There is the introduction of a couple new filesystem types as well as adding new features to the old filesystems. Table 0.4 contains an overview of some of the more significant aspects of the filesystems.


Filesystem Type:             Xenix  S51K   AFS    EAFS   HTFS   DTFS

Driver                       xx     ht     ht     ht     ht     dt
Max. fs size                 2Gb    2Gb    2Gb    2Gb    2Gb    2Gb
Max. file size               2Gb    2Gb    2Gb    2Gb    2Gb    2Gb
Max. inodes                  2^16   2^16   2^16   2^16   2^27   2^31
Clustering                   no     no     yes    yes    yes    yes
Long filenames               no     no     no     yes    yes    yes
Symbolic links               no     no     no     yes    yes    yes
Bootable                     yes    yes    yes    yes    no     no

New functionality in OpenServer 5

Symbolic links in inode      no     no     no     no     yes    yes
Intent logging               no     no     yes    yes    yes    no
Fast filesys. check          no     no     yes    yes    yes    no
Lazy block list evaluation   no     yes    yes    yes    yes    no
Temporary fs                 no     no     yes    yes    yes    no
Checkpointing                no     no     yes    yes    yes    yes
Versioning                   no     no     no     no     yes    yes

Table 0.4 Filesystem Characteristics

High Throughput Filesystem

New to OpenServer is the introduction of a new filesystem device driver: ht. This new driver can handle filesystems with 16-bit inodes like S51K, AFS and EAFS, but also the new HTFS which can handle 32-bit inodes. Although (as of this writing) you cannot boot from an HTFS, it does provide some important performance and functionality gains.

One area that was changed is the total amount of information that can be stored on a single HTFS, as well as the total number of inodes that can be used. Table 0.4 contains a comparison of the various filesystem types and just how much data they can access.

Another new feature of the ht driver is lazy block evaluation. Previously, when a process was started with the exec() system call, the system would build a full list of the blocks that made up that program. This delayed the actual start-up of the process, but saved time as the program ran. Since a program spends most of its time executing the same instructions, much of the program is not used. That is, many of the blocks end up never being referenced. What lazy block evaluation does is to build this list of blocks only as they are needed. This speeds up the start-up of the process, at the cost of a small delay whenever a previously unreferenced block is first accessed.
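The difference between the two approaches can be sketched as follows. This is a toy model, not the ht driver's code: the class names are invented, and lookup() is a hypothetical stand-in for whatever work is needed to find where a block of the program lives on disk.

```python
class EagerBlockList:
    """Resolve every block of the program up front, as described for exec()."""

    def __init__(self, lookup, nblocks):
        self.blocks = [lookup(i) for i in range(nblocks)]

    def get(self, index):
        return self.blocks[index]


class LazyBlockList:
    """Resolve a block only the first time it is referenced."""

    def __init__(self, lookup, nblocks):
        self.lookup = lookup
        self.nblocks = nblocks
        self.blocks = {}  # only blocks that have been referenced so far

    def get(self, index):
        if not 0 <= index < self.nblocks:
            raise IndexError(index)
        if index not in self.blocks:
            self.blocks[index] = self.lookup(index)  # small delay, paid once
        return self.blocks[index]


calls = []


def lookup(index):
    """Hypothetical stand-in for mapping a program block to a disk block."""
    calls.append(index)
    return 1000 + index


lazy = LazyBlockList(lookup, nblocks=100)
for index in (0, 1, 0, 1, 7):  # the program keeps reusing the same blocks
    lazy.get(index)
print(len(calls))
```

Only the three blocks that were actually touched get looked up, whereas EagerBlockList would have done all 100 lookups before the program started.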

Another gain is through "transaction based" processing of the filesystem. As changes occur on the filesystem, they are gathered together in what is called an intent log, which we talked about earlier. If the system stops improperly, it can use the intent log to determine how to proceed. Since you only need to check the log in order to clean the filesystem, recovery is quicker and also more reliable.
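To see why checking only the log is quicker, here is a minimal sketch of the general idea. It is a generic model of logging an intention before acting on it and is not the on-disk format or recovery logic that the HTFS actually uses; the IntentLog class and its methods are made up for illustration.

```python
class IntentLog:
    """Toy intent log: note what you are about to do, then mark it done."""

    def __init__(self):
        self.entries = []  # each entry is [description, completed]

    def begin(self, description):
        self.entries.append([description, False])
        return len(self.entries) - 1

    def commit(self, txn):
        self.entries[txn][1] = True

    def unfinished(self):
        """After a crash, only these entries need to be examined."""
        return [desc for desc, done in self.entries if not done]


log = IntentLog()
first = log.begin("create /tmp/a")
log.commit(first)
second = log.begin("extend /tmp/b")
# Imagine the system stops here, before the second change is committed.
print(log.unfinished())
```

Recovery in this model looks at one unfinished entry instead of walking every structure in the filesystem.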

Another mechanism used to increase throughput is to disable checkpointing. This way, the filesystem will spend all of its time processing requests rather than updating the filesystem structures. Although this increases throughput, you obviously have the disadvantage of potentially losing data.

When dealing with aspects of the system like the print spooler or the mail system, where jobs are batch processed, it is less likely that data is being processed at any given moment. Therefore you do not need the extra overhead of checkpointing.

This is done by treating the filesystem as "temporary". Such filesystems are mounted with the -o tmp option. Although checkpointing is new to OpenServer, you can configure both AFS and EAFS filesystems as temporary. Keep in mind that certain applications like vi provide their own recovery mechanism by saving the data at regular intervals. If the files are written by vi, but not written to disk, a system crash could lose the last update.

When I described the directory structure I mentioned that each inode was represented by two bytes. This allows only for 64K worth of inodes. Since the HTFS can access 2^27 inodes and the DTFS can access 2^31, there needs to be some other format used in the directories. With the two new filesystems, the key word is "extensible." This means that the structure can be extended as the requirement changes. This allows much more efficient handling of long file names, as compared to the EAFS. In most cases, the filesystem driver is capable of making the translation for applications that don't understand the new concepts. However, if the application reads or writes the directory directly, you may need a newer version of the application.
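The numbers are easy to check. The short sketch below takes the limits as 2^16, 2^27 and 2^31 inodes and shows why a two-byte field in a directory entry is no longer enough. The helper names are invented for this illustration.

```python
def max_inodes(bits):
    """How many distinct inode numbers fit in the given number of bits."""
    return 2 ** bits


def bytes_needed(bits):
    """Whole bytes required to hold an inode number of that many bits."""
    return (bits + 7) // 8


for name, bits in (("two-byte entry", 16), ("HTFS", 27), ("DTFS", 31)):
    print(name, max_inodes(bits), bytes_needed(bits))
```

A 16-bit field tops out at 65536 inode numbers, while both of the new filesystems need four bytes to hold the largest inode number.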

The two new filesystems, HTFS and DTFS, can save space by storing symbolic links in the inode. If the path of the symbolic link is 108 characters or less, the DTFS will store the path within the inode and not in a data block on the disk. For the HTFS, this limit is 52 characters. First, this saves space, as no data blocks are needed. It also saves time, since once you read the inode from the inode table, you have the path and do not need to access the disk again.


There are two issues to keep in mind. If you use relative paths instead of absolute paths, then you may end up with a shorter path that fits into the inode. This saves time when accessing the link. On the other hand, think back to our discussion on symbolic links. The behavior of each shell when crossing the links is different. If you fail to take this into account, you may end up somewhere other than you expect.
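The length check itself is simple enough to sketch. The limits of 52 and 108 characters come from the text; the function name and the example paths are made up, and the sketch assumes one byte per character.

```python
# Longest symbolic link path that fits in the inode, per filesystem type.
INLINE_SYMLINK_LIMIT = {"HTFS": 52, "DTFS": 108}


def stored_in_inode(fstype, target):
    """True if a link to target would fit in the inode on that filesystem."""
    limit = INLINE_SYMLINK_LIMIT.get(fstype)
    return limit is not None and len(target) <= limit


# Hypothetical link targets: the same file named absolutely and relatively.
absolute = "/u/projects/" + "x" * 45 + "/data"
relative = "../data"

print(stored_in_inode("HTFS", absolute))
print(stored_in_inode("HTFS", relative))
print(stored_in_inode("DTFS", absolute))
```

Here the absolute path is too long for an HTFS inode but fits on a DTFS, while the relative path fits on both.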


Desktop Filesystem

One of the problems that the advances in SCO OpenServer brought with them is the increased amount of hard disk space required to install it. On large servers with several gigabytes of space, this is less of an issue. However, on smaller desktop workstations this can become a significant problem.

Operating systems have been dealing with this issue for years. MS-DOS provides a potential solution in the form of its DoubleSpace disk compression program. Realizing the need for such space savings, SCO OpenServer provides a solution in the form of the new DTFS. Among the issues that need to be addressed are not only the saving of space, but also the reliability of the data and avoiding the performance degradation that occurs when compressed files need to be uncompressed. On fast CPUs with fast hard disks, the performance hit because of the compression is noticeable.

The first issue (saving space) is addressed by the DTFS in a couple of ways. The first is that files are compressed before they are written to the hard disk. This can save anywhere from just a few percent in the case of binary programs to over 50% for text files. What you get will depend on the data you are storing.

The second way space is saved is in how inodes are stored on the disk. With "traditional" filesystems such as S51K or EAFS, inodes are pre-allocated. That is, when the filesystem is first created, a table for the inodes is allocated at the beginning of the filesystem. This table is the same size no matter how many or how few inodes are actually used. Inodes on a DTFS are allocated as needed. Therefore, there are only as many inodes as there are files.

As a result you never have any empty slots in the inode table. (Actually there is no inode table in the form we discussed for other filesystems. We'll get to this in a moment.) In order to distinguish these inodes from others, inodes on the DTFS are referred to as dtnodes.

Figure 0-7 The DTNODE map

The DTFS has many of the same features as the EAFS filesystem, such as file names of up to 255 characters and symbolic links. In addition, the DTFS also has multiple compression algorithms, greater reliability (through the integrated kernel update daemon, which attempts to keep the file system in a stable state), and a dynamic block allocation algorithm that can automatically switch between best-fit and first-fit. Best-fit is where the system looks for an appropriately sized spot on the hard disk for the file, and first-fit is where the system looks for the first one that is large enough (even if it is much larger than necessary).
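The two strategies can be sketched over a list of free areas. This only illustrates the definitions just given, not the DTFS allocator or its rule for switching between the two. The (start, length) tuples and the function names are assumptions made for the example.

```python
def first_fit(free_extents, needed):
    """Return the first free area (start, length) that is large enough."""
    for start, length in free_extents:
        if length >= needed:
            return (start, length)
    return None


def best_fit(free_extents, needed):
    """Return the smallest free area that is still large enough."""
    candidates = [extent for extent in free_extents if extent[1] >= needed]
    if not candidates:
        return None
    return min(candidates, key=lambda extent: extent[1])


# Hypothetical free areas on the disk, as (starting block, length in blocks).
free_extents = [(100, 64), (300, 8), (500, 12)]

print(first_fit(free_extents, 8))
print(best_fit(free_extents, 8))
```

For a file needing 8 blocks, first-fit takes the 64-block area simply because it comes first, while best-fit finds the 8-block area that matches exactly.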

As one might expect, the disk layout is different from other filesystems. The first block (block 0) was historically the "boot block" and has been retained for compatibility purposes. The second block (block 1) is the superblock and, like other filesystems, it contains global information about the filesystem.

Following the superblock is the block bitmap. There is one bit for each 512-byte data block in the filesystem, so the size of the bitmap will vary depending on the size of the filesystem. If the bit is on (1), the block is free, otherwise the block is allocated.

The block bitmap is followed by the dtnode bitmap. Its size is the same as the block bitmap since there is also one bit for each block. The difference is that these bits determine if the corresponding block contains data or dtnodes. A 1 indicates the block contains dtnodes and a 0 indicates data. Following these two bitmaps are the actual data and dtnode blocks. Since the dtnodes are scattered throughout the filesystem, there is no inode table.
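How the two bitmaps work together can be sketched as follows. The meaning of the bits (1 is free in the block bitmap, 1 is dtnode in the dtnode bitmap) and the 512-byte block size are taken from the text. The order of bits within a byte and all of the function names are assumptions made for the sketch.

```python
BLOCK_SIZE = 512  # bytes per data block, as in the text


def bitmap_bytes(fs_bytes):
    """Bytes needed for a bitmap holding one bit per 512-byte block."""
    nblocks = fs_bytes // BLOCK_SIZE
    return (nblocks + 7) // 8


def bit_is_set(bitmap, block):
    """Test one bit; lowest bit first within each byte is an assumption."""
    return bool(bitmap[block // 8] & (1 << (block % 8)))


def block_state(block_bitmap, dtnode_bitmap, block):
    """Classify a block using the two bitmaps."""
    if bit_is_set(block_bitmap, block):
        return "free"    # a 1 in the block bitmap means the block is free
    if bit_is_set(dtnode_bitmap, block):
        return "dtnode"  # a 1 in the dtnode bitmap means it holds dtnodes
    return "data"


# A made-up 16-block example: blocks 0-3 allocated, block 2 holds dtnodes.
block_bitmap = bytearray([0b11110000, 0b11111111])
dtnode_bitmap = bytearray([0b00000100, 0b00000000])

print(bitmap_bytes(100 * 1024 * 1024))
print([block_state(block_bitmap, dtnode_bitmap, b) for b in range(5)])
```

For a 100Mb filesystem each bitmap comes to 25600 bytes, or 50 blocks, which shows how the bitmap grows with the size of the filesystem.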

Unlike the inodes of other filesystems, dtnodes are not pre-allocated when the filesystem is created. Instead, they are allocated at the same time as the corresponding file. This has the potential for saving a great deal of space since every dtnode points to data in contrast to other filesystems where inodes may go unused and therefore the space they occupy is wasted.

The translation from dtnode number to block number is straightforward. The dtnode number is the same as the number of the block that it resides on. For example, if block 1256 holds a dtnode, then that dtnode's number is 1256. This means that since not all blocks contain dtnodes, not all dtnode numbers are used. The one exception to this is that the dtnode number of the root of the filesystem is stored in the superblock. Each dtnode is accessed through the dtnode map.
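Because the dtnode number and the block number are the same, the set of dtnode numbers in use can be read straight off the dtnode bitmap. The sketch below assumes the lowest bit comes first within each byte, and the function names are invented for the illustration.

```python
def mark_dtnode(dtnode_bitmap, block):
    """Flag a block as holding a dtnode (bit order is an assumption)."""
    dtnode_bitmap[block // 8] |= 1 << (block % 8)


def dtnode_numbers(dtnode_bitmap, nblocks):
    """List every dtnode number in use: the blocks whose bit is set."""
    return [block for block in range(nblocks)
            if dtnode_bitmap[block // 8] & (1 << (block % 8))]


def block_for_dtnode(dtnode_number):
    """The dtnode number is the number of the block it resides on."""
    return dtnode_number


nblocks = 2048
dtnode_bitmap = bytearray(nblocks // 8)
for block in (5, 9, 1256):  # made-up blocks that hold dtnodes
    mark_dtnode(dtnode_bitmap, block)

print(dtnode_numbers(dtnode_bitmap, nblocks))
print(block_for_dtnode(1256))
```

Only three of the 2048 possible numbers are in use here, which is what is meant by not all dtnode numbers being used.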

The contents of the superblock are found in the files location in <sys/fs/ >. If you take a quick look at it you see