Jim Mohr's SCO Companion
Copyright 1996-1998 by James Mohr. All rights reserved. Used by permission of
Be sure to visit Jim's great Linux Tutorial web site at http://www.linux-tutorial.info/
File Systems and
Any time you access an SCO system, whether locally, across a network,
or through any other means, both files and filesystems are involved.
Every program that you run starts out as a file. Most of the time you
are also reading or writing a file. Since files (whether programs or
data files) reside on filesystems, every time you access the system
you are also accessing a filesystem.
Knowing what a file is, how it is represented on the disk, and how
the system interprets its contents helps you understand what the
system is doing. You can also use this understanding to evaluate
whether the system and your applications are behaving properly.
To be able to access data on your hard disk, there has to be some
pre-defined structure. Without structure, it ends up looking like my
desk where there are several piles of papers and I have to look
through every pile in order to find what I am looking for. Instead,
the layout of a hard disk follows a very consistent pattern. So
consistent, that it is even possible for different operating systems
to share the hard disk.
Basic to this structure is the concept of a partition. A
partition defines a portion of the hard disk to be used by one
operating system or another. The partition can be any size, including
the entire hard disk. Near the very beginning of the disk is the
partition table. The block holding the partition table is only 512 bytes,
but it can still define where each partition begins and how large it is. In
addition, the partition table indicates which of the partitions is
active. This decides which partition the system should go to
when looking for an operating system to boot. The partition table is
outside of any partition.
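To make the idea concrete, here is a minimal sketch of reading a partition table. It assumes the conventional PC layout, which is not spelled out in the text above: four 16-byte entries starting at byte 446 of the first 512-byte sector, a flag byte of 0x80 marking the active partition, and the bytes 0x55 0xAA closing the sector. The function names and the sample values are made up for the illustration.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # assumption: conventional PC layout
ENTRY_SIZE = 16
ACTIVE_FLAG = 0x80   # assumption: flag byte of the active partition


def parse_partition_table(sector):
    """Return one dict per non-empty entry in the partition table."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing signature at the end of the sector")
    partitions = []
    for slot in range(4):
        start = TABLE_OFFSET + slot * ENTRY_SIZE
        # flag, 3 skipped bytes, type, 3 skipped bytes, first sector, sector count
        flag, ptype, first, count = struct.unpack_from("<B3xB3xII", sector, start)
        if ptype == 0:
            continue  # unused entry
        partitions.append({
            "slot": slot,
            "active": flag == ACTIVE_FLAG,
            "type": ptype,
            "first_sector": first,
            "sectors": count,
        })
    return partitions


def make_demo_sector():
    """Build a sector with two partitions; all values are arbitrary examples."""
    sector = bytearray(SECTOR_SIZE)
    struct.pack_into("<B3xB3xII", sector, TABLE_OFFSET, ACTIVE_FLAG, 0x63, 63, 1000000)
    struct.pack_into("<B3xB3xII", sector, TABLE_OFFSET + ENTRY_SIZE, 0x00, 0x06, 1000063, 500000)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


if __name__ == "__main__":
    for part in parse_partition_table(make_demo_sector()):
        print(part)
```

The point of the sketch is only that a start, a size and an "active" flag per partition fit comfortably into a single block.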
Once the system has determined which partition is active, the CPU
knows to go to the very first block of data within that partition and
begin executing the instructions there. On an SCO system this is an
area called boot0.
Although there are only 512 bytes of data in boot0,
1024 bytes are reserved for it. The code within boot0
is sufficient to execute the code in the next block, boot1.
Here, 20Kb are reserved, although the actual code is
slightly less. The code within boot1
is what reads the /boot
program, which will eventually load the kernel.
Immediately after boot1
is the division table. Under SCO, a division is a unit
of the hard disk contained within a partition. A division can be any
size, including the entire partition. Often, special control
structures are created at the beginning of the division that impose
an additional structure on that division. This structure makes the
division a filesystem. In order to keep track of where each
division starts and how big it is, the system uses the division
table. The division table has functionality similar to that of a
partition table, although there is no such thing as an "active"
division. There can be up to seven divisions (and therefore seven
filesystems) per partition. The size of the division table is
fixed at 130 bytes, although 1024 bytes are reserved for the table.
Just after the division table is the bad track table. A
bad track is a portion of the hard disk that has become unusable.
Immediately following the bad track table is an area that is used for
alias tracks. These are the tracks that are used when one of
the other tracks goes bad. If that occurs, the operating system marks
the bad track as such in the bad track table and indicates which of
the alias tracks will be used. The size of the area taken up by the
alias tracks is determined by how many entries are in the bad track
table. (There is one alias track per table entry.) You can see the
contents of your bad track table by using the badtrk
utility. Once the table and alias tracks have been defined, you
cannot increase the number without re-installing.
Just after the bad track table are the divisions. If you have one of
the older SCO UNIX filesystems (AFS, EAFS), there are two control
structures at the beginning of the filesystem: the superblock
and the inode table. The superblock contains
information about the type of filesystem, its size, how many data
blocks there are, the number of free inodes, free space available and
where the inode table is.
Many users are not aware of the fact that different filesystems reside
on different parts of the hard disk and in many cases on different
physical disks. From the user's perspective the entire directory
structure is one unit from the top (/) down to the deepest
sub-directory. In order to carry out this deception, the system
administrator needs to mount filesystems. This is done
by mounting the device node associated with the filesystem (e.g.
/dev/u ) onto a
mountpoint (e.g. /u).
This can either be done by hand, with the mount
command line or by having the system do it for you when booting. This
is done with entries in /etc/default/filesys.
See the mount(ADM)
and the filesys(F) man-pages
for more details.
Conceptually, the mountpoint serves as a detour sign for the system.
If there is no filesystem mounted on the mountpoint, the system can
just drive through and access what's there. If a filesystem is
mounted, when the system gets to the mountpoint it sees the detour
sign and is immediately diverted in another direction. Just as the
road, trees and houses still exist on the other side of the detour
sign, any file or directory that exists underneath the mountpoint is
still there. You just can't get to it.
Let's look at an example. We have the /dev/u
filesystem which we are mounting on /u.
Let's say that when we first installed the system, and before we first
mounted the /dev/u
filesystem, we created some users with their home directories in /u.
For example, /u/jimmo.
When we do finally mount the /dev/u
filesystem onto the /u
directory, we no longer see /u/jimmo.
It is still there; however, once the system reaches the /u
directory it is redirected somewhere else.
This brings up an interesting phenomenon. When you use find
to locate a file, it will reach the mount point and get
redirected. However, ncheck
is not file and directory oriented, but rather filesystem oriented.
If you used find you would
not see /u/jimmo. However,
you would if you used ncheck!
When a filesystem is mounted, the kernel reads the filesystem's
superblock into an internal copy of the superblock. This way, the
kernel doesn't have to keep going back to the hard disk for this
information.
Figure 0-1 Hard Disk Layout
The inode is (perhaps) the most important structure. It
contains all the information about a file, including owner and group,
permissions, creation time, and most importantly: where the data
blocks are on the hard disk. The only thing it's missing is the name
of the file. That's stored in the directory and not in the inode.
If you have a Desktop Filesystem (DTFS), then there is no inode
table. Rather, the inodes are scattered across the disk. How they are
accessed, we'll get into later when we talk about the different
filesystems.
After the superblock (and inode table, if there is one) you get to
the actual data. Data is stored in a system of files within each
filesystem (hence the name). As we talked about before in the section
on SCO basics, files are grouped together into directories. This
grouping is completely theoretical in the sense that there is nothing
physically associating the files. Not only can files in the same
directory be spread out across the disk, it is possible that the
individual data blocks of a file are scattered as well.
Figure 0-1 shows you where all the structures are on the hard disk.
In most systems, there will be at least two divisions on your
root hard disk. On ODT 3.0 systems these divisions will contain your
root filesystem and your swap space. Although it takes up a
division, just like the root filesystem, your swap space is not a
filesystem. This is because it has none of the control
structures (superblock, inode table) that make it a filesystem.
Despite this, there must still be an entry in the division table for
it. In OpenServer, there is a new filesystem at the beginning of the
partition and the root filesystem is moved to a place after the swap
space. We'll go into more details later. (NOTE: Whether you have
the extra division will depend on what kind of installation you
did. We'll cover this in more detail in chapter 13.)
Up to this point we've talked a great deal about both files and
directories, where they reside and what their attributes
(characteristics) are. Now it's time to talk about the concepts of
files and directories. We need to talk about how the operating
system sees files and directories and how the system manages them.
From our discussion of how a hard disk is divided, we know
that files reside within filesystems. Each filesystem has special
structures that allow the system to manage the files. These are the
superblock and inodes. The actual data is stored somewhere in
the filesystem in datablocks. Most SCO UNIX filesystems use a
block size of 1024 bytes. If you have OpenServer, the new DTFS has a
variable block size.
Every SCO UNIX filesystem uses inodes, which, as I mentioned
earlier, contain the important information about a file. (In some
books, inode is short for information node and in others it is short
for index node.) Although the structure of the inodes is different
for each filesystem, they hold the same kinds of information. What
each contains can be found in <sys/ino.h>,
<sys/inode.h> and <sys/fs/*>.
Each inode has pointers which tell it where the actual data is
located. How this is done is dependent on the filesystem type and
we'll get to that in a moment. One piece of information that the
inode does not contain is the file name. This is contained
only within the directory.
If you are running OpenServer, then there are at least three
divisions used. The first one (slot 0 in the division table) is used
for the /dev/boot
filesystem. This contains the files that are necessary to load and
start the operating system. Although this is what is used to start
the system, this is not the root filesystem. The root filesystem has been
moved to the third division. Once the system has been loaded, the
/dev/boot filesystem is
mounted onto the /stand
directory and is accessible like any other mounted filesystem, except
for the fact that it is normally mounted as read-only. In both ODT
3.0 and OpenServer, the root filesystem normally contains most of the
files your operating system uses.
Depending on the size of your primary hard disk and the configuration
options you chose during installation, you may have more than just
these default filesystems. Common configurations include having
separate filesystems for users' home directories or data.
Extended Acer Fast
System V 1KB filesystem
High Sierra CD-ROM
ISO 9660 CD-ROM
Table 0.1 Filesystems Supported by ODT 3.0
Extended Acer Fast
System V 1KB filesystem
High Sierra CD-ROM
ISO 9660 CD-ROM
SCO Gateway for
LAN Manager Client
Table 0.2 Filesystems Supported by OpenServer
Although all of these filesystems are supported, not all are
configured into your kernel by default. If you have ODT 3.0, you
automatically have support for the three standard UNIX filesystems
(EAFS, AFS and S51K) as well as the XENIX filesystem. In OpenServer,
the three UNIX filesystems supported in ODT are included, as well as
the XENIX filesystem and the two new ones: HTFS and DTFS. In order
to be recognized, the other filesystems must first be configured in the
kernel. How this is accomplished depends on what product you are running
and what filesystem. Table 0.1 and Table 0.2 show what filesystems are
supported.
If you want to use one of the network filesystems such as NFS, SCO
Gateway for NetWare, or LAN Manager Client Filesystem, you need to
add that product through the Software Manager. This will
automatically add support into the kernel for the appropriate
filesystem.
If you have ODT 3.0, then you can use sysadmsh
to add the driver for each of the other filesystems. On OpenServer,
you use the Hardware/Kernel Manager. In both cases, there are mkdev
scripts that will do this. In fact, these scripts are called by
sysadmsh and the
Hardware/Kernel Manager. Table 0.3 shows you which script is run for
each filesystem.
Table 0.3 Script associated with each filesystem
From the filesystem's standpoint, there is no difference between a
directory and a file. Both take up an inode, use data blocks and have
certain attributes. It is the commands and programs that we use that
impose structure on the directory. For example, /bin/ls
imposes structure when we do listings of directories.
Keep in mind that it is the ls
command that puts the file names "in order." Within the
directory, the file names do not appear in any order.
Initially, files appear in the directory in chronological order. As
files are created and removed, the slots taken up by older files are
replaced by newer ones and even this order disappears.
Other commands, such as /bin/cat
or /bin/hd, allow us to
see the directories as files, without any structure. Note that in
OpenServer these commands don't let you see the structure,
which I see as a loss in functionality.
When you do a long listing of a file (ls
-l or l), you can
learn a lot about the characteristics of a file and directory. For
example, if we do a long listing of /bin/date, we get:
-rwx--x--x 1 bin bin 17236 Dec 14 1991 /bin/date
We can see the type of file we have (the '-' in the first position
says it's a regular file), the access permissions (rwx--x--x),
how many links it has (1;
we'll talk more about these in a moment), the owner and group
(bin and bin), the size (17236), the date it
was last written to (Dec 14, 1991; the time as well, if it is
a newer file), and the name of the file (/bin/date).
For additional details on this, see the ls(C) man-page.
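The same fields can be pulled out programmatically. What follows is a small sketch, not anything SCO-specific: it uses Python's standard os and stat modules on whatever UNIX-like system you happen to run it on, and the helper names are mine. It builds a scratch file with the size and permissions of the /bin/date example and reads back what a long listing would show.

```python
import os
import stat
import tempfile


def long_listing_fields(path):
    """Collect the fields a long listing shows for one file."""
    st = os.lstat(path)
    return {
        "mode": stat.filemode(st.st_mode),  # file type plus permissions
        "links": st.st_nlink,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "mtime": st.st_mtime,               # seconds since the epoch
        "name": path,
    }


def demo():
    """Create a scratch file shaped like the /bin/date example and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "date")
        with open(path, "wb") as fh:
            fh.write(b"x" * 17236)
        os.chmod(path, 0o711)
        return long_listing_fields(path)


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

Everything in that dictionary except the name comes back from a single call that asks the system about the file itself, not about the directory it lives in.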
Unlike operating systems like DOS, most of this information is not
stored in the directory. In fact, the only information that we see
here, which is actually stored in the directory is the file's name.
If not in the directory, where is this other information kept
and how do you figure out where on the hard disk the data is?
As I mentioned before, this is all stored in the inode. All the
inodes on each file system are stored at the beginning of that
filesystem in the inode table. The inode table is simply a set of
these inode structures. If you want, you can see what the structure
looks like, by taking a peek at <sys/ino.h>.
To access the information in the inode, you need the inode number.
Each directory entry consists of an inode number and file name pair.
On ODT 3.0 and earlier, the first two bytes of each entry were the
inode number. Since two bytes can hold 256*256 (65536) different
values, the highest possible inode number was 65535. The inode number
simply points to a particular entry in the inode table. This is the only
connection there is between a filename and its inode, and therefore the
only connection between the filename and the data on the hard
disk.
Because this is only a pointer and there is no physical connection,
there is nothing preventing you from having multiple entries in a
directory pointing to the same file. These would have different
names, but have the same inode number and therefore point to the same
physical data on the hard disk. Having multiple file names on your
system point to the same data on the hard disk, is not a sign
of filesystem corruption! This is actually something that is done on
purpose.
For example, if you do a long listing of /bin/ls
(l /bin/ls) you see:
-r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls
Here the number of links (Column 2) is 6. That means there are five
other files on the system with the same inode number as
/bin/ls. In fact that's all a link is: a file with the same
inode on the same filesystem. (More on that later.) To find out
what inode that is, let's add the -i option to give us:
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 /bin/ls
From this we see that /bin/ls
occupies entry 167 in the inode table. There are three ways of
finding out what other files have this inode number:
find / -inum 167 -print
ncheck -i 167 /dev/root (we're assuming /bin/ls is on the root filesystem)
l -iR / | grep '167'
Since I know they are all in the /bin
directory, I'll try the last one. This gives me:
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 l
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lc
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lf
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lr
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 ls
167 -r-xr-xr-t 6 bin bin 23672 Dec 14 1991 lx
Interesting. This is the entire family of ls
commands. All of these lines look identical, with the exception of
the file name. There are six lines, which matches the number of
links. Each has an inode of 167, so we know that all six names have
the same inode and therefore point to the same location on the hard
disk. That means that whenever you execute any one of these commands,
the same program is started. The only difference is the behavior, and
that is based on what name you actually use on the command line.
Since the program knows what name it was started with, the program
can change its behavior accordingly.
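If you would like to watch links being made without touching /bin, here is a minimal sketch in Python. It is an illustration only: it works in a scratch directory, the file names simply mimic the ls family, and find_by_inode is a made-up helper that does roughly what find with -inum does, using the standard os.walk and os.lstat calls.

```python
import os
import tempfile


def find_by_inode(root, inode, device):
    """Return the sorted paths under root that share one inode on one device."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_ino == inode and st.st_dev == device:
                matches.append(path)
    return sorted(matches)


def demo():
    """Create one file, give it five extra names, and look them all up again."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "ls")
        with open(original, "w") as fh:
            fh.write("pretend this is a program\n")
        for alias in ("l", "lc", "lf", "lr", "lx"):
            os.link(original, os.path.join(tmp, alias))  # hard link: new name, same inode
        st = os.stat(original)
        paths = find_by_inode(tmp, st.st_ino, st.st_dev)
        return st.st_nlink, [os.path.basename(p) for p in paths]


if __name__ == "__main__":
    links, names = demo()
    print(links, names)
```

The helper compares the device as well as the inode number, because an inode number is only meaningful within one filesystem.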
There is nothing special about the fact that these are all in the
same directory. A name must only be unique within a single directory.
You can therefore have two files with the same basename in two
separate directories. For example, /bin/mail
and /usr/bin/mail. If you
take a look, these not only have the same inode number (and are
therefore the same file), there are actually three links. The third
link being /usr/bin/mailx.
So, here we have two files in the same directory (/usr/bin/mailx
and /usr/bin/mail) as well
as two files with the same basename (/bin/mail
and /usr/bin/mail). All of
which have the same inode and are, therefore, all the same file.
The key issue here is that all three of these files exist on
the same filesystem, /dev/root.
As I mentioned before, there may be files on other filesystems that
have the same inode number. This is the reason why you cannot create a link
between files on two different filesystems. With a little
manipulation, you might be able to force two files with identical
content to have the same inode on two filesystems. However, these are
not links (just two files with the same name and same content).
The problem is that it may be necessary to create links across
filesystems. One reason is that you might want to create a path with
a much shorter name that is easier to remember. Or perhaps you have
several remote filesystems, accessed through NFS and you want to
create a common structure on multiple machines all of which point to
the same file. Therefore you need a mechanism that allows links
across filesystems (even remote filesystems). This is the concept of
a soft or symbolic link.
Symbolic links were first introduced to SCO UNIX in release
3.2.4.0. In SCO OpenServer they are (perhaps) the primary means of
referring to installed files. For more information on referencing
installed files, see the section on Software Storage Objects. Unlike
hard links, symbolic links take up data blocks on the hard disk and
therefore have a unique inode number. However, they only need one
data block as the contents of that block is the path to the file you
are referring to. Note that if the name is short enough the symbolic
link may be stored directly in the inode. See Table 0.4 for details
on filesystem characteristics.
For example, if I had a file on a /u filesystem,
/u/data/albert.letter, I could create a symbolic link to it as
/usr/jimmo/letters/albert.letter
(no, it doesn't have to have the same name). The one data block
assigned to the symbolic link /usr/jimmo/letters/albert.letter
would contain the path /u/data/albert.letter.
Whenever I access the file /usr/jimmo/letters/albert.letter,
the system (the file system driver) knows that this is a symbolic
link. The system then reads the full path out of the data block and
accesses the "real" file. Since the data block contains only
a path, you could have a filesystem mounted via NFS where the data is
stored on a remote machine. Whatever you are using to access that
file (e.g. an application, a system utility) cannot tell the
difference.
For example, I might have a file in my own bin
directory that points to a nifty utility on my friend's machine. I
have a filesystem from my friend's machine mounted to my /usr/data
directory. I could create a symbolic link like this:
ln -s /usr/data/nifty /usr/jimmo/bin/nifty
I would therefore have a symbolic link, /usr/jimmo/bin/nifty
that looked like this:
lrwxrwxrwx 1 root other 15 May 03 00:12 /usr/jimmo/bin/nifty -> /usr/data/nifty
We see two ways that this is a symbolic link. First, the first
character of the permissions field is an 'l'. Next, the name of the
file itself is different than we are used to: we see the
name of the file that we use (/usr/jimmo/bin/nifty)
and a (sort-of) arrow that points to the "real" file
(/usr/data/nifty). Note
that there is nothing here that tells us that a remote filesystem is
mounted onto /usr/data.
The conversion is accomplished by the filesystem driver when the
actual file is accessed.
If you were to use just the ls
command, then you would not see either the type of file (l)
or the ->, so there is no way to know that this is a symbolic
link. If you use lf, then
the file is followed by an at-sign (@), which tells you that the
file is a symbolic link.
Keep in mind that when the system determines that you are
trying to access a symbolic link, the system then goes out and tries
to access the "real" file and behaves accordingly.
Therefore, symbolic links can also point to directories, or any other
kinds of files, including other symbolic links.
Be careful when making a symbolic link. When you do, the
system does not check to see that the source file exists. It is
therefore possible to have a symbolic link point back to itself or to
point to nothing. Although most system utilities and commands can
catch things like this, do not rely on it. Besides, what's the point
of having a dog chasing its own tail? It is also advisable not to use
any relative paths when using symbolic links. This may have
unexpected results when accessing the links from elsewhere on your
system.
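Both points, that a symbolic link stores nothing but a path and that nobody checks whether the path leads anywhere, can be seen in a short sketch. This is generic Python on a UNIX-like system, not an SCO utility; the file names and the describe_link helper are invented for the example.

```python
import os
import tempfile


def describe_link(path):
    """Report what a symbolic link stores and whether its target exists."""
    return {
        "is_symlink": os.path.islink(path),      # looks at the link itself
        "stored_path": os.readlink(path),        # the path kept inside the link
        "target_exists": os.path.exists(path),   # follows the link
    }


def demo():
    """Make one working link and one that points to nothing."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "nifty")
        with open(target, "w") as fh:
            fh.write("#!/bin/sh\n")
        good = os.path.join(tmp, "good-link")
        dangling = os.path.join(tmp, "dangling-link")
        os.symlink(target, good)
        # No error here, even though the target was never created.
        os.symlink(os.path.join(tmp, "no-such-file"), dangling)
        return describe_link(good), describe_link(dangling), target


if __name__ == "__main__":
    good, dangling, _target = demo()
    print(good)
    print(dangling)
```

The second link is created without complaint; the problem only shows up when something tries to follow it.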
Let's go back to the actual structure of the directory entries for a
minute. Remember that directories are simply files that have a
structure imposed on them by something. If the command or utility
imposes the (correct) structure, then each directory entry takes the
form of 2 bytes for the inode and 14 bytes for the file name itself.
Another change to the system that came in SCO UNIX 3.2.4.0 was the
introduction of long filenames. Up to this point, file names were
limited to 14 characters. With two bytes for the inode, 64 of these
16-byte structures fit exactly into a disk block. However, having only
14 bytes for the name often made giving files meaningful names
difficult. I don't know how many times I personally spent time trying
to remove vowels and otherwise squish the name together so that it
fit in 14 characters. The default filesystem on SCO UNIX 3.2.4.0
changed all that.
One thing I liked about having 16 bytes was that a directory
entry fit nicely into the default output of hd.
That way you could easily see the internal structure of the
directory. I don't know how many times I used hd
when talking with customers with filesystem problems. However, the hd
included in the initial release of OpenServer won't let you do
this. In my opinion, removing that very useful functionality broke
Up to 3.2.4.0, SCO UNIX used the Acer File System (AFS), which had
some advantages over the standard UNIX (S51K) filesystem. However,
neither can handle symbolic links or long file names. The Extended
Acer File System (EAFS) changed that. Since the directory entries of
the AFS were 16 bytes long and a file has only one inode, file names
longer than 14 characters have to "spill over" into consecutive
entries in the directory. Since they are
taking up multiple slots, all but the last entry have the
inode number of '0xffff'. This indicates the file name continues on
in the next slot. Even with long file names, file names on an EAFS
are limited to 255 characters.
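The slot scheme just described can be sketched in a few lines of Python. Treat this as a reading aid under stated assumptions, not as the real on-disk format: I assume the two-byte inode number is stored low byte first (as on Intel hardware) and that unused name bytes are filled with NUL characters, and make_slots is a hypothetical helper that exists only to build test data.

```python
import struct

ENTRY_SIZE = 16
NAME_BYTES = 14
CONTINUES = 0xFFFF  # the name continues in the next slot
REMOVED = 0         # the slot belonged to a file that was removed


def parse_directory(raw):
    """Yield (inode, name) pairs from a run of 16-byte directory slots."""
    pending = b""
    for offset in range(0, len(raw) - ENTRY_SIZE + 1, ENTRY_SIZE):
        inode, chunk = struct.unpack_from("<H14s", raw, offset)
        if inode == REMOVED:
            pending = b""
            continue
        if inode == CONTINUES:
            pending += chunk  # keep collecting pieces of a long name
            continue
        name = (pending + chunk).rstrip(b"\0").decode("ascii")
        pending = b""
        yield inode, name


def make_slots(inode, name):
    """Build the slot or slots for one name (hypothetical helper)."""
    data = name.encode("ascii")
    chunks = [data[i:i + NAME_BYTES] for i in range(0, len(data), NAME_BYTES)] or [b""]
    slots = b""
    for index, chunk in enumerate(chunks):
        number = inode if index == len(chunks) - 1 else CONTINUES
        slots += struct.pack("<H14s", number, chunk)  # pads the name with NULs
    return slots


def demo():
    raw = (make_slots(2, ".") + make_slots(2, "..")
           + make_slots(REMOVED, "oldfile")
           + make_slots(167, "ls")
           + make_slots(301, "a_rather_long_file_name.txt"))
    return list(parse_directory(raw))


if __name__ == "__main__":
    for inode, name in demo():
        print(inode, name)
```

The 27-character name in the demo occupies two slots: the first carries 0xffff and the second carries the real inode number.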
When files are removed, the inode entry in the directory is changed
to 0. Do an hd of the
directory (if you're running ODT) and you still see the file
name, but the inode is 0. When a new file is created, the file name
takes up a slot used by an older, previously removed file if the
name can fit. Otherwise it must take a new slot. Since long names
need to be in consecutive slots, they may not be able to take up
empty slots. If so, new entries may need to be created for longer
names.
When you create a file, the system looks in the directory for the
first available slot. If this is an EAFS, then it is possible that
the file you want to create might not fit in the first slot. Remember
that each slot is 16 bytes long. Two for the inode number and 14 for
the file name. If, for example, slots 16 and 18 are filled, and slot
17 is free, a file name that is longer than 14 characters cannot fit
there. This is because the directory entries must be contiguous.
The system must therefore either find a run of free slots large enough or
create new slots at the end of the directory. For example, if slots 14 and
18 were taken, but slots 15-17 were free, any file name of less than 42
characters (14*3) would fit. Anything larger would need to go
at the end of the directory.
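The search for a place to put a new name amounts to a first-fit scan for enough consecutive free slots. Here is a minimal sketch of that idea. It is my own simplification and makes no claim to match the actual filesystem driver: in particular, it treats a run of free slots at the very end as too short and simply appends, and the function names are invented.

```python
import math

NAME_BYTES = 14  # name bytes available in one 16-byte slot


def slots_needed(name):
    """Number of consecutive slots a file name occupies."""
    return max(1, math.ceil(len(name) / NAME_BYTES))


def find_slot(used, name):
    """Return the index of the first slot of a free run that fits name.

    used holds one boolean per existing slot (True means taken). When no
    run is long enough, the result is len(used), meaning the name goes
    at the end of the directory.
    """
    need = slots_needed(name)
    run_start = None
    for index, taken in enumerate(used):
        if taken:
            run_start = None
            continue
        if run_start is None:
            run_start = index
        if index - run_start + 1 == need:
            return run_start
    return len(used)


def demo():
    # Slots 0-18 exist; only slot 17 is free.
    one_hole = [True] * 19
    one_hole[17] = False
    # Slots 0-18 exist; slots 15, 16 and 17 are free.
    three_holes = [True] * 19
    for slot in (15, 16, 17):
        three_holes[slot] = False
    return {
        "short name, one hole": find_slot(one_hole, "short"),
        "15 chars, one hole": find_slot(one_hole, "x" * 15),
        "41 chars, three holes": find_slot(three_holes, "x" * 41),
        "43 chars, three holes": find_slot(three_holes, "x" * 43),
    }


if __name__ == "__main__":
    for case, slot in demo().items():
        print(case, slot)
```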
If you were to count up all the bytes in the inode structure in
ino.h, you'd find that
each inode is 64 bytes. This means that there are 16 per disk block
(16*64=1024). In order to keep from wasting space, the system will
always create filesystems with the number of inodes being a multiple
of 16.
Inode 1 is always at the start of the 3rd block of the filesystem
(bytes 2048- 2111) and is reserved (not used). Inode 2 is always the
inode of the root directory of any filesystem. You can see this for the
root filesystem by doing ls
-id /. (The -d is
necessary so you only see the directory and not the contents.)
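Given that inode 1 starts at byte 2048 and every inode is 64 bytes, finding any inode in the table is plain arithmetic. The sketch below works that out for the 1024-byte block size used in this discussion; the function name is made up, and block numbers are counted from 0, so the "3rd block" is block 2.

```python
BLOCK_SIZE = 1024
INODE_SIZE = 64
INODE_TABLE_START = 2 * BLOCK_SIZE          # inode 1 starts the 3rd block
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE  # 16


def inode_location(number):
    """Return (byte offset, block index, slot within block) for an inode."""
    if number < 1:
        raise ValueError("inode numbers start at 1")
    offset = INODE_TABLE_START + (number - 1) * INODE_SIZE
    block = offset // BLOCK_SIZE
    slot = (offset % BLOCK_SIZE) // INODE_SIZE
    return offset, block, slot


if __name__ == "__main__":
    for number in (1, 2, 167):
        print(number, inode_location(number))
```

For inode 1 this gives bytes 2048 through 2111, which matches the range quoted above.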
The total number of inodes on an AFS or EAFS is defined when the
filesystem is created by mkfs(ADM).
Normally, you use the divvy
command (by hand or through SCOAdmin in OpenServer) to create
filesystems. The divvy
command will then call mkfs
to create the filesystem for you. The number of inodes created is
based on an average file size of 4K. If you have a system that has
many smaller files, such as a mail or news server, you could run out
of inodes and still have lots of room on your system.
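To get a feel for the numbers, here is a back-of-the-envelope sketch. The exact formula mkfs uses is not given here, so this is only my assumption of "one inode per average-sized file, rounded up to a whole block of 16 inodes"; the function names and the 40MB example are invented.

```python
BLOCK_SIZE = 1024
INODES_PER_BLOCK = 16
AVERAGE_FILE_SIZE = 4 * 1024  # the default assumption of 4K per file


def estimate_inodes(filesystem_bytes, average_file_size=AVERAGE_FILE_SIZE):
    """One inode per average-sized file, rounded up to a multiple of 16."""
    wanted = max(1, filesystem_bytes // average_file_size)
    blocks = -(-wanted // INODES_PER_BLOCK)  # ceiling division
    return blocks * INODES_PER_BLOCK


def inode_table_bytes(inodes):
    """Space the inode table takes for a given (multiple of 16) inode count."""
    return (inodes // INODES_PER_BLOCK) * BLOCK_SIZE


if __name__ == "__main__":
    size = 40 * 1024 * 1024
    for average in (4096, 1024):
        inodes = estimate_inodes(size, average)
        print(average, inodes, inode_table_bytes(inodes))
```

On a hypothetical 40MB filesystem, dropping the assumed average file size from 4K to 1K quadruples both the number of inodes and the room the inode table takes away from your data.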
Therefore, if you have a news or mail server, it is a good idea to
use mkfs by hand to create
the filesystem before you add any files. Remember that the
inode table is at the beginning of the filesystem and takes up as
much room as it needs for a given number of inodes. If you want to
have more inodes, you must have a larger inode table. The only place
for the inode table to grow is into your data. Therefore, you would
end up overwriting data. Besides, running mkfs
'zeroes' out your inode table, so the pointers to the data are lost.
Among other things that the inode keeps track of are file types and
permissions, number of links, owner and group, size of the file and
when it was last modified. The inode is also where you will find
thirteen pointers (or triplets)
to the actual data on the hard disk.
Note that these triplets are pointers to the data and not the data
itself. Each one of the thirteen pointers to the data is a block
address on the hard disk. For the following discussion, please refer
to Figure 0-3.
Each of these blocks is 1024 bytes (1k), therefore the maximum file
size on an SCO UNIX system is 13Kb. Wait a minute! That doesn't sound
right, does it? In fact it isn't. If (and that's a big if) all of the
triplets pointed to data blocks, then you could only have a file up
to 13Kb. However, there are dozens of files in the /bin directory
alone that are larger than 13Kb. How's that?
The answer is that only the first ten of these triplets point to
actual data. These are referred to as direct data blocks. The
11th triplet points to a block on the hard disk which actually
contains the real pointers to the data. These are the indirect
data blocks and contain 4-byte values, so there are 256 of them
in each block. In Figure 0-3, the 11th triplet contains a pointer to
block 567. Block 567 contains 256 pointers to indirect data blocks.
One of these pointers points to block 33453, which contains the
actual data. Block 33453 is an indirect data block.
Since the data blocks pointed to by the 256 pointers in block 567
each contain 1K of data, there is an additional 256K of data. So,
with the 10K for the direct data blocks and the 256K for the indirect
data blocks, we now have a maximum file size of 266K.
Hmmm. Still not good. Although there aren't that many, there are
files on your system larger than 266K. A good example is /unix.
So, that brings us to triplet 12. This points not to data blocks, not
to a block of pointers to data blocks, but to blocks that point to
blocks that point to data blocks. These are the doubly-indirect
data blocks.
In Figure 0-3 the 12th triplet contains a pointer to block 5601.
Block 5601 contains pointers to other blocks, one of which is block
5151. However, block 5151 does not contain data, but more pointers.
One of these points to block 56732. It is block 56732 that finally
contains the data.
We have a block of 256 entries that each point to a block which each
contain 256 pointers to 1024 byte data blocks. This gives us 64Mb,
just for the doubly-indirect data blocks. At this point, the
additional size gained by the single-indirect and direct data blocks
is negligible. Therefore, let's just say we can access over 64Mb.
Now, that's much better. You would be hard pressed to find a system
with files larger than 64Mb (unless we are talking about large
database applications). However, we're not through yet. We have one
triplet left.
So as not to bore too many of you, let's do the math quickly.
The last triplet points to a block containing 256 pointers to other
blocks, each of which point to 256 other blocks. At this point, we
already have 65536 blocks. Each of these 65536 blocks contain 256
pointers to the actual data blocks. Here we have 16777216 pointers to
data blocks, which gives us a grand total of 17179869184 or 16Gb of
data (plus the insignificant 64MB we get from the doubly indirect
data blocks). Oh, as you might have guessed, these are the triply
indirect data blocks.
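If you want to check the arithmetic of the last few paragraphs, the following sketch does it in Python. It uses only the figures from this discussion (1024-byte blocks, 4-byte pointers, ten direct triplets and one each for single, double and triple indirection); the function names are mine.

```python
BLOCK_SIZE = 1024
POINTER_SIZE = 4
POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 256
DIRECT_POINTERS = 10


def capacity_by_level():
    """Bytes reachable through each kind of pointer in one inode."""
    return {
        "direct": DIRECT_POINTERS * BLOCK_SIZE,
        "single": POINTERS_PER_BLOCK * BLOCK_SIZE,
        "double": POINTERS_PER_BLOCK ** 2 * BLOCK_SIZE,
        "triple": POINTERS_PER_BLOCK ** 3 * BLOCK_SIZE,
    }


def level_for_block(logical_block):
    """Name the pointer level that reaches block N of a file (N starts at 0)."""
    limit = DIRECT_POINTERS
    if logical_block < limit:
        return "direct"
    for name, power in (("single", 1), ("double", 2), ("triple", 3)):
        limit += POINTERS_PER_BLOCK ** power
        if logical_block < limit:
            return name
    raise ValueError("block is beyond what thirteen pointers can reach")


if __name__ == "__main__":
    for level, size in capacity_by_level().items():
        print(level, size)
    print(level_for_block(9), level_for_block(10), level_for_block(266))
```

The second function shows where the boundaries fall: block 265 of a file is the last one reached through the single-indirect block, and block 266 is the first that needs the doubly-indirect route.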
Figure 0-3 Inodes Pointing to Disk Blocks
In Figure 0-3 triplet 13 contains a pointer to block 43. Block 43
contains 256 pointers, one of which points to block 1979. Block 1979
also contains 256 pointers, one of which points to block 988. Block
988 also contains 256 pointers. However, these pointers point to
the actual data, for example, block 911.
If you are running an ODT 3.0 (or earlier) system, 16Gb is not your
actual size limit. This is the theoretical limit placed on you by
the number of triply indirect data blocks. Since you need to keep
track of the size of the file, and this is stored in the inode table
as a signed long integer (31 bits), the actual limit is 2Gb.
As I mentioned a moment ago, when a file is removed all that is done
is that the inode number in the directory entry is set to 0. However,
the slot remains. In most cases
this is not a problem. However, when mail gets backed up, for
example, there can be thousands of files in the mail spool
directories. Each one of these requires a slot within the directory.
As a result, the directory files can grow to amazing sizes. I have
seen directories where the size of the directory file was over
300,000 bytes. This equates to about 20,000 files.
This brings up a couple of interesting issues. Remember that there
are 10 direct data blocks for 10KB, then 1 singly-indirect block for
256KB, for a total of 266KB for the direct and singly indirect data
blocks together. If you have a case where the directory file is
exceptionally large, and the file you are looking for happens to be at
the very end of the directory file, the system must first read all 10
direct data blocks, then read the 11th block, which points to the
singly-indirect data blocks, then read all 256 of those data blocks.
Then it reads the 12th block in the inode to find where the blocks of
pointers are, then reads the blocks containing the pointers, then reads
the actual data blocks for the remainder of the directory file. Since a
copy of the inode is read into memory, there is no need to go back
out to the disk for the inode itself.
On the other hand, remember that the singly-indirect block contains
256 pointers. That block has to be read, then
each of the blocks it points to has to be read to check to see
if your file is there. Then you need to read the blocks that
point to the blocks that point to where the rest of your directory is.
Only then do you find out that you mistyped your file name and you have
to do it all over again.
Since the system can usually get the direct data blocks in one read, it
is best to keep the number of files in a directory at 638 or fewer. 638?
Sure. Each block can hold 64 entries, so the 10
direct data blocks can hold 640 entries. Each directory always
contains the entries . and .., therefore you can only have 638
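The arithmetic behind 638 can be checked the same way. This sketch assumes 16-byte directory entries and 1024-byte blocks, as used in this discussion, and also works out how far the 300,000-byte directory file mentioned earlier spills past the direct and singly-indirect blocks. The variable names are only illustrative.

```shell
#!/bin/sh
# Directory sizing with 16-byte entries and 1024-byte blocks (assumed figures).
BLOCK_SIZE=1024
ENTRY_SIZE=16
DIRECT_PTRS=10
PTRS_PER_BLOCK=256

entries_per_block=$((BLOCK_SIZE / ENTRY_SIZE))
direct_entries=$((entries_per_block * DIRECT_PTRS))
usable_entries=$((direct_entries - 2))      # minus "." and ".."

# The 300,000-byte directory file from the mail spool example.
dir_bytes=300000
dir_entries=$((dir_bytes / ENTRY_SIZE))
dir_blocks=$(( (dir_bytes + BLOCK_SIZE - 1) / BLOCK_SIZE ))   # round up
beyond_single=$((dir_blocks - DIRECT_PTRS - PTRS_PER_BLOCK))

echo "entries per block:             $entries_per_block"
echo "entries in direct blocks:      $direct_entries"
echo "usable entries:                $usable_entries"
echo "slots in a 300,000-byte dir:   $dir_entries"
echo "blocks in a 300,000-byte dir:  $dir_blocks"
echo "blocks reached doubly-indirect: $beyond_single"
```

A directory that large holds 18750 slots in 293 blocks, so the last 27 blocks can only be reached through the doubly indirect pointers.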
The next interesting thing is what happens when you run fsck
on your system. If the filesystem is clean, there won't be a problem.
What happens if you have a system crash and your filesystem becomes
corrupted? If during the check, fsck
finds files that are pointed to by inodes, but does not find any
reference to them in a directory, it will place them in the lost+found
directory. When each file system is created, the system
automagically creates 62 files in there and then removes them. This
leaves 62 empty directory slots. 62 files plus . and .. gives
you 64 total entries, times 16 bytes = 1024 bytes or one data block.
The reason for the lost+found
directory is that you don't want the system to be writing anything to
a filesystem that you are trying to clean. It is safe enough to be
filling in directory entries, but you don't want the system to be
creating any new files while trying to clean the filesystem.
This is what would happen if you had more than 62 "lost"
If you have a trashed filesystem and there are more than 62 lost
files, they really become lost. The system cannot handle the
additional files and has to remove them. Therefore, I think it is a
good idea to create additional entries and then remove them whenever
creating a new file system. This way you are prepared for the worst.
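A script to do this, run from within the lost+found directory of the
new filesystem, would be:
for i in a b c d e f g h i j; do
for j in a b c d e f g h i j; do
for k in a b c d e f g h i j; do
    touch $i$j$k
done; done; done
rm -f [a-j][a-j][a-j]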
This script creates 1000 files and then removes them. This takes up
about 16K for the directory file, however it allows 1000 files to
become "lost", which may be a job-saver in the future.
Make sure that the rm is
done after all the files are created, otherwise you end up creating a
file, removing it, then filling the slot with some other file. The
result is that you have fewer files than you expected.
If you look in /usr/lib/mkdev/fs (which is what actually runs when you
run mkdev fs), you see that the system does something like this for you
every time you add a filesystem. Just after you see the message:
Reserving slots in lost+found
the mkdev fs script does something very similar. The key difference is
that mkdev fs only creates 62 entries. If you wanted to create 1000
entries every time you ran mkdev fs, you could change that part of
mkdev fs to look like the above script.
Something that I always found interesting was that /bin/cp,
/bin/ln and /bin/mv
are all the same binary. That is, they are all links to each
other. When you link a file, all that needs to get done is to create
a new directory entry, fill it in with the correct inode and then
increase the link count in the inode table. Copying a file also
creates a new directory entry, but it must also write the new data
blocks to the disk.
When you move a file, something interesting happens. First,
the system creates a link to the original file. It then removes the
old file name by unlinking it. This simply clears the directory entry
by setting the inode to 0. However, once the system creates the link,
for a brief instant there are two files on your system.
When you remove a file, things get a little more complicated. We need
to not only remove the directory entry referencing the file name, but
we also need to decrease the link count in the inode table. If the
link count at this point is greater than 0, then the system knows
that there is another file on the system pointing to the same data
blocks. However, if the link count reaches 0, we then know that there
are no more directory entries pointing to that same file. The system
must then free those data blocks and make them available for other
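You can watch the link count behave this way on almost any UNIX system. The following is just a sketch and nothing SCO-specific: it works in a scratch directory created with mktemp (which I am assuming is available) and reads the link count from the second field of a long listing.

```shell
#!/bin/sh
# Watch the link count change as names are added and removed.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# The link count is the second field of "ls -l" output.
links() { ls -ld "$1" | awk '{ print $2 }'; }

echo "some data" > original
count_one=$(links original)        # a single name so far

ln original second                 # new directory entry, same inode
count_two=$(links original)

mv second renamed                  # rename: the inode is untouched
count_after_mv=$(links original)

rm original                        # one name gone, data still reachable
count_after_rm=$(links renamed)
survivor=$(cat renamed)

echo "after create: $count_one"
echo "after ln:     $count_two"
echo "after mv:     $count_after_mv"
echo "after rm:     $count_after_rm (contents: $survivor)"

cd / && rm -rf "$dir"
```

The count goes from 1 to 2 with the link, stays at 2 across the move, and drops back to 1 after the remove, at which point the data is still there under the remaining name.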
Some of you might have realized that special device files (device
nodes) do not take up any space on the hard disk. The only place they
"exist" is in the directory entry and inode table. You may
also have noticed that in the inode structure there is no entry for
the major and minor number. However, if you do a long listing of a
device node, you will see the major and minor number. Where is this
Well, since you don't have any data blocks to point to, then the 39
bytes used for the data block pointers are unused. This is exactly
where the major and minor number are stored. The first byte of the
array is the major number and the second byte is the minor number.
This is one reason why major and minor numbers cannot be larger than
As with many aspects of the system, the kernel's role in
administering and managing filesystems is wide reaching and varied.
Among its tasks is the organization of disk space within the
filesystem. This function is different, depending on what type of
filesystem you are trying to access. For example, if you are copying
files to a DOS FAT filesystem, then the kernel has to be aware that
there are different cluster sizes depending on the size of the partition.
(A cluster is a physical grouping of data blocks.)
If you have an AFS (Acer Fast Filesystem) or EAFS (Extended AFS),
then the kernel attempts to keep data in logically contiguous blocks
called clusters (on most modern hard disks, this also means
physically contiguous). By default, a cluster is 16KB, but this can be
changed when the filesystem is created by using mkfs.
When reading data off the disk, the system can read clusters rather
than single blocks. Since files are normally read sequentially, the
efficiency of each read is increased. This is because the system can
read larger chunks of data and doesn't have to go looking for them.
Therefore, the kernel can issue fewer (but larger) disk requests. If
you have a hard disk controller that does "track caching"
(storing previously read tracks), you improve your read efficiency
However, the number of files may eventually grow to the point where
storing data in 16KB chunks is no longer practical. If there are no
more free areas that are at least 16KB, the system would have to
begin moving things around to make a 16KB cluster available. (This is
always possible, since you can move data blocks from other files.)
However, this would waste more time than would be gained by maintaining
the 16KB cluster and is therefore not practical. Instead, these chunks
will need to be split up. As the file system gets fuller, the amount
the chunks are split up (called fragmentation) increases. Therefore,
the system ends up having to move to different places on the disk to
find the data. Because of this, the kernel ends up sending multiple
requests, slowing down the disk reads even further.
Figure 0-4 Disk
The kernel is also responsible for the security of the files
themselves. Because SCO UNIX is a multi-user system, it is important
to ensure that users only have access to the files that they should
have access to. This access is on a per file basis in the form of the
permissions set on each file. Based on several discussions we've had
so far, we know that these permissions tell us who can read, write or
execute our files. It is the kernel that makes this determination.
The kernel also imposes the rule that only the owner or the
all-powerful root may change the permissions or ownership of a file.
Allocation of disk blocks is dependent upon organization of what is
called the freelist. When a file is opened for the first time,
its inode is read into the kernel's generic inode table. This is a
"generic" table as it is valid for all filesystems.
Therefore, on subsequent reads and writes this information is already
available and the kernel does not have to make an additional disk
read to get the inode information. Remember, it is the inode that
contains the pointers to the actual data blocks. If this information
were not kept in the kernel, every time the file was accessed
this information would need to be read from the hard disk.
Keep in mind that if you have a process that is reading or writing to
the disk, it is the kernel that does the actual disk access. This is
done through the filesystem and hard disk drivers. Every time the
kernel does a read of a block device, the kernel first checks the
buffer cache to see if the data already exists there. If so, then the
kernel has saved itself a disk read. Obviously if it's not there, the
kernel must read it from the hard disk.
At first this seems like a waste of time. I mean, checking one place
and then checking another. Every single read checks the buffer cache
first. So, in many cases, this is wasted time. True. However, the
buffer cache is in RAM, which can be several hundred times
faster than accessing the hard disk. As a result of the principle of
locality, your process (and the kernel as well) will probably be
accessing the same data over and over again. Therefore, the existence
of the buffer cache is actually a great time saver, since the number
of times it finds something in the cache (the hit ratio) is so high.
When writing a file (or parts of a file), the data is first written
to the buffer cache. If it remains unchanged for a specific period of
time (defined by the BDFLUSHR kernel parameter), the data is then
written to the disk. This also saves time because if data is written
to the disk, then changed before it is read again, you've wasted a disk
write. However, if it stays in the buffer cache forever (or until the
file is closed, the process terminates, etc.) then you run the risk of
losing data if the system crashes. Therefore, BDFLUSHR
is set to a reasonable default of 30 seconds.
As I mentioned a moment ago, when a file is first opened, its inode
is read into the kernel's generic inode table (assuming it is not
already there). This table is the same no matter what kind of file
system you have (S51K, AFS, etc). The structure of this table is
defined in <sys/inode.h>.
The size of this is configurable in ODT 3.0 with the kernel parameter
The entries in the generic inode table are linked into hash queues. A
hash queue is basically a set of linked lists. Which list a
particular inode will go into depends on its value. This
speeds things up, since the kernel does not have to search the entire
inode table, but can immediately jump to the relatively smaller hash
queue. The more hash queues there are (defined by the NHINODE kernel
parameter) the faster things are read since each queue has fewer
entries. However, the more queues there are, the more space in memory
is required and the less room there is for other things. Therefore, you
need to weigh one against the other.
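To get a feel for the trade-off, here is a small sketch. Be aware that the hash function (inode number modulo the number of queues) is purely my assumption for illustration, as are the values chosen for NHINODE and the table size. The real kernel may well do it differently.

```shell
#!/bin/sh
NHINODE=128     # hypothetical number of hash queues
NINODE=1024     # hypothetical number of slots in the generic inode table

# Assumed hash function: inode number modulo the number of queues.
hash_queue() {
    echo $(( $1 % NHINODE ))
}

for inum in 988 1979; do
    echo "inode $inum -> hash queue $(hash_queue "$inum")"
done

# With a full table, this is how many entries share a queue on average.
avg_chain=$((NINODE / NHINODE))
echo "average queue length with a full table: $avg_chain"
```

With these made-up numbers, a lookup walks a list of about 8 entries instead of searching all 1024 slots. Doubling the number of queues halves the average list, at the price of more list heads in memory.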
Since there is normally no pattern as to which files are removed from
the inode table and when, the free slots in the table are spread
throughout the table randomly. Free entries in the generic table are
linked onto the freelist so new inodes may be allocated quickly.
One advantage that SCO UNIX provides is the ability to access
different kinds of filesystems. Because of this, the kernel must also
keep track of filesystem specific information, such as that contained
in the inode table. This information is also kept in a kernel
internal table, based on the filesystem. The System V dependent inode
data structure is defined in <sys/fs/s5inode.h>,
and is used by S51K, AFS and EAFS. Other inode tables exist for
High-Sierra and DOS. Each time a file is opened an entry is allocated
in both the generic and the System V dependent inode table (unless
already in memory). The information contained in these inode tables is
going to be different, depending on what kind of filesystem you are
When a process wants to access a file, it does so using a system call
such as open() or
write(). When first writing the code for a program, the system
calls that programmers normally use are the same no matter what the
file system type. When the process is running and makes one of these
system calls, the kernel maps that system call to operations
appropriate for that type of FS. This is necessary since the way a
file is accessed under DOS, for example, is different than under
EAFS. The mapping information is maintained in a table, one per file
system and is constructed during a relink from information in
/etc/conf/sfsys.d. The kernel then accesses the correct entry
in the table by using the FS type as an index into the fstypesw[
Another table used by the kernel to keep track of open files is the
file table. This allows many processes to share the same
inode information, and is defined in <sys/file.h>.
Because it is often the case that multiple processes have the same file
open, this saves the kernel time, by not having to look up inode
information for each process individually. Once a file is open and is
in the file table, the kernel does not have to re-read the inode
Translation from file descriptions to data
By the time the kernel actually has the inode number of a file that
you are working with, it has gone through three different reference
points. First, there is the uarea of your process that has the
translation from your personal file descriptors to the entry in the
file table. Next, the file table has the references that point the
kernel to the appropriate slot in its generic inode table. Last, the
generic inode table has the pointers to the file system specific
inode table. At first, this may seem like a lot of work. However,
keep in mind that this is all in RAM. Without this mechanism, the
kernel would have to go back and forth to the disk all the time.
The open() system call is implemented internally as a call to the
namei() function. This is the name-to-inode conversion. Namei()
sets up both the generic inode table entry and the filesystem dependent
inode table entry. It returns a pointer to the generic table entry.
Open() then calls another function, the routine falloc(),
which sets up an entry in the file table to point to the inode in
the generic table.
The kernel then calls the ufalloc()
routine, which sets up a file pointer in the process's uarea to point
to the file table entry set up by falloc().
Finally, the return value of open()
is the index into the file pointer array, known as the file descriptor.
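You can see the difference between a file descriptor and a file table entry from the shell. This sketch relies on standard UNIX behavior rather than anything specific to SCO: a duplicated descriptor refers to the same file table entry (and so shares the read offset), while a second open of the same file gets its own entry even though the inode is the same. The file name and contents are made up, and mktemp is assumed to be available.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
printf 'line1\nline2\nline3\n' > "$dir/demo"

exec 3< "$dir/demo"     # open(): new file table entry, offset 0
exec 4<&3               # dup: second descriptor, same file table entry
exec 5< "$dir/demo"     # second open(): separate entry, same inode

read -r a <&3           # advances the offset shared by 3 and 4
read -r b <&4           # continues where descriptor 3 left off
read -r c <&5           # has its own offset, so starts at the top

echo "fd3=$a fd4=$b fd5=$c"

exec 3<&- 4<&- 5<&-     # close all three descriptors
rm -rf "$dir"
```

Descriptors 3 and 4 work their way through the file together, while descriptor 5 starts again from the first line.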
The function of namei() is
a bit more complicated than just converting a filename to an inode
number. This seems like a simple thing to say, but in practice, there
is a lot more to it. Namei()
converts the filenames to inodes (not to inode numbers). Obviously it
must first get the inode number, but that is a relatively easy chore,
since it is contained within the directory entry of the file.
In order to find out what inode table to read, namei()
needs to know on which filesystem a file resides. Simply reading the
inode number from the directory entry is not enough. As we talked about
before, two completely different files can have the same inode number
provided they are on different file systems. Therefore, even
though namei() has the
inode number, it still does not know which inode table to read.
In order to find the filesystem, namei()
needs to have a complete pathname to the file. A UNIX pathname
consists of zero or more directories, separated by '/' and terminated
by the filename. The total path length cannot be more than 1024
characters. Assuming there is no directory name mentioned when the
file is opened (or only a relative path), namei() has to backtrack a
little to get back up to the top of the directory tree.
If not already in memory, the inode corresponding to the first
directory in the pathname is read into memory. The directory file is
read into memory and the inode/filename pairs are searched for the
next directory component. The next directory is read in and the
process continues until the actual file is reached. We now have the
inode of the file.
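The component-by-component walk can be imitated from user level. The sketch below is only an illustration of the idea, not how namei() is written: the function walk_path is my own invention, it handles absolute paths only, and it uses ls -di to show the inode number found at each step. The scratch tree is created with mktemp, which I am assuming is available.

```shell
#!/bin/sh
# Print the inode number of every directory visited while resolving an
# absolute path, one component at a time.
walk_path() {
    current=/
    ls -di "$current"
    old_ifs=$IFS
    IFS=/
    set -f                      # no filename globbing while splitting
    for part in $1; do
        [ -n "$part" ] || continue
        current=${current%/}/$part
        ls -di "$current"
    done
    set +f
    IFS=$old_ifs
}

# Build a small scratch tree and walk down to the bottom of it.
scratch=$(mktemp -d) || exit 1
mkdir -p "$scratch/letters/1998"
walk_output=$(walk_path "$scratch/letters/1998")
echo "$walk_output"
rm -rf "$scratch"
```

Each line of output is one directory lookup. In the kernel, every one of those lookups can mean a disk read unless the directory is already cached.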
With relative paths or no paths at all, we have to backtrack. That
is, in order to find the root directory of the filesystem we are on,
we have to find the parent directory of our file, then its
parent and so on until we reach the root.
Looking at this, we see the pathname to inode conversion is time
consuming. Each time a new directory is read, there must be a read of
the hard disk. In order to speed up things, SCO UNIX caches the
directories. The size of the cache is set by the s5cachent kernel
tunable parameter and the entries are defined in <sys/fs/s5inode.h>.
Whenever the kernel searches for a component of the file name, it
checks the correct hash queue. In ODT 3.0 the s5cachent structures
can't hold more than 14 characters. Therefore, for the long file
names possible with EAFS, the kernel must go directly to the disk.
Cache hits and misses are recorded and can be retrieved with sar and
In a S51K (Traditional UNIX) filesystem, the superblock contains a
list of both free blocks and free inodes. There is room for 50 blocks
in the free block list and 100 inodes in the free inode list. The
structure of the superblock is found in <sys/fs/s5filsys.h>.
When creating a new file, the system examines the array of free
inode numbers in the superblock and the next free inode number is
assigned. Since this list only has 100 entries, they will all
eventually get used up. If the number of free inodes in the list drops
to zero, the list is filled in with another 100 from the disk. If there
are ever fewer than 100 free inodes, then the unused entries are set to
0. In S51K filesystems, the list of free data blocks, the freelist, is
ordered randomly. As disk blocks are freed, they are just appended to
the end of the freelist. During allocation of data blocks, no account is
taken of the physical location of the data blocks. This means that
there is no pattern to where the files reside on the disk, which can
quickly lead to fragmentation. That is, data blocks from one file can
be scattered all over the disk.
In AFS and EAFS the freelist is held as a bitmap, where adjacent bits
in the map correspond to logically contiguous blocks. Therefore the
system can quickly search for sets of bits representing free blocks
and then allocate files in contiguous blocks. Logically contiguous
blocks (usually physically contiguous blocks) are known as a cluster.
When the filesystem is first created, the bitmap is created by mkfs.
There is 1 bit for every data block on the filesystem, so the bitmap
is a linear array which says whether a particular block contains
valid data or not. Note that this bitmap also occupies disk blocks
itself. Actually there is more than one bitmap. There are several
which are spaced at intervals of approximately 8192 blocks throughout
the filesystem. Since a block contains 1024 bytes, it contains 8192
bits and can therefore map 8192 blocks. There is also an indirect
freelist block, which holds a list of the disk block numbers which
actually contain the bitmaps.
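Here is a sketch of the bitmap arithmetic. The filesystem size and the block number are made up, and I am making the simplifying assumption that bitmap number n covers blocks n*8192 through n*8192+8191. The real on-disk layout is not spelled out here, so take the numbers as an illustration only.

```shell
#!/bin/sh
BLOCK_SIZE=1024
BITS_PER_BITMAP=$((BLOCK_SIZE * 8))     # one bitmap block maps 8192 blocks

fs_blocks=500000                        # hypothetical filesystem size in blocks
bitmap_blocks=$(( (fs_blocks + BITS_PER_BITMAP - 1) / BITS_PER_BITMAP ))

block=100000                            # hypothetical block to look up
bitmap_index=$((block / BITS_PER_BITMAP))   # which bitmap (assumed layout)
bit_offset=$((block % BITS_PER_BITMAP))     # which bit within that bitmap
byte_in_bitmap=$((bit_offset / 8))
bit_in_byte=$((bit_offset % 8))

echo "bitmaps needed for $fs_blocks blocks: $bitmap_blocks"
echo "block $block: bitmap $bitmap_index, bit $bit_offset"
echo "  (byte $byte_in_bitmap, bit $bit_in_byte within that byte)"
```

A filesystem of 500000 blocks needs 62 bitmap blocks, or about 62KB of overhead, which is not much to pay for being able to find free space quickly.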
When a file is created, the entire cluster is reserved for the file.
Although this does tend to waste a little space, it reduces
fragmentation and therefore increases speed. When the kernel reads a
block, it reads the whole cluster the file belongs to as well as the
next. This is called read ahead.
When a disk block is needed for a new file, the system searches the
bitmap for the first free block. If we later need more data blocks
for an existing file, the system begins its search starting from the
block number that was last allocated for that file. This helps to
ensure new blocks are close to existing ones. Note that when a
cluster is allocated, not all of the disk blocks may be free (maybe
it is already allocated to another file).
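The two search strategies can be sketched with a toy bitmap held in a shell string. The function first_free_from is hypothetical, and so is the convention that 0 means free. The real kernel works on bits, not characters, so this only shows the logic.

```shell
#!/bin/sh
# first_free_from MAP START
# MAP is a string of 0s and 1s, one character per block (0 = free is an
# assumption of this sketch). Prints the number of the first free block
# at or after START, or -1 if there is none.
first_free_from() {
    rest=$(printf '%s' "$1" | cut -c"$(( $2 + 1 ))"-)
    case $rest in
        *0*) used=${rest%%0*}              # allocated blocks before the hit
             echo $(( $2 + ${#used} )) ;;
        *)   echo -1 ;;
    esac
}

map=1111011100                             # blocks 0-9 of a toy filesystem

new_file=$(first_free_from "$map" 0)       # new file: search from the start
extend=$(first_free_from "$map" 5)         # existing file whose last block is 5

echo "new file gets block $new_file"
echo "file ending at block 5 grows into block $extend"
```

A new file takes the first hole it finds (block 4), while the file being extended skips that hole and gets block 8, which is the nearest free block after the one it already owns.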
The bitmapped freelist of the AFS and EAFS has some performance
advantages. First, files are typically located in contiguous disk
blocks. These can be allocated quickly from the freelist using
i80386 bit manipulation instructions. This means that free areas of
the disk can be found in just a few instruction cycles and therefore
access speeds up.
Figure 0-6 The
In addition, the freelist is held in memory. The advantage is that
this keeps the system from having to make an additional disk access
every time the system wants to write new blocks to the hard disk.
When the kernel issues an I/O request to read from a single disk block,
the AFS maps the request so that the entire cluster containing the disk
block and the following cluster are read from disk.
At the beginning of each filesystem is a filesystem-specific structure
called the superblock. You can find out about the structure of the
superblock by looking in <sys/fs/*>. The Sys V superblock is
located in the second half of the first block of the filesystem (bytes
512-1023). Since the structure is less than 512 bytes, it contains
padding to fill it out to 512 bytes. When a filesystem is first mounted,
its superblock is read into memory so updates to the superblock don't
have to constantly write to the disk.
In order for the structures on the disk to remain consistent with the
copies in memory, superblocks and inodes are updated by sync,
which is started at regular intervals by init.
The frequency of the sync
is defined by SLEEPTIME in /etc/default/boot,
with a default of 60 seconds.
There are several concepts new to OpenServer. The first is
intent logging. When this functionality is enabled, filesystem
transactions are recorded in a log and then committed
to disk. If the system goes down before the transaction is completed,
the log is replayed to complete pending transactions. This
scheme increases reliability and recovery speed since the system need
only read the log to be able to bring the filesystem to the correct
state. By using this scheme, the time spent checking the filesystem
(and repairing it if necessary) can be reduced to just a few seconds,
regardless of the filesystem size, not the several minutes that were
required previously. There is, however, a small performance
penalty since the system has to spend some time writing to the logs.
As changes are being made to any of the control structures (inodes,
superblock), the changes are written to a log. Once complete, the
transaction is marked as complete. However, if the system
should go down before the log is written, it is as if the transaction
was never started. If the log is complete, but the transaction hasn't
finished, the transaction can either be completed or ignored,
depending what fsck
considers possible. Obviously if the system goes down after the
transaction is complete, then nothing needs to be done.
The location of the log file is stored in the superblock. As a real
file it does reside somewhere on the file
system, however it is invisible to normal user-level utilities and
only becomes visible when logging is disabled.
Logging does bring up one misconception: it does not
make your data any safer. Only changes to the
control structures are logged. Data is not. The purpose here
is to reduce the time it takes to make the system operational again
should it go down.
Another new concept is checkpointing. When enabled,
the filesystem is marked as "clean" at regular intervals.
That is, the pending writes are completed, inodes are updated and, if
necessary, the in-core copy of the superblock is written to disk. At
this point the filesystem is considered clean. Should the system go
down improperly at this point, there is no need to clean the
filesystem (using fsck) as
it is already clean. However, the data is still cached in the buffer
cache, so if it is needed again soon, it is available.
If the system goes down, the contents of the buffer cache are lost,
but since they were already written to disk, no data is actually
lost. Obviously, anything not written between the last checkpoint and
the time the system goes down is lost, but checkpointing does
decrease the amount lost as well as speed up the recovery process
when the system is rebooted. Again, there is no such thing as a free
lunch and checkpointing does mean a small performance loss.
Checkpointing is turned on by default on High Throughput Filesystem
(HTFS), EAFS, AFS, and S51K filesystems.
For the best reliability and speed of recovery, it's a good idea to
have both logging and checkpointing enabled. Although they
both cause slight performance degradation, the benefits outweigh the
performance hit. In most cases, the performance loss is not noticed,
while the time required to bring the system back up is a lot shorter.
of sync-on-close for the Desktop Filesystem (DTFS) is another way of
increasing reliability. Whenever a file is closed, it is immediately
written to disk, rather than waiting for the system to write it as it
normally would (potentially 30 seconds later). If the system should
go down improperly, you have a better chance of not losing data.
Because you are not writing data to the hard disk in large chunks,
sync-on-close also degrades performance.
Because I regularly suffer from digitalus enormus (fat
fingers), I am often typing in things that I later regret. On a few
occasions, I have entered rm
commands with wild cards (*,
for example) only to find that I had an extra space before the
asterisk. As a result, I end up with a nice clean directory. Since I
am not that stupid, I built an alias so that every time I used
rm it would prompt me to
confirm the removal (rm -i).
My brother, on the other hand, created an alias where rm
copies the files into a TRASH
directory, which he needs to clean out regularly. Both of these
solutions can help you recover from accidentally erasing files.
OpenServer has added something whereby you no longer have to create
aliases or other things necessary to keep you from erasing things you
shouldn't. This is the idea of file versioning. Not
only does file versioning protect you from digitalus enormus,
but it will also make automatic copies of files for you.
In order for versioning to be used, it must be first configured in
the kernel. There are several kernel tunable parameters that are
involved. So to change them you either run the program
click on the "Tune Parameters..." button in the
Hardware/Kernel Manager. (The Hardware/Kernel Manager calls
configure). Next select
option 10 (Filesystem configuration). Here you will need to set the
MAXVDEPTH parameter, which sets the maximum number of versions
maintained, and the MINVTIME parameter, which sets the minimum time (in
seconds) between changes before a file is versioned. Setting
MAXVDEPTH to 0 disables versioning. If MINVTIME is set to 0, and
MAXVDEPTH to a non-zero value, then versioning will happen no matter
how short the time between versions. Versioning is only available for
the DTFS and HTFS.
You can also set versioning for a filesystem by using the maxvdepth
and minvtime options when mounting. These can be included in
(which defines the default behavior when mounting filesystems), or
you can specify them on the command line when mounting the filesystem
by hand. In addition to that, versioning can be set on a
per-directory basis. This is done by using the undelete command. For
command line turns on versioning for all the files in the directory
/usr/jimmo/letters as well
as any child directories. This includes existing files and
directories as well as ones created later. Note that even though the
filesystem was not mounted with either the minvtime or maxvdepth
options, you can still turn on versioning for individual directories,
as long as it is configured in the kernel. Also, using the -v
option to undelete
you can turn on versioning for single files.
When enabled, versioning is performed without the interaction of the
users. If you delete or overwrite a file, you usually don't see
anything. You can make the existing versions visible to you by
setting the SHOWVERSIONS environment variable to 1, and then
The means of storing versions is quite simple. The names are appended
with a semi-colon followed by the version of the file as in:
This would be the 12th version of the file letter
since versioning was enabled on the filesystem. Keep in mind
that this does not mean that there are 12 versions. The number of
available versions is defined by the MAXVDEPTH kernel parameter or
mount option. If higher than 12, there just might be 12 versions.
However, if set to a lower value you will see at most MAXVDEPTH
versions. Also keep in mind that you are not just maintaining a list
of changes, but rather complete copies of each file.
For example, let's assume I mounted a filesystem with the option -o
system will then save, at most, 10 versions. After I edit and save a
file for a while, the version number might be up to 12. However, I
will not be able to see or have access to versions lower than 3,
since they are removed from the system.
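The rule in this example is simple enough to write down. The function below is my own illustration of the arithmetic, not something the system provides: given the newest version number and the versioning depth, it works out the oldest version you can still expect to find.

```shell
#!/bin/sh
# oldest_version NEWEST MAXVDEPTH
# Prints the lowest version number still kept, assuming the system keeps
# the MAXVDEPTH most recent versions (hypothetical helper for illustration).
oldest_version() {
    oldest=$(( $1 - $2 + 1 ))
    if [ "$oldest" -lt 1 ]; then
        oldest=1                # fewer versions exist than the limit allows
    fi
    echo "$oldest"
}

newest=12
maxvdepth=10
first=$(oldest_version "$newest" "$maxvdepth")
echo "versions kept: letter;$first through letter;$newest"
```

With a depth of 10 and a newest version of 12, the oldest survivor is version 3, which matches the example: versions 1 and 2 are gone.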
Different file versions can be accessed not only when making copies
or changes to existing files, but also when you remove them. Assume
you have the three latest versions of a letter (letter;10,
letter;11 and letter;12)
as well as the current version letter.
If you remove letter,
the three previous versions still exist. These can be seen by using
the -l (list) option to
undelete, either by
specifying the file explicitly as in:
or if you leave off the file name, you will see all versions of all
files. To undelete a versioned file or make the previous version the
current one, simply leave off the options. If you repeatedly use
with just the file name, you can back up and make ever older
versions the current one. Or, to make things easier, simply copy the
older version to the current one, as in:
cp letter\;8 letter
This will make version 8 the current one. (NOTE: The \ is necessary
to remove the special meaning of the semi-colon.)
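If you want to convince yourself about the quoting without a versioning filesystem at hand, you can fake it. In this sketch the file letter;8 is just an ordinary file I create by hand to stand in for a saved version, in a scratch directory made with mktemp (assumed to be available). Nothing here uses the versioning feature itself.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

echo "old text" > 'letter;8'    # stand-in for a saved version
echo "new text" > letter        # the "current" file

# The backslash keeps the shell from treating ';' as the end of the
# command. Without it, the shell would run "cp letter" and then try to
# run a command called "8".
cp letter\;8 letter
restored=$(cat letter)

# Single quotes around the name work just as well.
cp 'letter;8' letter
restored_quoted=$(cat letter)

echo "letter now contains: $restored"

cd / && rm -rf "$dir"
```

Either form of quoting hands the whole name, semicolon included, to cp as a single argument.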
With the first shipping version of OpenServer, there are some
"issues" with versioning in that it does not behave as
expected. One of the first things I noticed was that changing the
kernel parameters MAXVDEPTH and MINVTIME does not turn on versioning.
Instead, they allow versioning to be turned on. Without them, you
can't get versioning to work at all. When versioning is enabled, you
still need to use undelete -s on the directory.
There is more to it than that. However, I don't want to repeat too
much information that's in the manuals. Therefore, take a look at the
other changes that have been made to the system. There is the
introduction of a couple new filesystem types as well as adding new
features to the old filesystems. Table 0.4 contains an overview of
some of the more significant aspects of the filesystems.
Max. fs size
Max. file size
New functionality in
Symbolic links in
Fast filesys. check
Lazy block list
Table 0.4 Filesystem Characteristics
High Throughput Filesystem
New to OpenServer is the introduction of a new filesystem
device driver: ht. This new driver can handle filesystems with 16-bit
inodes like S51K, AFS and EAFS, but also the new HTFS which can
handle 32-bit inodes. Although (as of this writing) you cannot boot
from an HTFS, it does provide some important performance and
One area that was changed is the total amount of information that can
be stored on a single HTFS, as well as the total number of inodes that
can be used. Table 0.4 contains a comparison of the various filesystem
types and just how much data they can access.
Another new feature of the ht driver is lazy block evaluation.
Previously, when a process was started with the exec() system call,
the system would build a full list of the blocks that made up that
program. This delayed the actual start-up of the process, but saved
time as the program ran. Since a program spends most of its time
executing the same instructions, much of the program is not used.
That is, many of the blocks end up never being referenced. What lazy
block evaluation does is to build this list of blocks only as they
are needed. This speeds up the start-up of the process at the cost of
small delays when a previously unreferenced block is first accessed.
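The idea of building the block list only on demand can be sketched in a few lines of Python. This is only an illustration of the principle and not the ht driver's code: the class name, the resolver function and the block numbers are all made up for the example.

```python
class LazyBlockList:
    """Resolve a program's disk blocks only when they are first referenced."""

    def __init__(self, resolve_block):
        self._resolve = resolve_block  # hypothetical (and possibly slow) lookup
        self._known = {}               # logical block -> disk block found so far
        self.lookups = 0               # how many real lookups were needed

    def block_for(self, logical_block):
        # The small delay happens only the first time a block is touched.
        if logical_block not in self._known:
            self.lookups += 1
            self._known[logical_block] = self._resolve(logical_block)
        return self._known[logical_block]


def fake_resolver(logical_block):
    # Stand-in for walking the inode's block pointers on disk.
    return 5000 + logical_block


blocks = LazyBlockList(fake_resolver)
# A program that keeps executing the same few blocks over and over.
for logical in (0, 1, 0, 1, 0, 7):
    blocks.block_for(logical)
print(blocks.lookups)  # only the three distinct blocks were ever resolved
```

Blocks that are never referenced never cost a lookup, which is where the faster start-up comes from.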
Another gain is through "transaction based" processing of
the filesystem. As changes occur on the filesystem, they are gathered
together in what is called an intent log, which we talked
about earlier. If the system stops improperly, the system can use the
intent log to make a determination of how to proceed. Since you only
need to check the log in order to clean the filesystem, recovery is
quicker and also more reliable.
Another mechanism used to increase throughput is to disable
checkpointing. This way, the filesystem will spend all of its time
processing requests rather than updating the filesystem structures.
Although this increases throughput, you obviously have the
disadvantage of potentially losing data.
When dealing with aspects of the system like the print spooler or the
mail system, where jobs are batch processed, at any given moment it
is less likely that data is being processed. Therefore you do not
need the extra overhead of checkpointing.
This is done by treating the filesystem as "temporary".
Such filesystems are mounted with the -o tmp option. Although
checkpointing is new to OpenServer, you can configure both AFS and
EAFS filesystems as temporary. Keep in mind that certain applications
like vi provide their own recovery mechanism by saving the data at
regular intervals. If the files have been written by vi, but not yet
written to disk, a system crash could lose the last update.
When I described the directory structure I mentioned that each inode
number was represented by two bytes. This allows only for 64K worth of
inodes. Since the HTFS can access 2^27 inodes and the DTFS
can access 2^31, there needs to be some other format used
in the directories. With the two new filesystems, the key word is
"extensible." This means that the structure can be extended
as the requirement changes. This allows much more efficient handling
of long file names, as compared to the EAFS. In most cases, the
filesystem driver is capable of making the translation for
applications that don't understand the new concepts. However, if the
application reads or writes the directory directly, you may need a
newer version of the application.
The two new filesystems, HTFS and DTFS, can save space by storing
symbolic links in the inode. If the path of the symbolic link is 108
characters or less, the DTFS will store the path within the inode and
not in a data block on the disk. For the HTFS, this limit is 52
characters. First, this saves space as no data blocks are needed, but
it also saves time since once you read the inode from the inode
table, you have the path and do not need to access the disk again.
There are two issues to keep in mind. If you use relative paths
instead of absolute paths, then you may end up with a shorter path
that fits into the inode. This saves time when accessing the link. On
the other hand, think back to our discussion on symbolic links. The
behavior of each shell when crossing the links is different. If you
fail to take this into account, you may end up somewhere other than
you expect.
One of the problems that the advances in SCO OpenServer
brought with them is the increased amount of hard disk space required
to install it. On large servers with several gigabytes of space, this
is less of an issue. However, on smaller desktop workstations this
can become a significant problem.
Operating systems have been dealing with this issue for years. MS-DOS
provides a potential solution in the form of its DoubleSpace disk
compression program. Realizing the need for such space savings, SCO
OpenServer provides a solution in the form of the new DTFS. Among
the issues that need to be addressed is not only the saving of space,
but also the reliability of the data and avoiding any performance
degradation that occurs when compressed files need to be
uncompressed. On fast CPUs with fast hard disks, the performance hit
because of the compression is noticeable.
The first issue (saving space) is addressed by the DTFS in a couple
of ways. The first is that files are compressed before they are
written to the hard disk. This can save anywhere from just a few
percent in the case of binary programs to over 50% for text files.
What you get will depend on the data you are storing.
The second way space is saved is in how inodes are stored on the
disk. With "traditional" filesystems such as S51K or EAFS,
inodes are pre-allocated. That is, when the filesystem is first
created, a table for the inodes is allocated at the beginning of the
filesystem. This is a consistent size no matter how many or how few
inodes are actually used. Inodes on a DTFS are allocated as needed.
Therefore, there are only as many inodes as there are files.
As a result you never have any empty slots in the inode table.
(Actually there is no inode table in the form we discussed for other
filesystems. We'll get to this in a moment.) In order to distinguish
these inodes from others, inodes on the DTFS are referred to as dtnodes.
Figure 0-7 The
The DTFS has many of the same features as the EAFS filesystem, such
as file names up to 255 characters and symbolic links. In addition,
the DTFS also has multiple compression algorithms, greater
reliability (through the integrated kernel update daemon, which
attempts to keep the filesystem in a stable state), and a dynamic
block allocation algorithm that can automatically switch between
best-fit and first-fit. Best-fit is where the system looks for an
appropriately sized spot on the hard disk for the file and first-fit
is where the system looks for the first one that is large enough
(even if it is much larger than necessary).
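To make the difference between the two strategies concrete, here is a minimal Python sketch. It assumes the free space is kept as a simple list of (start block, length) pairs, which is an invention for the example and says nothing about how the DTFS really tracks free space.

```python
def first_fit(free_extents, needed):
    """Return the start of the first free extent that is large enough."""
    for start, length in free_extents:
        if length >= needed:
            return start
    return None


def best_fit(free_extents, needed):
    """Return the start of the smallest free extent that is large enough."""
    candidates = [(length, start) for start, length in free_extents
                  if length >= needed]
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical free list: (start block, length in blocks).
free_extents = [(100, 50), (400, 12), (900, 10)]

print(first_fit(free_extents, 10))  # 100: first hit, although 40 blocks too big
print(best_fit(free_extents, 10))   # 900: exact fit, but the whole list was scanned
```

First-fit stops at the first candidate, so it is quick but may waste a large hole on a small file. Best-fit has to look at every candidate before it can decide.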
As one might expect, the disk layout is different from other
filesystems. The first block (block 0) was historically the "boot
block" and has been retained for compatibility purposes. The
second block (block 1) is the superblock and like other filesystems
it contains global information about the filesystem.
Following the superblock is the block bitmap. There is one bit for
each 512-byte data block in the filesystem, so the size of the bitmap
will vary depending on the size of the filesystem. If the bit is on
(1), the block is free, otherwise the block is allocated.
The block bitmap is followed by the dtnode bitmap. Its size is the
same as the block bitmap since there is also one bit for each block.
The difference is that these bits determine if the corresponding
block contains data or dtnodes. A 1 indicates the block contains
dtnodes and a 0 indicates data. Following these two bitmaps are the
actual data and dtnode blocks. Since the dtnodes are scattered
throughout the filesystem, there is no inode table.
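The way the two bitmaps work together can be sketched as follows. The meaning of the bits (1 means free in the block bitmap, 1 means dtnode in the dtnode bitmap) comes from the description above. The order of the bits within each byte and the helper names are assumptions made for the example.

```python
def bit_is_set(bitmap, block):
    # Assumed layout: bit 0 of byte 0 describes block 0, and so on.
    return bool((bitmap[block // 8] >> (block % 8)) & 1)


def set_bit(bitmap, block):
    bitmap[block // 8] |= 1 << (block % 8)


def describe(block_bitmap, dtnode_bitmap, block):
    """Classify a block the way the two bitmaps describe it."""
    if bit_is_set(block_bitmap, block):
        return "free"    # a 1 in the block bitmap means the block is free
    if bit_is_set(dtnode_bitmap, block):
        return "dtnode"  # a 1 in the dtnode bitmap means the block holds dtnodes
    return "data"


total_blocks = 2048
block_bitmap = bytearray(total_blocks // 8)   # one bit per block: 256 bytes
dtnode_bitmap = bytearray(total_blocks // 8)  # same size as the block bitmap

set_bit(block_bitmap, 10)     # block 10 is free
set_bit(dtnode_bitmap, 1256)  # block 1256 holds a dtnode

for block in (10, 11, 1256):
    print(block, describe(block_bitmap, dtnode_bitmap, block))
```

Since both bitmaps have one bit per block, a 2048-block filesystem needs only 256 bytes for each of them.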
Unlike the inodes of other filesystems, dtnodes are not pre-allocated
when the filesystem is created. Instead, they are allocated at the
same time as the corresponding file. This has the potential for
saving a great deal of space since every dtnode points to data in
contrast to other filesystems where inodes may go unused and
therefore the space they occupy is wasted.
The translation from dtnode number to block number is straightforward.
The dtnode number is the same as the number of the block that the
dtnode resides in. For example, if block 1256 was a dtnode, then that
dtnode number would be 1256. This means that since not all blocks
contain dtnodes, not all dtnode numbers are used. The one exception to
this is that the dtnode number of the root of the filesystem is stored
in the superblock. Each dtnode is accessed through the dtnode map.
The contents of the superblock are found in the file located in
<sys/fs/ >. If you
take a quick look at it you see several important pieces of
information. One of the most important ones is the size of the
filesystem. Many of the other parameters included in this structure
can be calculated from this value. These include the root dtnode
number, start of the bitmaps, the start of the data blocks, as well
as the number of free blocks. Although the values can be calculated,
storing them in the superblock as well saves time.
As I mentioned earlier, the block size of the DTFS varies in
increments of 512 bytes between 512 and 4096. The reason for the
range is that empirical studies have shown that filesystem throughput
increases as the block size increases. However, in an effort to save
space (a primary consideration in the DTFS), smaller block sizes were
Before being written to the disk, regular files are compressed
using one of two algorithms (one being "no compression").
Because of this compression, it is no longer possible to directly
calculate a physical block on the hard disk based on the offset in
the file. For example, let's assume a file that begins at block 142
of the filesystem. On a non-compressed filesystem, we could easily
find byte 712 since block 0 of the file contains bytes 0-511, and
block 1 contains bytes 512-1023. Therefore, byte 712 is in block 1 of
the file, which is block 143 of the filesystem.
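On an uncompressed filesystem this calculation is a single division. The sketch below assumes 512-byte blocks and a file stored in consecutive blocks, as in the example. The function name is made up.

```python
BLOCK_SIZE = 512  # bytes per block, as in the example


def fs_block_for_offset(first_fs_block, byte_offset, block_size=BLOCK_SIZE):
    """Filesystem block holding byte_offset of an uncompressed file.

    Assumes the file occupies consecutive blocks starting at first_fs_block.
    """
    return first_fs_block + byte_offset // block_size


print(fs_block_for_offset(142, 0))    # 142: bytes 0-511 are in the first block
print(fs_block_for_offset(142, 712))  # 143: bytes 512-1023 are in the second
```

Once the data is compressed, each block holds a different amount of the original file, so the division no longer tells you anything.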
However, if we have a compressed filesystem, there is no immediate
way of knowing whether the compression is sufficient to place byte 712
into block 142, or whether it is still in block 143. We could start at
the beginning of the file and calculate how much uncompressed data is
in each block. Although this would eventually give us the correct
block, the amount of time spent doing the calculations more than
eliminates any advantages gained by the compression.
In order to solve this problem, the data on the hard disk is
maintained in a structure called a B-tree. Without turning this book
into a discussion on programming techniques, it is sufficient to say
that the basic principle of a B-tree forces it to be balanced,
therefore the depth of one leaf node is at most one level away from
the depth of any other leaf node.
Conceptually the B-tree works like this: Let's assume a block 'a'
is the root node. The block offset of every data block that is on the
left hand branch of 'a' is smaller than the block offset in 'a'. Also,
the block offset of every data block that is on the right hand branch
of 'a' is larger than the block offset in 'a'.
This then applies to all subsequent blocks, where the left hand
branch is smaller and the right hand branch is larger.
In order to find a particular offset in the file you start at the top
of the tree and work down. If the block offset is less than the root,
you go down the left hand branch. Likewise, if the block offset is
greater you go down the right hand branch. Although you still have to
traverse the tree, the amount you have to search is far less than a
pure linear search. Each node has a pointer to both the previous and
the next nodes. This allows traversal of the tree in both directions.
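The search just described can be sketched in Python. To keep it short, the sketch uses a plain binary tree with one entry per node rather than a real B-tree, and the numbers are invented: each node records which range of uncompressed bytes its disk block holds.

```python
class Node:
    """One compressed block: the uncompressed byte range it covers."""

    def __init__(self, start, length, disk_block, left=None, right=None):
        self.start = start            # first uncompressed byte in this block
        self.length = length          # number of uncompressed bytes it holds
        self.disk_block = disk_block  # where the block lives on the disk
        self.left = left              # blocks with smaller offsets
        self.right = right            # blocks with larger offsets


def find_block(node, offset):
    """Walk down from the root; return (disk block, nodes visited)."""
    visited = 0
    while node is not None:
        visited += 1
        if offset < node.start:
            node = node.left                       # smaller: left hand branch
        elif offset >= node.start + node.length:
            node = node.right                      # larger: right hand branch
        else:
            return node.disk_block, visited
    return None, visited                           # offset is past end of file


# Hypothetical file: compression packed 900 bytes into block 142,
# 1100 bytes into block 143 and 600 bytes into block 144.
root = Node(900, 1100, 143,
            left=Node(0, 900, 142),
            right=Node(2000, 600, 144))

print(find_block(root, 712))   # byte 712 ended up in block 142 after all
print(find_block(root, 2100))
```

With these made-up numbers, byte 712 is found in block 142 after visiting two nodes. A linear scan would find the same block, but on a large file the tree visits only a handful of nodes.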
Regular files are the only ones that are compressed. Although
supported, symbolic links and device nodes are left as they are,
since you don't save any space. If a symbolic link is smaller
than 192 bytes, the name is actually stored within the dtnode. The
size of blocks containing directories is fixed at 512 since
directories are typically small. Long names are allowed on the DTFS
up to a maximum of 255 characters (plus the terminating NULL). One
interesting aspect is the layout of the directory structure. This is
substantially different than on the (E)AFS. Among other things there
are entries for the size of the filename and size of the directory
The DTFS has several built in features that provide certain
protections. The first is a technique called "shadow paging."
When data is about to be modified, extra blocks are allocated that
"shadow" the blocks that are going to be changed. The
changes are then made to this shadow. Once the change is complete,
the changed blocks "replace" the previous blocks in the
tree and the old blocks are then freed up. This is also how the
dtnode blocks are modified except that the shadow is contained within
the same physical block.
If something should happen before the new, changed block replaces the
old one, then it is as if the change was never started. This is
because the file has no knowledge of anything ever happening to it.
This is unlike an EAFS, AFS or other "traditional" UNIX filesystem,
where changes are made to blocks that are already a part of the file.
There, if the system should go down in the middle of writing, then the
data is at best inconsistent and at worst trashed. Obviously, in both
cases, if something happens after the changes are complete, the file
remains unaffected.
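A toy model in Python shows why an interrupted update leaves the file untouched. The class and the crash_before_commit switch are inventions for the example. The point is only that the file's block map changes in a single step after the shadow has been written.

```python
class ShadowedFile:
    """Toy model of shadow paging: write to a copy, then switch over."""

    def __init__(self, blocks):
        self.storage = dict(enumerate(blocks))  # block number -> contents
        self.block_map = list(self.storage)     # the file's view of its blocks
        self._next_free = len(blocks)

    def read(self):
        return [self.storage[number] for number in self.block_map]

    def write(self, index, data, crash_before_commit=False):
        shadow = self._next_free
        self._next_free += 1
        self.storage[shadow] = data      # the change goes to the shadow block
        if crash_before_commit:
            return                       # the file never learns of the shadow
        old = self.block_map[index]
        self.block_map[index] = shadow   # one switch makes the change visible
        del self.storage[old]            # the old block is freed


letter = ShadowedFile(["aaa", "bbb"])

letter.write(1, "BBB", crash_before_commit=True)
print(letter.read())  # still the old contents, as if nothing had started

letter.write(1, "BBB")
print(letter.read())  # the shadow has replaced the old block
```

In the crash case the half-finished shadow block is simply an unused block. Nothing the file points to was ever modified.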
Updating blocks on a DTFS
Also unique to the DTFS is the way the dtnodes are updated. If you
look at the structure and count up the number of bytes, you find that
the amount of data that each dtnode takes up is less than half the
size of the block (512 bytes). The other half is used as the shadow
for that dtnode. When it gets updated, the other half is written to
first. Only after the information is safely written does the new half
become "active". Here again, if the system crashed before
the transaction was complete, then it would appear as if nothing was
ever started. By comparing the timestamp we can tell which half is
Other than saving space, there is another reason for splitting the
block in half. Remember that the dtnode points to the nodes that are
both above it and below it in the tree. Assume we didn't shadow
the dtnode. When one dtnode gets updated, it would get replaced by a
new node. Now the nodes above it and below it need to be modified to
point to this new node. In order to update them, we have to copy them
to new blocks as well. Now, the nodes pointing to these blocks need
to get updated. This "ripples" in both directions until the
entire tree is updated. Quite a waste of time.
Another technique used to increase the reliability is the update
daemon (htepi_daemon). Once per second the update daemon checks each
filesystem. If the update daemon writes out all the data to that
filesystem before another process writes to that filesystem, the
update daemon can write out the superblock as well and can then mark
the filesystem as clean. If the system were to crash before another
process made a write to that filesystem, then it would still be clean
and therefore no fsck would be necessary.
Built into the dtnode is also a pointer to the parent directory of
that dtnode. This has a very significant advantage when the system
crashes and the directory entry for a particular file gets trashed. In
traditional SCO filesystems, if this happened, there would be files
without names and when fsck ran, they would be placed in lost+found.
Now, since each file knows who its parent is, the directory structure
can easily be rebuilt. That's why there is no more lost+found
High-Performance Pipe System
In the section on operating system basics I introduced the
concept of a pipe. We all (hopefully) know it through the many
commands we have seen in this book. For example, if I want to see the
long listing of some directory a screen at a time, I can issue the command:
ls -l | more
As I mentioned, there are actually data blocks taken up on the hard
disk to store the data as the system is waiting for the receiving
side to read it. For all intents and purposes this is a real file. It
contains data (usually) and it has an inode. The only difference is
that, unless it is a named pipe, it has no entry in any directory and
therefore no file name. When the system goes down by accident and
cannot close the pipes, fsck
will report them as unreferenced files. This is very disconcerting to
many users as they see a long list of unreferenced files when fsck
runs after a crash.
This represents only one of the problems existing with traditional
pipes. The other is the fact that these pipes exist on the hard disk.
When the first process writes to the disk, there is a disk access.
When the second process reads the disk, there is a disk access. Since
disk access is a bottleneck on most systems, this can be a problem.
(NOTE: This ignores the existence of the buffer cache. However, if
sufficient time passes between the write and subsequent read, then
the buffer cache will no longer contain the data and two disk
accesses are necessary.)
SCO OpenServer has done something to correct that. This is the High
Performance Pipe System (HPPS). The primary difference between the
HPPS and conventional pipes is that the HPPS pipes no longer exist on
the hard disk. Instead, they are maintained solely within the kernel as
buffers. This corrects the two previously discussed disadvantages of
conventional pipes. First, when the system goes down, the pipes
simply disappear. Second, since there is no disk interaction, there
is never any performance slow-down as a result.
Like traditional pipes, when an HPPS pipe is created an inode is
created with it. This inode contains the necessary information to
administer that pipe.
One of the major additions to OpenServer is the idea of
"virtual disks". These can come in many forms and sizes,
each providing its own special benefits and advantages. To the
running program (whether it is an application or system command),
these disks appear like any other. As a user, the only difference you
see is perhaps in the performance improvements that some of these
virtual disks can yield.
There are several different kinds of virtual disks which can be used
depending on your needs. For example, you may be running a database
that requires more contiguous space than you have on any one drive.
Pieces of different drives can be configured into a single,
larger drive. If you need a quicker way of recovering from a hard
disk crash, you can mirror your disks, where one disk is an
exact copy of the other. This also increases performance since you
can read from either disk. Performance can also be increased by
striping your disks. This is where portions of the logical
disk are spread across multiple physical disks. Data can be written
to and read from the disks in parallel, thereby increasing performance.
Some of these can even be combined.
Underlying many of the virtual disk types is the concept of RAID.
RAID is an acronym for Redundant Array of Inexpensive Disks.
Originally, the idea was that you would get better performance and
reliability from several, less expensive drives linked together than
you would from a single, more expensive drive. The key change in the
entire concept is that hard disk prices have dropped so dramatically
that RAID is no longer concerned with inexpensive drives. So much so,
that the I in RAID is often interpreted as meaning "Intelligent",
rather than "Inexpensive."
In the original paper that defined RAID, there were five levels.
Since that paper was written, the concept has been expanded and
revised. In some cases, characteristics of the original levels are
combined to form new levels.
Two concepts are key to understanding RAID. These are redundancy
and parity. The concept of parity is no different than that
used in serial communication, except for the fact that the parity in
a RAID system can be used to not only detect errors, but correct
them. This is because more than just a single bit is used per byte of
data. The parity information is stored on a drive separate from the
data. When an error is detected, the information is used from the
good drives, plus the parity information to correct the error. It is
also possible to have an entire drive fail completely and still be
able to continue working. Usually the drive can be replaced and the
information on it rebuilt even while the system is running.
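How parity lets you rebuild a failed drive can be shown with a small Python example. It uses a byte-wise exclusive OR across the drives, which is one common way of computing RAID parity. The text does not say which scheme the SCO driver uses, so treat this as an illustration of the principle with made-up data.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives, each holding one (tiny) block of the same size.
data = [b"UNIX", b"disk", b"RAID"]
parity = xor_blocks(data)           # what the parity drive would store

# Drive 1 fails. XOR the surviving drives with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'disk'
```

Because every bit of the lost block can be recomputed from the good drives plus the parity, the array can keep working with one drive missing and rebuild the drive later.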
Redundancy is the idea that all information is duplicated. If you
have a system where one disk is an exact copy of another, one disk
is redundant for the other.
In some cases, drives can be replaced even while the system is
running. This is the concept of a hot spare. This is
done from the Virtual Disk Manager. Some hardware vendors even
provide the ability to physically remove the drive from the system
without having to shut the system down. This is called a hot swap.
All the control for the hard disks is done by the hard disk
controller and the operating system sees only a single hard disk. In
either case, data is recreated on the spare as the system is running.
Keep in mind that SCO does not directly support hot
swapping. This must be supported by the hardware in order to
ensure the integrity and safety of your data.
SCO's implementation of RAID is purely software, which makes sense since
SCO is a software company. Other companies provide hardware
solutions. In many cases, hardware implementations of RAID present a
single, logical drive to the operating system. In other words, the
operating system is not even aware of the RAIDness of the drives it
is running on.
In Figure 0-9 we see how the different layers of a virtual drive are
related. When an application (vi, a shell, cpio) accesses a
file, it makes a system call. Depending on whether you are accessing
a file through a raw device or the filesystem code, the
application uses the block or character code within the device
driver. The device driver it accesses at this point is for the
virtual disk. The virtual disk driver then accesses the device driver
for the physical hard disk.
What device the virtual device driver accesses depends on how it
is configured. (Which we will get to shortly.) The interesting thing
is that the virtual disk driver accesses the device in the same way,
regardless of what type of disk it is. Accessing the physical disk is
the problem of the physical disk driver and not the virtual
disk driver. It is therefore possible to have virtual disks composed
of different types of disks as well as disks on different
The simplest virtual disk is called (what else) a simple
disk. With this, you can define all your non-root filesystem space as
a single virtual disk. This can be done to existing filesystems and
not only provides more efficient storage, but using virtual disks
instead of conventional filesystems makes it easier to change to the
more complex virtual disks. This is because you cannot add existing
filesystems to virtual disks. They must be first converted to simple
A concatenated disk is created when two or more disk
pieces are combined. In this way, you can create logical disks that
are larger than any single disk. Disks that are concatenated together
do not need to be the same size. The total available space is simply
the sum of all concatenated disks. New pieces cannot be added to
concatenated disks once the filesystem is created. Remember that the
filesystem sees this as a logical drive. Division and inode tables
are based on the size of the drive when it is added to the system.
Adding a new piece would require you to recreate the filesystem.
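As a rough sketch of what concatenation means for the driver, the following Python function maps a block on the logical disk to a piece and a block within that piece. The piece names and sizes are made up, and the real virtual disk driver is certainly more involved.

```python
def locate_concatenated(pieces, logical_block):
    """Map a logical block to (piece name, block within that piece).

    pieces is a list of (name, size in blocks), in concatenation order.
    """
    for name, size in pieces:
        if logical_block < size:
            return name, logical_block
        logical_block -= size      # skip past this piece and keep looking
    raise ValueError("block is beyond the end of the virtual disk")


# Two hypothetical pieces of different sizes.
pieces = [("piece0", 1000), ("piece1", 400)]

print(sum(size for _, size in pieces))    # total space: 1400 blocks
print(locate_concatenated(pieces, 999))   # last block of the first piece
print(locate_concatenated(pieces, 1000))  # first block of the second piece
```

The pieces are simply laid end to end, which is why they do not have to be the same size.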
A striped array is also referred to as RAID 0 or RAID Level 0.
Here, portions of the data are written to and read from multiple
disks in parallel. This greatly increases the speed at which data can
be accessed. With two disks, for example, half of the data is being
read or written by each hard disk, which cuts the access time almost
in half. The amount of data that is written to a single disk is
referred to as the stripe width. For example, if single blocks are
written to each disk, then the stripe width would be a block.
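The way blocks are dealt out across a striped array can be sketched like this. The round-robin layout is the usual way striping is described, but the function and its parameter names are assumptions and not taken from the SCO driver.

```python
def locate_striped(logical_block, n_disks, stripe_width):
    """Map a logical block to (disk number, block on that disk).

    stripe_width is the number of consecutive blocks written to one
    disk before moving on to the next one.
    """
    chunk = logical_block // stripe_width  # which stripe-width chunk overall
    disk = chunk % n_disks                 # chunks are dealt out round-robin
    row = chunk // n_disks                 # how many chunks already on that disk
    return disk, row * stripe_width + logical_block % stripe_width


# Two disks with a stripe width of one block: even blocks land on
# disk 0 and odd blocks on disk 1, so each disk does half the work.
for block in range(4):
    print(block, locate_striped(block, n_disks=2, stripe_width=1))
```

With a larger stripe width, for example 16 blocks across five disks, logical blocks 0-15 go to disk 0, blocks 16-31 to disk 1, and block 80 wraps around to disk 0 again.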
Virtual Disk Layers
This type of virtual disk provides increased performance since data
is being read from multiple disks simultaneously. Since there is no
parity to update when data is written, this is faster than systems
using parity. However, the drawback is that there is no redundancy.
If one disk goes out, then data is probably lost. Such a system is
more suited for organizations where speed is more important than
Keep in mind that data is written to all the physical drives each
time data is written to the logical disk. Therefore, the pieces must
all be the same size. For example, you could not have one piece that
was 500 MB and a second piece that was only 400 MB. (Where would the
other 100 MB be written?) Here again, the total amount of space
available is the sum of all the pieces.
Disk mirroring (also referred to as RAID 1) is where data from the
first drive is duplicated on the second drive. When data is written
to the primary drive, it is automatically written to the secondary
drive as well. Although this slows things down a bit when data is
written, when data is read it can be read from either disk, thus
increasing performance. Mirrored systems are best employed
where there is a large database application
and availability of the data (transaction speed and reliability) is
more important than storage efficiency. Another consideration is the
speed of the system. Since it takes longer than normal to write data,
mirrored systems are better suited to database applications where
queries are more common than updates.
Striped Array With No Parity (RAID 0)
As of this writing, OpenServer does not provide for mirroring of the
/dev/stand filesystem. Therefore, you will need to copy this
information somewhere else. One solution would be for you to create a
copy of the /dev/stand filesystem on the mirror drive yourself. I have
been told by people at SCO that an Extended Functionality Supplement
(EFS) is planned to allow you to mirror /dev/stand and boot from it,
as well.
The term used for RAID 4 is a block-interleaved undistributed parity
array. Like RAID 0, RAID 4 is also based on striping, but redundancy
is built in with parity information written to a separate drive. The
term "undistributed" is used since a single drive is used
to store the parity information. If one drive fails (or even a
portion of the drive), the missing data can be recreated using the
information on the parity disk. It is possible to continue working
even with one drive inoperable since the parity drive is used
on-the-fly to recreate the data. Even data written to the disk is
still valid since the parity information is updated as well. This is
not intended as a means of running your system indefinitely with a
drive missing, but rather it gives you the chance to stop your system
Striped Array with Undistributed Parity (RAID 4)
RAID 5 takes this one step further and distributes the parity
information to all drives. For example, the parity drive for block 1
might be drive 5 but the parity drive for block 2 is drive 4. With
RAID 4, the single parity drive was accessed on every single data
write, which decreased overall performance. Since data and parity are
interspersed on a RAID 5 system, no single drive is overburdened. In
both cases, the parity information is generated during the write and
should a drive go out, the missing data can be recreated. Here again,
you can recreate the data while the system is running, if a hot
spare is used.
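One simple way to rotate the parity from drive to drive is sketched below. The rotation rule is an assumption chosen to match the example in the text (drive 5 first, then drive 4). The actual placement used by any particular RAID 5 implementation may differ.

```python
def parity_drive(stripe, n_drives):
    """Drive holding the parity for a stripe (both numbered from 1).

    Hypothetical rotation: start at the last drive and move down by
    one for each following stripe, wrapping around at drive 1.
    """
    return n_drives - (stripe - 1) % n_drives


def data_drives(stripe, n_drives):
    """The remaining drives, which hold the data for that stripe."""
    parity = parity_drive(stripe, n_drives)
    return [drive for drive in range(1, n_drives + 1) if drive != parity]


for stripe in range(1, 7):
    print(stripe, parity_drive(stripe, 5), data_drives(stripe, 5))
```

Over five consecutive stripes every drive holds the parity exactly once, so the parity writes are spread evenly. On RAID 4 the same function would simply return the same drive every time.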
Striped Array With Distributed Parity (RAID 5)
As I mentioned before, some of the characteristics can be combined.
For example, it is not uncommon to have striped arrays
mirrored as well. This provides the speed of a striped array with
the redundancy of a mirrored array, without the expense necessary to
implement RAID 5. Such a system would probably be referred to as RAID
10 (RAID 1 plus RAID 0). All of these are configured and administered
using the Virtual Disk Manager, which then calls the dkconfig
utility. It is advised that, at first, you use the Virtual Disk
Manager since it is easier to use. However, once you get the hang of
things, there is nothing wrong with using dkconfig directly.
The information for each virtual disk is kept in the /etc/dktab
file and is used by dkconfig to administer virtual disks. Each entry
is made up of two kinds of lines. The first is the virtual disk
declaration line. This is followed by one or more virtual piece
definition lines.
Here we have an example of an entry in a dktab
file that would be used to create a 1 GB array. (This is RAID
array 5 16
/dev/dsk/1s1 100 492000
/dev/dsk/2s1 100 492000
/dev/dsk/3s1 100 492000
/dev/dsk/4s1 100 492000
/dev/dsk/5s1 100 492000
The first line is the virtual disk declaration line and varies in the
number of fields depending on what type it is. In each case, the
first entry is the device name for the virtual device followed by
what type it is. For example, if you have a simple virtual disk,
there is only the device name followed by the type (simple). Here, we
are creating a disk array, so we have array in the type field.
A simple disk consists of just a single piece. The other types, such
as mirror, concatenated, etc., require a third field to indicate how
many pieces (simple disks) go into making up the virtual disk. Since
we are creating a disk out of five pieces, this value is 5.
If you use striped disks or disk arrays, then the fourth field
defines the size of the cluster in 512-byte blocks. We are using a
value of 16, therefore we have an 8K cluster size. If you have
mirrored disks, then the fourth field is the "catch up" block size
and is used when the system is being restored.
The virtual piece definition line describes a piece of the virtual
disk. (In this case we have five pieces.) It consists of three fields.
The first is the device node of the physical device. Note that in our
case, each of the pieces is on a separate physical drive. (We
know this because of the device names 1s1-5s1.)
The second field is the offset from the beginning of the physical
device of where to start the disk piece. Be sure you leave enough
room so you start beyond the division and bad track tables. Here we
are using a value of 100 and since the units are disk blocks (512
bytes), we are starting 50K from the beginning of the partitions,
which is plenty of room.
The third field is the length of the disk piece. Here you need to be
sure that you do not go beyond the end of the disk. In our case
we are specifying 492000. This is also in disk blocks. Therefore,
each of the physical pieces is just under 250 MB. Since one piece's
worth of space in the array holds parity, the actual amount of
storage we get is the sum of four pieces, or just under 1000 MB (1 GB).
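The arithmetic behind these fields is easy to check. The short Python sketch below uses the values from the example entry. It treats the array as one with parity (an assumption based on the remark that follows about converting it to RAID 4), so one piece's worth of space is not available for data. The function names are made up.

```python
BLOCK = 512  # bytes per disk block, the unit used in /etc/dktab


def blocks_to_bytes(blocks):
    return blocks * BLOCK


def usable_bytes(piece_blocks, n_pieces, parity_pieces=1):
    """Space left for data once the parity share is subtracted."""
    return blocks_to_bytes(piece_blocks) * (n_pieces - parity_pieces)


cluster = blocks_to_bytes(16)      # fourth field of the declaration line
offset = blocks_to_bytes(100)      # second field of each piece line
piece = blocks_to_bytes(492000)    # third field of each piece line

print(cluster // 1024, "K cluster size")
print(offset // 1024, "K skipped at the start of each piece")
print(round(piece / 1024 ** 2), "MB per piece")
print(round(usable_bytes(492000, 5) / 1024 ** 2), "MB usable in the array")
```

With these values each piece comes to about 240 MB (counting 1 MB as 1024K), and four pieces' worth of data gives roughly 960 MB, or about 1 GB.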
To change this array to RAID 4, where there is a single drive that is
used solely for parity, we could add a fourth field to one of the
virtual piece definition lines. For example, if we wanted to turn
drive three into the parity drive, we would change it to look like
Okay, so you've decided that you need to increase performance or
reliability (or both) and have decided to implement a virtual disk
scheme. Well, which one? Before you decide, there are several things
you need to consider. The System Administrators Guide contains a
checklist of things to consider when deciding which is best for you.
Things to Consider
If you create an emergency boot/root floppy on a system with virtual
disks, there are a couple of things to remember. First, once you
create a virtual disk, you should create a new boot/root floppy set.
This is especially important if the virtual disk you are adding is a
mirror of the root disk. If you do not and later need to boot from
the floppy, then any changes made to the root filesystem will not be
made to the mirror. The drives will then be inconsistent.
In order to boot correctly, you need to change the default boot
string. Normally, the default boot string points to hd(40) for the
root filesystem. Instead, you need to change it to reflect the fact
that the root filesystem is mirrored. For example, you could use the
tells the system to use virtual disk 1 as the root filesystem.
Note also that the device names are probably different from one
machine to another. Therefore, it may not be possible to use the
boot/root floppy set from one machine on another.
It's also possible to "nest" virtual drives. For example,
you could have several drives that you make into a striped array.
This striped array is seen as a single drive, which you can then
include in another virtual disk. For example, you could mirror that
Be careful with this, however. It is not recommended that you nest
virtual disks with redundancy (mirrored, RAID 5) inside of
other virtual disks. This can cause the virtual disk driver to hang,
preventing access to all virtual drives.
Accessing DOS Files
Even on a standard SCO UNIX system, without all the bells and
whistles of TCP/IP, X-Windows, and SCO Merge, there are several
tools that you can use to access DOS filesystems. These are described
on the doscmd(C) man-page. Although these tools have some obvious
limitations due to
the differences in the two operating systems, they provide the
mechanism to exchange data between the two systems.
Copying files between DOS and UNIX systems presents a unique set of
problems. One of the most commonly misunderstood aspects of this is
using wildcards to copy files from DOS to UNIX. This we can do using
One might think that this command copies all the files from the a:
drive (which is assumed to be DOS formatted) into the current
directory. The first problem is the way that DOS interprets
wildcards. Using a single asterisk would only match files without an
extension. For example, it would match LETTER, but not LETTER.TXT.
So, if we expand the wildcard to include the possibility of
extensions, we get:
Which should copy everything from the floppy into the current
directory. Unfortunately, that's not the way it works either. Instead
of the message:
/dev/install:* not found
You get the slight variation:
/dev/install:*.* not found
Remember from our discussion of shell basics that it is the shell
that does the expansion. Since nothing matches, we get this error. The
solution to the problem was a little shell script that does a listing
of the DOS device. Before we go on, we need to side-step a little.
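The shell behavior behind that error message can be seen without a DOS disk at all. The following is only a sketch: the directory name is made up and is assumed not to exist, so the pattern cannot match anything. When nothing matches, a Bourne-style shell hands the pattern to the command unchanged, which is exactly what happened to doscp.

```shell
# Assumption: there is no directory called /no_such_dir_for_demo, so
# the wildcard below has nothing to match.
set -- /no_such_dir_for_demo/*.*

# The shell found no match and left the pattern as it was typed.
unmatched=$1
echo "the command would receive: $unmatched"
```

The shell only looks at UNIX file names when it expands wildcards. It knows nothing about what is on the DOS disk, so the asterisks arrive at the command as literal characters.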
There are two ways to get a directory listing off a DOS disk. The
first is with the dosdir command.
This gives you output that appears just as if you had run the
dir command under native DOS. In order to use this output, we would
have to parse each line to get the file name. Not an easy thing. The
other is dosls, which
gives a listing that looks like the UNIX ls command. Here you have a
single column of file names with nothing else. Much easier to parse.
The problem is that the file names come off in capital letters.
Although this is not a major problem, I like to keep my file names as
consistent as possible. Therefore, I want to convert them to lower
case.
Skipping the normal things I put into scripts like usage messages and
argument checking, the script could look like this:
DIR=$1
dosls $DIR | while read file
do
    echo $file
    doscp "$DIR/$file" `echo $file | tr "[A-Z]" "[a-z]"`   # Note the back-ticks
done
The script takes a single argument which is assigned to the DIR
variable. We then do a dosls of that directory which is piped to the
read. If we think back to the section on shell programming, we know
that this construct reads input from the previous command (in this
case dosls) until the output ends. Next, we have a do-done loop that
is done once for each line. In the loop, we echo the name of the file
(I like to see what's going on) and then do the doscp.
The doscp line is a little complex. The first part ($DIR/$file)
is the source file. The second part, as you would guess, is the
destination file. However, the syntax here gets a little bit tricky.
Remember that the back-ticks mean "the output of the command".
Here, that command is echo | tr.
Note that we are echoing the file name through tr
and not the contents. It is then translated in such a way that
all capital letters are converted to lower case. See the tr(C)
man-page for more details.
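If you want to try the loop and the back-tick construct without a DOS disk, the following sketch can be run anywhere. The function fake_dosls is a made-up stand-in that just prints a few upper-case names the way dosls would, and to_lower is a hypothetical helper that wraps the echo | tr pipeline. No files are copied.

```shell
# Made-up stand-in for dosls: one upper-case name per line.
fake_dosls() {
    printf '%s\n' LETTER.TXT REPORT.DOC README
}

# Hypothetical helper: echo the name (not the file contents) through tr.
to_lower() {
    echo "$1" | tr '[A-Z]' '[a-z]'
}

fake_dosls | while read file
do
    # Back-ticks: the output of to_lower becomes the new name.
    lower=`to_lower "$file"`
    echo "$file would be copied as $lower"
done
```

In the real script, the value computed here as lower is what ends up as the destination argument to doscp.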
To go the other way (UNIX to DOS), we don't have that problem.
Wildcards are expanded correctly, so we end up with the right files. In
addition, we don't need to worry about the names, since they are
converted for us. The problem lies in names that do not fit into the
DOS 8.3 standard. If a name is longer than eight characters or the
extension is longer than three characters, it is simply truncated. For
example, the name letter_to_jim.txt
ends up as letter_t.txt,
or letter.to.jim becomes
One thing to keep in mind here is that copying files like this
is only really useful for text files and data files. You could
use it to copy executables to your SCO system if you were running SCO
Merge, for example. However, this process does not convert a
DOS executable into a form that native SCO can understand, nor
Be careful when copying files because of the conversions that are
made. With UNIX text files, each line ends with a single new-line
(NL) character. The system converts this to a carriage return-new
line (CR-NL) pair when outputting the line, whereas DOS stores the
CR-NL pair in the file itself. You can ensure that, when copying
files from DOS to UNIX, the CR-NL is converted to simply an NL by
using the -m option to doscp. This also
ensures that the NL is converted to a CR-NL when copying the other
way. If you want to ensure that no conversion is made, use the -r
option.
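The line-ending conversion itself can be imitated with standard filters, which is handy for seeing exactly what changes. This is only a sketch: dos_to_unix and unix_to_dos are hypothetical function names built on tr and awk, not the SCO commands, and they make no attempt to handle anything other than line endings.

```shell
# Hypothetical stand-ins, not the real SCO commands.
dos_to_unix() {
    # Delete every CR (octal 015). Note that this also removes a CR
    # that is not followed by an NL.
    tr -d '\015'
}

unix_to_dos() {
    # Re-emit each line with a CR-NL pair at the end.
    awk '{ printf "%s\r\n", $0 }'
}

# od -c makes the otherwise invisible \r and \n characters visible.
printf 'line one\r\nline two\r\n' | dos_to_unix | od -c
printf 'line one\nline two\n' | unix_to_dos | od -c
```

Both functions read stdin and write stdout, so to actually get a converted file you redirect the output.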
You can also make the conversion using either the xtod
or dtox commands. The
xtod command converts UNIX
files to DOS format and the dtox
converts DOS format files to UNIX. In both cases, the command
takes a single argument and outputs to stdout. Therefore, to actually
"copy" to a file you need to re-direct stdout.
An alternative to doscp
is to mount the DOS disk. Afterwards you can use standard UNIX
commands like cp to copy
files. Although this isn't the best idea for floppies, it is
wonderful for DOS hard disks. In fact, I have it configured so that
all of my DOS file systems are mounted automatically via
To be able to do this, you have to add the support for it in the
kernel. Fortunately, it is simply a switch that is turned on or off
via the mkdev dos script.
Since it makes changes to the kernel, you need to relink and reboot.
Once you have run mkdev dos,
you can mount a DOS filesystem by hand or, as I said, automatically.
For example, if we wanted to mount the first DOS partition on the
first drive, we have two choices of devices:
/dev/hd0d or /dev/dsk/0sC.
I prefer the latter, since I have several DOS partitions, some of
which do not have an equivalent in the first form. Therefore, by using
/dev/dsk/0sC, I am
consistent in the names I use. If we wanted to mount it onto
/usr/dos/c_drive, the command would be:
mount -f DOS /dev/dsk/0sC /usr/dos/c_drive
The only issue with this is that in ODT, the file names were
all capitalized. In OpenServer, there is the lower option, which is
used to show all the file names in lower case. Therefore, the command
would look like this:
mount -f DOS -o lower /dev/dsk/0sC /usr/dos/c_drive
Although you can use the mkdev
fs script to add a DOS filesystem, it displays a couple of
annoying messages. Since I think it is just as easy to edit
/etc/default/filesys, I do
so. There is also the issue that certain options are not possible
through the mkdev fs script
or the Filesystem manager. Therefore, I simply copy an existing entry
and end up with something like this:
mount=yes fstyp=DOS,lower \
fsckflags= rcmount=yes \
The key point is the fstyp
entry. Since we can specify mount options here, I specified the lower
option so that all filenames would come out in lower case. Each time
I go into multi-user mode, this filesystem is mounted for me. For
more details on the options here, check out the mount(ADM)
man-page or the section on filesystems. (Note: The lower
option is only available in OpenServer.)
Keep in mind that if the DOS filesystem that you are mounting
contains a compressed volume, you will not see the files within the
compressed volume. This applies to both ODT and OpenServer.
Another of the DOS commands that I use often is
dosformat. Although there are a few options (-v to prompt for
volume name, -q for quiet
mode, -f to run in
non-interactive mode), I have never used them. The one thing I need
to point out is that you format a UNIX floppy with the raw device
(e.g. /dev/rfd0), but with
dosformat, you format the
block device (e.g. /dev/fd0).
The remaining commands, which I use only on occasion, are:
dosrm - removes files from a DOS filesystem
- make a directory on a DOS filesystem
- move directories from a DOS filesystem
As with the kernel components, I suggest you go poke around the
system a little. Take a look at the files on your system that we
talked about to see what filesystems you have, where they are mounted
and anything else you can find out about your system. Look for
different kinds of files. If they are hard links, try to find out
what other files are linked to them. If you find a symbolic link, take
a look at the file pointed to by the symbolic link.
In every case, look at the file permissions. Think about why they are
set that way and what influence this has on their behavior. Also
think about the kind of file it is and who can access it. If you
aren't sure of what kind of file it is, you can use the file
command to find out.
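If you would rather practice on scratch files before poking at real system files, something like the following sketch works. The directory and file names are made up for the example, and everything is created under a temporary directory that is removed at the end.

```shell
# Work in a scratch directory so nothing on the system is touched.
scratch=${TMPDIR:-/tmp}/linkdemo.$$
mkdir "$scratch" && cd "$scratch" || exit 1

echo "hello" > original
ln original hardlink       # a second name for the same inode
ln -s original symlink     # a separate file that points at "original"

# ls -i prints the inode number first. Hard links share one inode;
# the symbolic link has an inode of its own.
inode_orig=`ls -i original | awk '{ print $1 }'`
inode_hard=`ls -i hardlink | awk '{ print $1 }'`
inode_sym=`ls -i symlink | awk '{ print $1 }'`

# Inode, permissions, link count and (for the symlink) the target.
ls -li original hardlink symlink

# Ask the file command what kind of files these are, if it is installed.
if command -v file >/dev/null 2>&1; then
    file original symlink
fi

cd / && rm -rf "$scratch"
```

In the ls -li output, the link count for original and hardlink is 2, since two names refer to the one inode, while the symbolic link shows the name it points to.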