In Search of the Kernel
Previous: Linux Kernel: Introduction
Imagine that you were designing a computer that would be used for a multi-user OS like Linux. You've given your CPU a 32-bit address space, so it can, in theory, reach 2^32 bytes (4 GB) of physical memory. You know that very often the CPU will be working with far less real memory, and you also know that while you could just skip protecting memory in hardware, doing so wouldn't lead to a very stable OS, so you want to add some features to help out. What might you do?
If you have any sort of a software background at all, the concept of indirection immediately comes to mind. Rather than having the CPU directly access physical memory, make it so that each address is really a pointer. An analogy is the Unix inode structure: an inode contains pointers to disk blocks that contain actual file data. When a Unix program gets the inode number of a file, it reads data from a fixed size table (the inode table) and from that information, determines what disk blocks to read next. Note that the inode also contains permission information: if the program doesn't have the right credentials, it won't get access to the disk blocks.
You could apply the same concept to your CPU: a table that would contain both permissions and access to real physical memory locations. So, even though your CPU wants to access a specific address, your design forces that address to be both checked and translated by this table. Our table takes care of both problems: we can check permissions (more on just how we'd do that later), and we can handle lack of physical memory. There are some details to be worked out, but the grand concept is there. Where do you put such a table? In memory, of course, but we just said that access to memory will be controlled by that table that we don't have yet. That's a circular trap, and there are only a couple of ways out of it:
- We could have an initial table with some well known addresses set up automatically by the CPU initialization. That table could be used while we set up the table we really want to use.
- We could do the same thing with the table stuck in ROM.
- We could have the CPU start up in a mode that doesn't use such address translation, and switch to the translating mode later. This is what Intel CPUs do; the non-table mode is called "real" mode; after switching, we're in "protected" mode.
Let's not worry too much about the details of that table just now, but we ought to define some terms. Because the entries in that table describe how and where we access memory, let's call the entries "descriptors" (another good reason is that this is what Intel calls them!), and therefore the table must be a Descriptor Table. Our CPU should have a register that tells it where in memory to find the table, so we'll call that (as Intel does) the Global Descriptor Table Register, or GDTR. Obviously we'd fill our table with the protection bits and pointers we want, stick the starting address in the GDTR, and then make the magic switch to protected mode.
But just what are we describing with our descriptors? You can't very well have a descriptor for every memory address: obviously that would take more memory than you have. No, each descriptor should define a range of memory. We'll call that range a "segment" and, of course, so does Intel. Funny how they picked the exact terms I use here, and did so years before I wrote this!
So our descriptor is going to have to include the starting address of the segment, how big it is, what kind of protection it has, and any other little details we decide to keep track of.
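As a rough sketch of what such a descriptor might hold, here is a small Python model. The field names (base, limit, kernel_only) and the sample values are assumptions made for illustration; they are not Intel's actual 8-byte layout, which packs these fields quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Toy model of one table entry (hypothetical field names)."""
    base: int          # starting physical address of the segment
    limit: int         # length of the segment in bytes
    kernel_only: bool  # True if only kernel-mode code may use it

    def contains(self, offset: int) -> bool:
        """Return True if the offset falls inside the segment."""
        return 0 <= offset < self.limit


# Arbitrary sample values, chosen only to exercise the model.
example = Descriptor(base=0x10000, limit=0x8000, kernel_only=False)
print(example.contains(0x4000))  # an offset inside the segment
print(example.contains(0x9000))  # an offset past the end
```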
How many descriptors shall we leave room for? Arbitrarily, let's make it 8,192, and we'll set aside 8 bytes for each descriptor. This makes it easy for us to index into our table from the first (highest) 13 bits of an address. I know, I know, bits make your head hurt, you don't want to think about it, and neither do I. But let's try to make it easy:
- A byte is 8 bits, and can address 256 locations - 0 to 255.
- 9 bits address 512 locations - 0 to 511.
- 10 bits address 1024 locations - 0 to 1023.
- 11 bits address 2048 locations - 0 to 2047.
- 12 bits address 4096 locations - 0 to 4095.
- 13 bits address 8192 locations - 0 to 8191.
We have 8,192 locations, each of which is 8 bytes long, so when our CPU wants a specific entry, it just needs to multiply the entry number by 8, which is a rather easy thing to do (all you have to do is move bits to the "left" once to multiply by 2, twice for 4, three times for 8).
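The arithmetic above is easy to check for yourself. This is only a sketch of the numbers in the text; the constant and function names are made up for the example.

```python
ENTRY_BITS = 13   # bits used to pick a table entry
ENTRY_SIZE = 8    # bytes per descriptor

entries = 1 << ENTRY_BITS            # 2**13 entries
table_bytes = entries * ENTRY_SIZE   # total size of a full table


def entry_offset(index: int) -> int:
    """Byte offset of an entry: shifting left 3 times multiplies by 8."""
    return index << 3


# Reproduce the list above: n bits address 2**n locations, 0 to 2**n - 1.
for bits in range(8, 14):
    print(bits, "bits address", 1 << bits, "locations: 0 to", (1 << bits) - 1)

print(entries, table_bytes, entry_offset(5))
```

A completely full table therefore takes 65,536 bytes, and the last entry (number 8,191) starts 65,528 bytes into it.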
Would this give us a map of our entire 4GB address space? In other words, given any arbitrary 32 bit address, would looking at the first 13 bits multiplied by eight, and then added to the base stored in the GDTR, find an entry in this table? Of course, it would, because 13 bits is 8,192 addresses.
But then what? Well, let's take it slowly:
Our CPU wants data from a specific address. Our Segmentation Unit is going to intercept that request because we're in Protected mode. It will extract the first 13 bits of the address, shift the bits to multiply by 8, add that to what is stored in the GDTR, and pull our descriptor from that memory address.
The descriptor is then examined to see if we have permission to access that memory (we haven't yet said how we decide that, but we'll get to that eventually). If we do, we then look at the size of the segment as described by the descriptor. That has to be larger than the remaining bits (the "offset") of the address the CPU wants. Let's look at that in detail:
The CPU wants address 16,384. Our GDTR holds the address 100,000 (isn't that convenient? Completely unrealistic, but convenient for this walk through). The high 13 bits of 16,384 are, conveniently enough, 0. That multiplied by 8 is still 0, so the segmentation unit looks at 100,000 plus 0, and finds a descriptor. This descriptor says that the real memory that belongs to this segment is actually located at address 200,000 and that the segment is 32,768 bytes long. That's good news for us, because we want to read the 16,384th byte, and that's within the segment. So the real memory at 216,384 is what the CPU actually gets, assuming that we have the right privilege. Privilege? What's that?
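Here is a minimal sketch of that lookup step for our imaginary design, in which the top 13 bits of a 32-bit address pick the table entry and the remaining 19 bits are left over. The function names are invented for the example.

```python
ADDRESS_BITS = 32
INDEX_BITS = 13
OFFSET_BITS = ADDRESS_BITS - INDEX_BITS   # the 19 bits left over


def split_address(address: int) -> tuple[int, int]:
    """Split a 32-bit address into (table index, remaining bits)."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("not a 32-bit address")
    index = address >> OFFSET_BITS                 # the high 13 bits
    offset = address & ((1 << OFFSET_BITS) - 1)    # the low 19 bits
    return index, offset


def descriptor_address(address: int, gdtr: int) -> int:
    """Where in memory the descriptor for this address lives."""
    index, _ = split_address(address)
    return gdtr + (index << 3)   # index times 8, plus the table's start


print(split_address(16_384))
print(descriptor_address(16_384, gdtr=100_000))
```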
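The walk through above can be replayed in a few lines. This is a sketch of our made-up design only: descriptor_memory is a hypothetical stand-in for the table in RAM, holding just the one descriptor from the example, and permissions are left out for now.

```python
OFFSET_BITS = 19      # bits left over after the 13 index bits
GDTR = 100_000        # where our pretend table starts

# Stand-in for memory: descriptor address -> (segment base, segment length).
descriptor_memory = {GDTR + 0: (200_000, 32_768)}


def translate(address: int) -> int:
    """Turn the address the CPU asked for into a real memory address."""
    index = address >> OFFSET_BITS
    offset = address & ((1 << OFFSET_BITS) - 1)
    base, limit = descriptor_memory[GDTR + (index << 3)]
    if offset >= limit:
        raise ValueError("offset is beyond the end of the segment")
    return base + offset


print(translate(16_384))  # prints 216384
```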
Privilege is what determines permission. We're obviously going to have two kinds of processes running on this system: kernel code, which should be able to access whatever it darn well pleases, and user code which should be controlled. We'll need a way to tell our CPU that it's in Kernel mode or User mode, and that information needs to be available to the Segmentation Unit so that it can react appropriately. So, in our artificial example here, if our descriptor said that this segment can only be accessed when in Kernel mode, we'd get refused access if we were running a user process. We'd also get refused if we had requested address 33,754, because that would be beyond the segment's length as stated by the descriptor.
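Both kinds of refusal can be sketched as one small check. The function name and the two-state kernel/user flag are simplifications assumed for this example; real Intel CPUs have four privilege levels rather than a single yes-or-no flag.

```python
def access_allowed(offset, limit, kernel_only, in_kernel_mode):
    """Return (allowed, reason) for one memory access."""
    if kernel_only and not in_kernel_mode:
        return False, "segment is kernel-only"
    if offset >= limit:
        return False, "offset is past the end of the segment"
    return True, "ok"


SEGMENT_LIMIT = 32_768  # the segment length from the example above

# A user process touching a kernel-only segment is refused.
print(access_allowed(16_384, SEGMENT_LIMIT, kernel_only=True, in_kernel_mode=False))
# Address 33,754 is refused because it lies beyond the segment's length.
print(access_allowed(33_754, SEGMENT_LIMIT, kernel_only=False, in_kernel_mode=False))
# An ordinary in-range access to an unrestricted segment succeeds.
print(access_allowed(16_384, SEGMENT_LIMIT, kernel_only=False, in_kernel_mode=False))
```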
Does all this sound pretty clumsy to you? It is, and in actuality, Linux tries to ignore this segmentation stuff as much as possible. To that end, it sets up only a few system descriptors, and those start at address 0 and have lengths that run right out to 4 GB. There are two code segments and two data segments (one each for kernel and user use), and these totally overlap each other, each running from 0 to 4 GB. The very first descriptor is deliberately left empty, and there are a very few other descriptors used; most of the 8,192 table slots are available. However, prior to 2.4, every process had two entries in the Global Descriptor Table, and there were 14 used by the kernel, so the maximum number of processes (the NR_TASKS variable) was seriously constrained: that 8,192 figure we "arbitrarily" picked is the actual maximum size of the Global Descriptor Table on real Intel CPUs. That limited user processes to something less than 4,090. The 2.4 design moves the per-process information elsewhere, where it can allocate whatever space it needs. It is still constrained, of course, by the limits of reality, but the default for a 256MB machine would now be 16,384, and could be increased by simply putting a new value in /proc/sys/kernel/threads-max.
So let's review what happens when a process is running along executing code. The CPU has a memory access, and a 13-bit index will cause the segmentation unit to look into the Global Descriptor Table. (In our simplified design that index was the first 13 bits of the address; on real Intel CPUs it comes from a 16-bit segment selector held in a segment register such as CS or DS, and the full 32-bit address serves as the offset.) Whether it will be looking at one of the kernel mode or user mode descriptors depends on that index, which will have been carefully set so that it points at the right descriptor slot. The privilege level will be checked, the size of the segment will be checked against the offset, and adding the offset to the segment's starting address will give us the real physical address.
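The pre-2.4 ceiling follows directly from the figures just given, as this small sketch shows. The constant names are invented for the example, and read_threads_max is a hypothetical helper that simply returns None on a system without the /proc entry.

```python
from pathlib import Path

GDT_SLOTS = 8192        # maximum entries in the Global Descriptor Table
KERNEL_SLOTS = 14       # entries the kernel kept for itself
SLOTS_PER_PROCESS = 2   # entries each process needed before 2.4

max_processes = (GDT_SLOTS - KERNEL_SLOTS) // SLOTS_PER_PROCESS
print(max_processes)  # prints 4089


def read_threads_max(path="/proc/sys/kernel/threads-max"):
    """Return the current limit, or None if the file cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


print(read_threads_max())
```

That 4,089 is the "something less than 4,090" mentioned above.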
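To see why those giant overlapping segments amount to "ignoring" segmentation, here is a sketch that assumes (as on real Intel hardware) that the descriptor is chosen separately from the address and that the whole 32-bit address is the offset. The segment names are labels made up for this example, and the limit is modelled as a plain byte count rather than the way Intel actually encodes it.

```python
FOUR_GB = 1 << 32

# name -> (base, limit in bytes, kernel_only); all four overlap completely.
flat_gdt = {
    "kernel_code": (0, FOUR_GB, True),
    "kernel_data": (0, FOUR_GB, True),
    "user_code": (0, FOUR_GB, False),
    "user_data": (0, FOUR_GB, False),
}


def segment_translate(segment: str, offset: int, in_kernel_mode: bool) -> int:
    """Apply the privilege check, the limit check, then add the base."""
    base, limit, kernel_only = flat_gdt[segment]
    if kernel_only and not in_kernel_mode:
        raise PermissionError("segment is kernel-only")
    if not 0 <= offset < limit:
        raise ValueError("offset is beyond the end of the segment")
    return base + offset


# With a base of 0 and a 4 GB limit, every address comes out unchanged.
print(hex(segment_translate("user_data", 0x08048000, in_kernel_mode=False)))
```

The only thing segmentation still contributes in this layout is the kernel/user distinction.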
But wait a minute. This isn't going to work. Well, maybe it could if we really had 4 GB of memory, but chances are we don't. So, because of these giant segments, the segmentation unit could happily send us off to get memory that doesn't exist. Shouldn't we instead set up the segments so that they are limited to the amount of memory we really have?
Well, you could do that, but that means limiting the memory addresses that programs can use, and we'd either have to compile them that way to start with, or go through some complicated adjusting before we let them run. No, it's better to let programs use any valid 32 bit address, and have some other way to deal with the problem of not having enough real physical memory. That "other way" is called paging and will be the subject of the next installment in this series.
So, instead of this clumsy segmentation business, Linux primarily uses paging to manage memory. On Intel, segmentation is a necessary fact of life: it cannot be avoided entirely, because that's the way the CPU works. Worse, Intel doesn't just have that Global Descriptor Table, it also has what it calls the Local Descriptor Table.
The use of LDTs in Linux seems to be an area that baffles most people; I've yet to find any resource that doesn't obfuscate it or ignore it entirely. From what I can understand, it looks like processes get a default LDT that they generally share, but can create their own LDTs if necessary. The sorts of programs that might need to do that are emulators like Wine. The primary segment allocated to a process is what's called a Task State Segment, which is used to store the registers and other information needed to be kept for each process, but each process also gets a descriptor pointing to the default LDT. It is completely unclear to me why the default LDT is needed (other than for things like Wine). In searching about the web, I find some sources that seem to think the LDT is used, but most say the opposite. A process can create its own LDTs, so why do we need the default if it isn't used? For the moment at least, I don't have a clue.
In the next segment, we'll look at paging.