Swap and Dump

This is part of a series of articles that covers the booting of an OSR5 machine. See Booting OSR5 for other related articles.

Thanks to Bela Lubkin for some comments and suggestions that helped clarify this article.



Normally, the swap and dump devices are one and the same. You could change that by editing the file /etc/conf/cf.d/sassign and relinking a new kernel. But why would you want to? There might be reasons, but first we need to understand what swap and dump are for.

What are these for, anyway?

Dump is pretty simple. Its only purpose is to receive a kernel dump. Therefore there are two immediately obvious things to be said about dump: unless you are a kernel or driver developer who expects to be regularly crashing your system, you may NEVER need dump at all, and secondly, if you do ever need it, it had better be big enough to hold everything currently in memory. That's what a dump is: the contents of memory. If you have 256 MB of memory, you'd need 256 MB of dump space.
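As a rough sanity check, that sizing rule can be sketched as a small shell calculation. The function names here are made up for illustration (they are not OSR5 commands), and the sketch assumes the dump device is measured in 512-byte blocks, the same unit "swap -l" reports:

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of OSR5.

# blocks_needed MEM_BYTES
# Number of 512-byte blocks needed to hold a dump of MEM_BYTES of memory.
blocks_needed() {
    echo $(( ($1 + 511) / 512 ))
}

# dump_fits MEM_BYTES DEVICE_BLOCKS
# Succeeds if a device of DEVICE_BLOCKS blocks can hold the whole dump.
dump_fits() {
    [ "$2" -ge "$(blocks_needed "$1")" ]
}

# 256 MB of memory, counted as 256 * 1024 * 1024 bytes
blocks_needed 268435456    # prints 524288
```

So a machine with 256 MB of memory would want a dump device of at least 524288 blocks; something like "dump_fits 268435456 2000" fails, because 2000 blocks is nowhere near enough.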

What if you had 4 GiB of memory? OSR5 does support that, unlikely as it used to be for most of us. Funny that its maximum file size is still 2 GiB, while memory supports the full 4 GiB. But there's a problem there: to analyze a system dump, you normally would copy it (using dd) from the swap or dump device to a file on disk. But since the maximum size of a file is currently 2 GiB on OSR5, you could only copy half of a 4 GiB dump. Given that, it would seem that there would be little value in having dump be more than 2 GiB no matter how much memory you had.

However, it turns out that "crash" (which is what you'd use to analyze the dump) can read directly from the dump device- you don't have to transfer it to a file. So in the case of a machine with more than 2 GiB of RAM, you might very well want a separate dump device. Even if you aren't capable of crash analysis, you could write the dump directly from the dump device to tape so that it could be sent to someone else. Are you likely to do that? Probably not. When most of us have panic problems, we either have a pretty good idea where it's coming from (because we just added something) or we just start ripping things out until the problem goes away.

The most common crash is a Panic Trap 0x0000000E, and that's often bad memory, which is both cheap and easy to fix.

Given the uses your machine is put to, what are the chances that you'd run "crash" to debug a dump or pay money for someone else to do it? Maybe you don't need that dump space at all? Remember though: if you have a crash and don't have enough dump space, it may be too late to change your mind.






You can test dumping with the "sysdump" program. This is actually a tool for copying, compressing and uncompressing dumps, but it can also be used to create a dump on demand. That could be useful if a kernel needs to be professionally examined or if you just want to test your dump device. You could, for example, run



/etc/sysdump -i /dev/mem -n /unix -o /dev/swap


Bela Lubkin commented on that:




This is a decent test, but!  If you do this to a running system which is
actually swapping, you will have a big problem.  You should have them
run `swap -l` first to make sure swap is idle.  You could then go into a
big discussion of what to do if it isn't, but I think that would be a
distraction.  (If it isn't: could run `swap -d /dev/swap` and see if
that works -- it will start shoving stuff in from swap, and it'll
succeed if there's enough RAM, as would be the case if you swapped once
for a little while but current memory requirements fit within RAM.
Otherwise they could make a swap _file_ and `swap -a` it, then `swap -d
/dev/swap` and it'll shuffle stuff from one to the other.)



But as I said, a distraction.  All you really need is to warn them --
don't try this if swap is in use ("blocks" != "free" in `swap -l`).
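That warning can be turned into a quick check before testing the dump. This is only a sketch: "swap_idle" is a name I made up, and it assumes the "swap -l" column layout shown elsewhere in this article (path, dev, swaplo, blocks, free):

```shell
#!/bin/sh
# swap_idle: hypothetical helper, not an OSR5 command.
# Reads `swap -l` output on stdin and succeeds only if every swap
# area shows blocks == free, meaning nothing is swapped out right now.
swap_idle() {
    awk 'NR > 1 && $4 != $5 { busy = 1 } END { exit (busy ? 1 : 0) }'
}

# Intended use on a real system (not executed here):
#   swap -l | swap_idle && /etc/sysdump -i /dev/mem -n /unix -o /dev/swap
```

If swap is in use, the function fails and the sysdump command never runs.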


You specify where dump is to go by a "dump" keyword (dump=/dev/mydump) in /etc/default/boot or passed on the boot command line. Unlike swap, dump can't span multiple devices or use a file in the filesystem. You can also say "dump=none" if you definitely don't want to save anything.

See http://aplawrence.com/cgi-bin/ta.pl?105935 for more information.

If you really want to test by creating an actual panic, see http://aplawrence.com/cgi-bin/ta.pl?103679.

Swap is more complex. Swap actually has two purposes: to store processes if and when the kernel gets so low on free memory that it has to swap them out, and to serve as backing store for virtual memory. And that is a concept that is often misunderstood. For example, it is "common knowledge" that you need as much swap as you have memory, even if you have a separate dump device. It turns out that that is not true, at least not on OSR5 systems.

As I write this, my machine has 128 MB of RAM and just 1 MB of swap configured. Here's "swap -l" and "memsize":



# swap -l
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
# memsize
129626112


You might think I couldn't do much with that configuration, but that's not the case. Here's what's happening right now:

(view ps -l output)

Notice that there's an X session with Netscape running on tty02 and an Edge backup going on on tty03, plus this editing on tty01, plus all the system stuff including Squid and Apache! Another interesting thing to note is the sum of the "size" column:



# ps -e -o size| awk '{ sum += $1
}
END { print "Sum", sum }'
Sum 62256


The "size" column is in kilobytes, so there's roughly 62,000 KB (about 60 megabytes) worth of programs being run right now; "sar -r" confirms that:



# sar -r 1 1
SCO_SV scobox 3.2v5.0.4 Pentium    09/30/99



12:14:56 freemem freeswp (-r)
12:14:57   13942    2000


Memory is expressed in 4k pages here and swap is 512 byte blocks, so that's 57,106,432 bytes of free memory and 1 MB of swap.
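The unit conversions are easy to get wrong, so here is the arithmetic as a tiny shell sketch. The function names are just for illustration; the 4096 and 512 come from the page and block sizes mentioned above:

```shell
#!/bin/sh
# Illustrative helpers; the names are not OSR5 commands.

# 4k pages (as in the freemem column) to bytes
pages_to_bytes() {
    echo $(( $1 * 4096 ))
}

# 512-byte blocks (as in the freeswp column and `swap -l`) to bytes
blocks_to_bytes() {
    echo $(( $1 * 512 ))
}

pages_to_bytes 13942    # prints 57106432
blocks_to_bytes 2000    # prints 1024000
```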

This happens to be a 5.0.4 machine. On 5.0.5, sar -r shows two other columns- we'll get to those shortly.

There are 13942 free pages, which means that (13942 * 4096) 57,106,432 bytes are free. The system started out (after loading the kernel and allocating its buffers and variables) with 27,589 pages.

You get that from availrmem- on 5.0.5 that shows up in sar -r, on older systems you can do:



# echo "od -d availrmem" | crash
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e120:  0000027589   



The figure doesn't remain completely constant, but will remain close to the same amount most of the time.
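If you want to use that figure in a script, the crash output needs a little cleanup first. Here's a minimal sketch; "crash_values" is a made-up name, and it assumes the output looks like the sample above (an address, a colon, then a zero-padded decimal value):

```shell
#!/bin/sh
# crash_values: hypothetical filter, not an OSR5 command.
# Reads the output of `od -d <symbol>` run inside crash and prints just
# the decimal value from each "address:  value" line, without the padding.
crash_values() {
    awk '/^[0-9a-f]+:/ { print $2 + 0 }'
}

# Sample input copied from the session above
printf '%s\n' \
    'dumpfile = /dev/mem, namelist = /unix, outfile = stdout' \
    'f020e120:  0000027589   ' | crash_values    # prints 27589
```

On a real system you would feed it from crash itself, as in: echo "od -d availrmem" | crash | crash_values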

As 27589 - 13942 = 13647, and that times 4096 is 55,898,112, obviously some memory usage changed between the times I did these samples (the total available user memory minus the currently free pages should be the memory in use). This script tries to get closer to making everything agree:



echo "od -d freemem" | crash &
ps -e -o size| awk '{ sum += $1
}
END { print "Sum", sum }'


That "freemem" is the same thing sar -r reports. I put the "crash" session in background so that it has a chance of being included in the ps output; the results are:



Sum 66476
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e118:  0000012961   


That comes out closer, but you'll never get exact on a system that's working while you are measuring. The important point is that here we have a system with 60 megabytes or so of memory in use, and it is running quite happily with 1 MB of swap. Why?

The confusion of the "common knowledge" is due to the fact that virtual memory is what is really important, and virtual memory is the sum of available memory and available swap. With nothing running, that would be availrmem plus swap. Obviously you never have nothing running, so the sum of available RAM and available swap is kept track of in a kernel variable "availsmem" (note "availSmem" vs. "availRmem"). Prior to 5.0.5, the only way to find out its value at any time was to run



echo "od -d availsmem" | crash
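The "availrmem plus swap" relationship can be sketched with the numbers from this machine. This is a simplified model of what was just described, not the kernel's actual bookkeeping, and the function name is made up. It assumes swap is counted in 512-byte blocks and memory in 4k pages, so eight blocks make one page:

```shell
#!/bin/sh
# vm_ceiling_pages AVAILRMEM_PAGES SWAP_BLOCKS
# Hypothetical helper: upper bound on virtual memory, in 4k pages, with
# nothing running. Simplified model: RAM not used by the kernel plus swap.
vm_ceiling_pages() {
    echo $(( $1 + $2 / 8 ))
}

# 27589 pages of availrmem and the 2000-block swap device shown earlier
vm_ceiling_pages 27589 2000    # prints 27839
```

Note how little the 1 MB of swap contributes: only 250 of those pages.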


Starting with 5.0.5, "sar -r" lists availsmem and availrmem (amount of RAM not being used by the kernel). So let's do some testing to see what happens here when we ask for even more memory. First a little shell script:



# cat once
#!/bin/sh
# "once"
echo availsmem freeswap freemem
echo "od -d availsmem
od -d freeswap
od -d freemem" | crash
swap -l
ps -l
# ./once
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000012163   
f020e11c:  0000000250   
f020e118:  0000012795   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   752     1  0  73 20 fb117968  160  fb117968   tty01    00:00:00 login
 20 S      0  1618   799  1  76 20 fb11b080  240  f0f20300   tty01    00:00:04 vi
 20 S      0   792   752  0  73 20 fb11d218   68  fb11d218   tty01    00:00:00 sh
 20 S      0   799   792  0  73 20 fb11d370  128  fb11d370   tty01    00:00:00 ksh
 20 S      0  2009  1618  1  73 20 fb11f258   60  fb11f258   tty01    00:00:00 sh
 20 S      0  2010  2009 29  73 20 fb11f3b0   60  fb11f3b0   tty01    00:00:00 sh
 20 O      0  2014  2010  6  48 20 fb11f508  148         -   tty01    00:00:00 ps


That will give us a quick snapshot of what's happening. Now let's write some C programs to use some memory. The first allocates a 2 MB buffer on its stack, the second uses a static array. Both of them call the "once" script three times while running:



# cat stackarray.c
/* stackarray.c */
#include <stdlib.h>
main()
{
system("./once");
memfunc();
outfunc();
exit(0);
}
outfunc() {
        system("./once");
}
memfunc()
{
char array[2 * 1024 * 1024];
        outfunc();
}



# cat staticarray.c



/* staticarray.c */
#include <stdlib.h>
main()
{
system("./once");
memfunc();
outfunc();
exit(0);
}
outfunc() {
        system("./once");
}
memfunc()
{
static char array[2 * 1024 * 1024];
        outfunc();
}



# cc -o staticarray staticarray.c
# cc -o stackarray stackarray.c
# size stackarray staticarray
stackarray: 26396 + 4312 + 440 = 31148
staticarray: 26392 + 4312 + 2097592 = 2128296


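Note the difference between these in the last (.bss) column. That's because the static array is counted in the binary's .bss when it is linked, while the 2 MB stack array won't be set up until stackarray runs and calls memfunc().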



# ./stackarray
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014629   
f020e11c:  0000000250   
f020e118:  0000015313   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2067  2041  0  73 20 fb11b080   16  fb11b080   ttyp1    00:00:00 stackarray
 20 S      0  2068  2067  0  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 S      0  2069  2068  9  73 20 fb11f7b8   60  fb11f7b8   ttyp1    00:00:00 sh
 20 O      0  2073  2069  8  48 20 fb11f910  148         -   ttyp1    00:00:00 ps


Here we see that stackarray has only used 16K of memory (size) when it first loads.




availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2067  2041 10  73 20 fb11b080 2064  fb11b080   ttyp1    00:00:00 stackarray
 20 S      0  2074  2067  1  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 S      0  2075  2074 21  73 20 fb11f7b8   60  fb11f7b8   ttyp1    00:00:00 sh
 20 O      0  2079  2075  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps


After the function is called, memory usage goes up to 2064K, and notice that availsmem goes down accordingly (14117 vs 14629). But "freemem" stays about the same, because we really haven't done anything with those pages yet- they are allocated, which affects availsmem, but no physical RAM has been assigned to them, and won't be unless and until we write something into them- which we don't in this test.
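The drop in availsmem matches the size of the array exactly, which is easy to verify. A small sketch of the arithmetic (the function name is only for illustration):

```shell
#!/bin/sh
# pages_for_bytes BYTES
# Hypothetical helper: number of 4k pages needed to hold BYTES, rounded up.
pages_for_bytes() {
    echo $(( ($1 + 4095) / 4096 ))
}

# The 2 MB array declared in memfunc()
pages_for_bytes $(( 2 * 1024 * 1024 ))    # prints 512

# availsmem before and after the function was called
echo $(( 14629 - 14117 ))                 # prints 512
```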



availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2067  2041 20  43 20 fb11b080 2064         -   ttyp1    00:00:00 stackarray
 20 S      0  2080  2067  2  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2081  2080 18  44 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2085  2081  6  48 20 fb11f910  148         -   ttyp1    00:00:00 ps


After the function returns, the space is still being shown as used, but of course it all comes back when the program exits:



# ./once
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014657   
f020e11c:  0000000250   
f020e118:  0000015345   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2086  2041 30  39 20 fb11b080   60         -   ttyp1    00:00:00 sh
 20 O      0  2090  2086  7  48 20 fb11f258  148         -   ttyp1    00:00:00 ps
 20 S      0  2041  2040  4  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh


Now let's try the static array:



# ./staticarray
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2091  2041  2  73 20 fb11b080 2064  fb11b080   ttyp1    00:00:00 staticarray
 20 S      0  2092  2091  2  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  1  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2093  2092 31  39 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2097  2093  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps


The immediate difference is that the memory use shows up right away as we'd expect. Still no usage of real RAM, and for the same reason.



availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2091  2041 22  42 20 fb11b080 2064         -   ttyp1    00:00:00 staticarray
 20 S      0  2098  2091  1  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2099  2098 17  45 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2103  2099  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2091  2041 50  32 20 fb11b080 2064         -   ttyp1    00:00:00 staticarray
 20 S      0  2104  2091  3  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2105  2104 28  40 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2109  2105  5  49 20 fb11f910  148         -   ttyp1    00:00:00 ps



Note in all of this, we still only had 1 MB of swap to work with. In these programs alone we allocated more than 2 MB of space, not even counting the 50 or 60 megabytes being used for other programs. This proves that you do not need swap for virtual memory if you have sufficient real memory. Also note that "swap -l" never changes, because no swap has been used (swap -l wouldn't show you vm usage anyway).

What happens if we turn things upside down? To find out, I put swap back at 128 MB, and forced memory to 48 MB by typing



mem=0k-639k,1m-16m,16m-48m/s/n


at the boot prompt before booting. The "once" program shows this before starting up anything other than the single login:



# ./once
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000036463   
f020e11c:  0000032000   
f020e118:  0000005219   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   817   809 33  73 20 fb11de30   60  fb11de30   ttyp0    00:00:00 sh
 20 O      0   821   817  5  49 20 fb11e238  148         -   ttyp0    00:00:00 ps



No swapping, 20 meg or so free. Now start up X and Netscape:



# ./once
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030911   
f020e11c:  0000031946   
f020e118:  0000000039   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255576
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   940   809 17  73 20 fb119f08   60  fb119f08   ttyp0    00:00:00 sh
 20 S      0   809   808  0  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 O      0   944   940  9  48 20 fb11ea48  148         -   ttyp0    00:00:00 ps


It had to use a little swap to get Netscape up.



# ./stackarray
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030873   
f020e11c:  0000031944   
f020e118:  0000000044   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   946   809  3  73 20 fb119f08   16  fb119f08   ttyp0    00:00:00 stackarray
 20 S      0   809   808  2  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   947   946  3  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 S      0   948   947 31  73 20 fb11ecf8   60  fb11ecf8   ttyp0    00:00:00 sh
 20 O      0   952   948  6  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030361   
f020e11c:  0000031944   
f020e118:  0000000043   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   946   809 22  73 20 fb119f08 2064  fb119f08   ttyp0    00:00:00 stackarray
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   953   946  1  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 R      0   954   953 26  41 20 fb11ecf8   60         -   ttyp0    00:00:00 sh
 20 O      0   958   954  7  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030361   
f020e11c:  0000031944   
f020e118:  0000000043   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0   946   809 58  29 20 fb119f08 2064         -   ttyp0    00:00:00 stackarray
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   959   946  3  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 R      0   960   959 30  39 20 fb11ecf8   60         -   ttyp0    00:00:00 sh
 20 O      0   964   960  6  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps


Notice that free swap changes a little bit at first, but then stays the same while availsmem goes down. This shows that "swap -l" means nothing with regard to availsmem- these are entirely separate statistics.

But what if you use the memory?

Now let's try something else. We'll modify the "stackarray" code so that it actually uses the memory:



/* stackarray.c with actual use of array */
#include <stdlib.h>
main()
{
system("./once"); 
memfunc();
outfunc();
}
outfunc() {
        system("./once");
}
memfunc()
{
int x;
char array[2 * 1024 * 1024];
        outfunc();
        for (x=0; x < 2 * 1024 * 1024; x+= 4096) {
                array[x]=x;
        }
}






When we run it, there's an interesting difference: notice that "availsmem" drops as soon as the array is allocated and then stays the same, while "freemem" only goes down after the memory is actually used. That's because until we actually put something in the array, it's just pointers to virtual memory- no real memory gets allocated until we really need it. This run is with 128 MB of memory and 128 MB of swap, but it shows what actually happens (there is no difference when run with 1 MB of swap- only the "swap" figures change):



availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047704   
f020e11c:  0000032000   
f020e118:  0000016512   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 S      0  2190  1351  1  73 20 fb11e798   16  fb11e798   tty03    00:00:00 stackarray
 20 S      0  2191  2190  1  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2192  2191 21  43 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2196  2192  6  48 20 fb11f660  148         -   tty03    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047192   
f020e11c:  0000032000   
f020e118:  0000016511   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 R      0  2190  1351 31  39 20 fb11e798 2064         -   tty03    00:00:00 stackarray
 20 S      0  2197  2190  4  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2198  2197 30  39 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2202  2198  6  48 20 fb11f660  148         -   tty03    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047192   
f020e11c:  0000032000   
f020e118:  0000016000   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 R      0  2190  1351 79  21 20 fb11e798 2064         -   tty03    00:00:00 stackarray
 20 S      0  2203  2190  3  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2204  2203 29  40 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2208  2204  6  48 20 fb11f660  148         -   tty03    00:00:00 ps


Note that this also means that if a program requests (allocates) but does not actually use memory (as was the case in the previous tests), you can have the strange circumstance where you have free memory (because no physical pages have been allocated) but you can't run any more programs because you have run out of virtual memory (availsmem). That doesn't mean you need more swap, it means you need more availsmem- adding either more swap or more real RAM will fix the problem.

Of course, if you want to run that right now, adding swap is easier than adding memory. To add swap, you could just do:



touch /mynewswapfile
swap -a /mynewswapfile 256000



which instantly and magically adds another 128 MB of swap to your system. Note that's not permanent; you'd need to redo it at every boot.
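The size given to "swap -a" is in 512-byte blocks, so it's worth doing the arithmetic before picking a number. This sketch assumes a megabyte means 1024 * 1024 bytes; the function name is made up:

```shell
#!/bin/sh
# mb_to_blocks MEGABYTES
# Hypothetical helper: 512-byte blocks in MEGABYTES (1 MB = 1024 * 1024 bytes).
mb_to_blocks() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

mb_to_blocks 128             # prints 262144

# The round figure used in the example above, in bytes
echo $(( 256000 * 512 ))     # prints 131072000
```

By that measure the 256000 blocks used above is a convenient round number, a little under a full 128 MB, which is close enough for this purpose.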

So how much swap do you need? Who knows? What kind of programs are you running? How much data and stack do they need? How much dump space do you think you'll need? That's the only way you could try to really figure it out; most folks just add 50% to memory and hope for the best. But if the only use is for dump, does that make any sense? Will you just be adding 50% more memory?

With today's large hard drives, it doesn't cost much to configure more swap than you think you'll ever need. When does it become ridiculous though? If you currently run 128 MB of memory, should you size for potentially having 256 MB? 512? A gigabyte?

If you have a separate dump device, and lots of real memory, you may not need much swap at all. I think I'd always configure some, just in case there's some kernel code somewhere that expects it, but that may not even be necessary- in fact, it's darn unlikely. You should certainly understand (despite the "common knowledge" to the contrary) that virtual memory doesn't NEED swap- swap can and will be used for vm, but it isn't REQUIRED. As for dump space, it's needed if you ever need it AND you expect to be analyzing the results. Otherwise, it truly is wasted space, and while today's hard drives are inexpensive, you might have better use for that space.


Bela Lubkin was kind enough to make some comments and suggestions on this article which caused me to rewrite a few sections of it trying to make things clearer. Whether I succeeded or not, I thought it would be good to add his actual comments here also, and he agreed to let me publish his email. What follows are extracts from those emails with explanatory comments from me in italics.

Here I had said that I hadn't stressed "freemem" in the original article because it didn't seem important to me in the context of writing about swap:

(Bela)

It is definitely important. freemem measures the amount of actual RAM that isn't currently in use, while availsmem measures the amount of virtual space that hasn't yet been promised to someone. availsmem is the upper bound on how much [measured in terms of memory usage] you can run at all. freemem is the upper bound on how much you can run without actually performing swap I/O, which is rather costly in performance. If you were to graph performance vs. memory usage, you would see something like this:




100% |====================================
     |                                   /=====
     |                                 (1)     =====
     |                                              =====
     |                                                  /==
     | (1) memory getting tight(*); kernel starts to  (2)  ==
     |     page non-dirty pages out of executable            ==
     |     binaries and other such read-only sources           ==
     |                                                           ==
     | (2) freemem approaching 0 (crosses GPGSHI), kernel         /=
     |     starts paging dirty pages out to swap                (3)=
     |                                                             =
     | (3) all inactive pages have been written out to swap(@);     =
     |     active pages start getting written out to swap;          =
     |     system starts to thrash                                  =
     |
  0% +-----------------------------------------------------------------------------



       (*) I'm not sure what the exact technical threshold is here.  It
           isn't GPGSHI, it isn't MINASMEM...



       (@) This isn't a technical threshold; more of a user tolerance
           threshold.  As you run more stuff, and as that stuff touches
           more of its memory more frequently, and as a higher
           percentage of that memory gets pushed out to swap,
           performance is going to degrade rapidly until the user finds
           it intolerable.



           Also, there's no real distinction between "active" and "not
           active" pages, in this context.  The question is, on average,
           how long will it be before this page that's being written out
           to swap will be needed in RAM again?  The kernel has
           strategies which make this average quite high when things
           aren't too tight.  As memory tightens, the average
           necessarily goes down.  When a significant portion of memory
           accesses actually become disk accesses, performance is
           extremely degraded; time to add more RAM.



freemem is important for performance; availsmem is important for being
able to run things at all -- quickly or not.



=============================================================================


The independence of these variables is also quite confusing. For instance, availsmem can approach 0 while freemem is still a large number. It's easy -- just run a lot of programs (like the examples in this question) which *allocate* a lot of memory, but never touch it. Suppose a system had 64MB RAM and 256MB swap. It would start out with availsmem around 80000 (*4K) and freemem around 15000. Now run 10 instances of a simple program that allocates 30MB of RAM, but doesn't touch it. These will decrement availsmem by about 300MB == 75000, leaving about 5000. But they won't take up an appreciable amount of real RAM, so freemem's still around 15000. Now try to run one more copy. There is still plenty of freemem; `sar -r` "freemem" looks fine. But you get EAGAIN because the program can't allocate another 7500 pages of availsmem.

What good is this mechanism doing? Well, what if all those programs *did* suddenly start touching their memory. The kernel would have to find actual backing store for those pages -- either RAM or swap. availsmem tells it how many pages of that backing store are not yet claimed. Thus, it can prevent a process from starting, which will require memory that might eventually not be available. The mechanism has an underlying assumption that processes will *not*, in general, allocate huge amounts of memory that they won't actually use. When that assumption is broken, the mechanism is over-protective -- it prevents you from using your RAM just because someone is hogging (and not using) address space.
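Bela's example is easy to play with as a toy model. This sketch only imitates the accounting he describes, using his rounded numbers (80000 pages to start, 7500 pages per 30 MB program); it is not how the kernel is actually written, and "try_start" is a made-up name:

```shell
#!/bin/sh
# Toy model of availsmem reservation; nothing here touches the real kernel.
availsmem=80000   # pages: roughly 64 MB RAM + 256 MB swap, as in the example
need=7500         # pages each 30 MB program reserves (rounded as in the example)
started=0

# try_start: reserve pages for one more program, or fail if there
# aren't enough unclaimed pages left (the real kernel returns EAGAIN).
try_start() {
    if [ "$availsmem" -ge "$need" ]; then
        availsmem=$(( availsmem - need ))
        started=$(( started + 1 ))
        return 0
    fi
    return 1
}

i=0
while [ "$i" -lt 11 ]; do
    i=$(( i + 1 ))
    if try_start; then
        echo "program $i started, availsmem now $availsmem"
    else
        echo "program $i refused (EAGAIN), availsmem still $availsmem"
    fi
done
```

Ten programs start, leaving 5000 pages, and the eleventh is refused even though none of them has touched any real RAM.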

I had responded with:



> And there's a common misunderstanding: most folks seem to
> believe that it has to be swap, and it doesn't.  If it did,
> I'd never be able to run 60 MB of programs when swap was 1
> MB.    I could be wrong, but I believe that the source of
> this is that older Unices (like Sun 4.x releases) actually
> DID require swap space, and couldn't use ram- but I don't
> have one of those anymore to mess with, so I can't be sure..



(Bela)

Some versions of Unix use static mappings of virtual space to swap space. The act of allocating virtual space (whether through malloc() (== [s]brk()), growing your stack, fork(), or initial mapping of a process's .data and .bss) also allocates matching pages of swap. In such a system, the total weight of processes that you can run at once equals your swap space. OSR5 doesn't bind virtual space to specific swap pages; as a result, it can use *all* of the potential backing store -- both RAM and swap -- to hold the combined weight of processes.

Then I asked:



> One more thing: suppose you actually had 4 gig of RAM.  I'd
> assume, given what I think I know about OSR5, that there
> would be no point in having swap (assuming separate dump) at
> all- that availsmem couldn't exceed 4 gig anyway?  Not
> anything  I can test with my cheap hardware!



availsmem is counted in 4K pages. 4GiB == 1048576. So availsmem itself doesn't limit (RAM + swap) to 4GiB.

I don't know the answer to your broader question. In principle, I see no reason you couldn't have multiple large swap spaces. No one swap area can be larger than 4GiB, but the total size of swap can be much larger. As long as the swap page number can be stored in a 32-bit integer, it should be usable. Of course, you could never have more than 4GiB worth of it in memory at once -- so if you were really using all that swap, the ratio of slow-access to fast-access pages would be bad -- performance would bite.

As long as I had him on the hook, I figured I might as well ask all my questions:



> Sun used to give a swap sizing guide for Solaris 2.x that
> went down as memory increased, and swap disappeared entirely
> eventually.  Is there anything in OSR5 that would assume the
> existence of a swap device and cause a problem if there
> wasn't any?



I don't know, but I've deliberately run test systems with no swap at all (boot keyword "swap=none") for long enough to feel reasonably safe about it.


© September 1999 A.P. Lawrence. All rights reserved



