APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed













Swap and Dump




This is part of a series of articles that covers the booting of an OSR5 machine. See Booting OSR5 for other related articles.

Thanks to Bela Lubkin for some comments and suggestions that helped clarify this article.

Normally, swap and dump are the same device. You could change that by editing the file /etc/conf/cf.d/sassign and relinking a new kernel. But why would you want to? There might be reasons, but first we need to understand what swap and dump are for.

What are these for, anyway?

Dump is pretty simple. Its only purpose is to receive a kernel dump. Therefore there are two immediately obvious things to be said about dump: unless you are a kernel or driver developer who expects to be regularly crashing your system, you may NEVER need dump at all, and secondly, if you do ever need it, it had better be big enough to hold everything currently in memory. That's what a dump is: the contents of memory. If you have 256 MB of memory, you'd need 256 MB of dump space.

What if you had 4 GiB of memory? OSR5 does support that, unlikely as it used to be for most of us. Funny that its maximum file size is still 2 GiB, while memory goes to the full 4 GiB. But there's a problem there: to analyze a system dump, you normally would copy it (using dd) from the swap or dump device to a file on disk. But since the maximum size of a file is currently 2 GiB on OSR5, you could only copy half of a 4 GiB dump. Given that, it would seem that there would be little value in having dump be more than 2 GiB no matter how much memory you had.
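That size argument can be written down as a quick check. This is only a sketch: the helper name dump_fits_in_file is made up for illustration, and it assumes the file limit sits just under 2 GiB, so a dump of exactly 2048 MB is treated as too big.

```shell
#!/bin/sh
# Hypothetical helper: can a full memory dump of MEM_MB megabytes be copied
# into one regular file, given the 2 GiB file size limit described above?
# Everything is in megabytes so the arithmetic stays small.
MAX_FILE_MB=2048   # 2 GiB limit, from the text

dump_fits_in_file() {
    mem_mb=$1
    # Assumption: the file has to be strictly smaller than 2 GiB
    if [ "$mem_mb" -lt "$MAX_FILE_MB" ]; then
        echo yes
    else
        echo no
    fi
}

dump_fits_in_file 256    # prints: yes
dump_fits_in_file 4096   # prints: no
```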

However, it turns out that "crash" (which is what you'd use to analyze the dump) can read directly from the dump device- you don't have to transfer it to a file. So in the case of a machine with more than 2 GiB of RAM, you might very well want a separate dump device. Even if you aren't capable of crash analysis, you could write the dump directly from the dump device to tape so that it could be sent to someone else. Are you likely to do that? Probably not. When most of us have panic problems, we either have a pretty good idea where it's coming from (because we just added something) or we just start ripping things out until the problem goes away.

The most common crash is a Panic Trap 0x0000000E, and that's often bad memory, which is both cheap and easy to fix.

Given the uses your machine is put to, what are the chances that you'd run "crash" to debug a dump or pay money for someone else to do it? Maybe you don't need that dump space at all? Remember though: if you have a crash and don't have enough dump space, it may be too late to change your mind.

You can test dumping with the "sysdump" program. This is actually a tool for copying, compressing and uncompressing dumps, but it can also be used to create a dump on demand. That could be useful if a kernel needs to be professionally examined or if you just want to test your dump device. You could, for example, run

/etc/sysdump -i /dev/mem -n /unix -o /dev/swap
 

Bela Lubkin commented on that:


This is a decent test, but!  If you do this to a running system which is
actually swapping, you will have a big problem.  You should have them
run `swap -l` first to make sure swap is idle.  You could then go into a
big discussion of what to do if it isn't, but I think that would be a
distraction.  (If it isn't: could run `swap -d /dev/swap` and see if
that works -- it will start shoving stuff in from swap, and it'll
succeed if there's enough RAM, as would be the case if you swapped once
for a little while but current memory requirements fit within RAM.
Otherwise they could make a swap _file_ and `swap -a` it, then `swap -d
/dev/swap` and it'll shuffle stuff from one to the other.)

But as I said, a distraction.  All you really need is to warn them --
don't try this if swap is in use ("blocks" != "free" in `swap -l`).
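Bela's "is swap idle?" test is easy to automate. The following is a rough sketch, not anything SCO ships: swap_is_idle is a hypothetical name, and it assumes the five-column `swap -l` layout (path, dev, swaplo, blocks, free) seen in the listings in this article.

```shell
#!/bin/sh
# Hypothetical helper: read `swap -l` style output on stdin and report
# whether every swap area is idle (blocks column equals free column).
swap_is_idle() {
    awk '
        NR == 1 { next }                      # skip the header line
        NF >= 5 && $4 != $5 { busy = 1 }      # blocks != free: in use
        END { print (busy ? "in use" : "idle") }
    '
}

# Sample input in the same format as the swap -l listings in this article
printf '%s\n' \
  'path               dev  swaplo blocks   free' \
  '/dev/swap          1,41      0   2000   2000' | swap_is_idle
```

On a live system you would presumably feed it directly, as in `swap -l | swap_is_idle`, and only go ahead with sysdump if it says "idle".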
 

You specify where dump is to go by a "dump" keyword (dump=/dev/mydump) in /etc/default/boot or passed on the boot command line. Unlike swap, dump can't span multiple devices or use a file in the filesystem. You can also say "dump=none" if you definitely don't want to save anything.

See http://aplawrence.com/cgi-bin/ta.pl?arg=105935 for more information.

If you really want to test by creating an actual panic, see http://aplawrence.com/cgi-bin/ta.pl?arg=103679.

Swap is more complex. Swap actually has two purposes: to store processes if and when the kernel gets so low on free memory that it has to swap them out, and to serve as backing store for virtual memory. And that is a concept that is often misunderstood. For example, it is "common knowledge" that you need as much swap as you have memory, even if you have a separate dump device. It turns out that that is not true, at least not on OSR5 systems.

As I write this, my machine has 128 MB of ram and just 1 MB of swap configured. Here's "swap -l" and "memsize":

# swap -l
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
# memsize
129626112
 

You might think I couldn't do much with that configuration, but that's not the case. Here's what's happening right now:

# ps -el
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 71 S      0     0     0  0  95 20 fb117000    0  f03ab55c       ?    00:00:00 sched
 20 S      0     1     0  0  66 20 fb117158  140  e0000000       ?    00:00:01 init
 71 S      0     2     0  0  95 20 fb1172b0    0  f01a7aa0       ?    00:00:00 vhand
 71 S      0     3     0  0  95 20 fb117408    0  f001f820       ?    00:00:00 bdflush
 71 S      0     4     0  0  95 20 fb117560    0  f0228004       ?    00:00:00 kmdaemon
 71 S      0     5     1  0  95 20 fb1176b8    0  c01b0150       ?    00:00:02 htepi_daemon
 71 S      0     6     0  0  95 20 fb117810    0  f02ef708       ?    00:00:00 strd
 20 S      0   752     1  0  73 20 fb117968  160  fb117968   tty01    00:00:00 login
 20 S      0    48     1  0  76 20 fb117ac0  108  f0240ce4       ?    00:00:00 syslogd
 20 S      0    52     1  0  73 20 fb117c18  340  fb117c18       ?    00:00:00 ifor_pmd
 20 S      0    53    52  0  76 20 fb117d70  448  f0240ce4       ?    00:00:01 ifor_pmd
 71 S      0    41     1  0  95 20 fb117ec8    0  c102b150       ?    00:00:00 htepi_daemon
 20 S      0    79     1  0  75 24 fb118020   52  fce282d0       ?    00:00:00 strerr
 20 S      0    92     1  0  76 20 fb118178  956  f0240ce4       ?    00:00:17 agent
 20 S      0    61    53  0  76 20 fb1182d0  284  f0240ce4       ?    00:00:00 sco_cpd
 20 S      0    62    53  0  76 20 fb118428  332  f0240ce4       ?    00:00:00 ifor_sld
 20 S      0   700     1  0  76 24 fb118580  556  f0240ce4       ?    00:00:00 httpd
 20 S     28   753   722  0  75 24 fb1186d8  100  fce2a0e8       ?    00:00:00 dnsserver
 20 S     28   749   722  0  75 24 fb118830  100  fce2a448       ?    00:00:00 dnsserver
 20 S      0   390     1  0  76 20 fb118988  144  fc1ffb56       ?    00:00:00 cron
 20 S     28   701   700  0  75 24 fb118ae0  560  fce2a7a8       ?    00:00:00 httpd
 20 S      0   253     1  0  66 20 fb118c38  156  e0000000       ?    00:00:00 pwrd
 20 S      0   435     1  0  76 20 fb118d90  256  f0240ce4       ?    00:00:00 pppd
 71 S      0   190     1  0  95 20 fb118ee8    0  c102f150       ?    00:00:00 htepi_daemon
 71 S      0   194     1  0  95 20 fb119040    0  c1033150       ?    00:00:00 htepi_daemon
 71 S      0   198     1  0  95 20 fb119198    0  c1037150       ?    00:00:00 htepi_daemon
 71 S      0   202     1  0  95 20 fb1192f0    0  c103b150       ?    00:00:00 htepi_daemon
 20 S      0   255   253  0  76 20 fb119448   64  fc1fc7c6       ?    00:00:00 listen
 20 S      0   264     1  0  76 20 fb1195a0   64  f0240ce4       ?    00:00:00 dlpid
 20 S      0   529   528  0  76 20 fb1196f8 1060  f0240ce4       ?    00:00:00 ns-admin
 20 S      0   405     1  0  76 20 fb119850  216  fc1f55b8       ?    00:00:01 lpsched
 20 S      0   434     1  0  76 24 fb1199a8  140  f0240ce4       ?    00:00:00 inetd
 20 S      0   436   435  0  76 20 fb119b00  284  f0240ce4       ?    00:00:00 pppd
 20 S      0  1415   729  0  66 24 fb119c58   56  e0000000       ?    00:00:00 sleep
 20 S      0   445     1  0  76 24 fb119db0  108  f0240ce4       ?    00:00:00 lpd
 20 S      0   823   436  0  66 20 fb119f08  284  e0000000       ?    00:00:00 pppd
 20 S      0   457     1  0  76 24 fb11a060  304  f0240ce4       ?    00:00:00 snmpd
 20 S      0   999   755  0  73 20 fb11a1b8   68  fb11a1b8   tty03    00:00:00 sh
 20 S     17   462     1  0  66 20 fb11a310  136  e0000000       ?    00:00:00 deliver
 20 S      0   530   529  0  75 20 fb11a468  660  fce2ad48       ?    00:00:00 ns-admin
 20 S      0  1006   999  0  73 20 fb11a5c0   68  fb11a5c0   tty03    00:00:00 sh
 20 S      0   528     1  0  66 20 fb11a718  388  e0000000       ?    00:00:00 ns-admin
 20 S      0   669   665  0  76 20 fb11a870  676  f0240ce4       ?    00:00:00 vfsd
 20 S      0   665     1  0  76 20 fb11a9c8  676  f0240ce4       ?    00:00:00 vfsd
 20 S      0  1007  1006  0  73 20 fb11ab20   88  fb11ab20   tty03    00:00:00 edge.nightly
 20 S      0   563     1  0  76 24 fb11ac78  240  fc206406       ?    00:00:01 logsrv
 20 S      0   664     1  0  76 18 fb11add0  200  f0240ce4       ?    00:00:00 vfslockd
 20 S      0   670   669  0  76 20 fb11af28  676  f0240ce4       ?    00:00:00 vfsd
 20 R      0  1618   799  2  76 20 fb11b080  236         -   tty01    00:00:01 vi
 20 S     28   722     1  0  76 24 fb11b1d8  692  f0240ce4       ?    00:00:01 squid
 20 S      0   711     1  0  76 24 fb11b330  196  fc205e8e       ?    00:00:00 calserver
 20 S      0   713   711  0  76 24 fb11b488  204  fc205c36       ?    00:00:00 calserver
 20 S      0   715     1  0  76 24 fb11b5e0   88  fc209ab6       ?    00:00:00 caldaemon
 20 S      0   696     1  0  76 24 fb11b738  320  f0240ce4       ?    00:00:00 scohttpd
 20 S      0   729     1  0  73 24 fb11b890   60  fb11b890       ?    00:00:00 sh
 20 S     28   750   722  0  75 24 fb11b9e8  100  fce2a328       ?    00:00:00 dnsserver
 20 R      0  1138  1007 30  39 20 fb11bb40 8720         -   tty03    00:14:53 edge
 20 S     28   738   722  0  75 24 fb11bc98  100  fce2a568       ?    00:00:00 dnsserver
 20 S    201  1063   754  0  73 20 fb11bdf0   60  fb11bdf0   tty02    00:00:00 sh
 20 S     28   751   722  0  75 24 fb11bf48  100  fce2a208       ?    00:00:00 dnsserver
 20 S      0   754     1  0  73 20 fb11c0a0  160  fb11c0a0   tty02    00:00:00 login
 20 S      0   755     1  0  73 20 fb11c1f8  160  fb11c1f8   tty03    00:00:00 login
 20 S      0   756     1  0  75 20 fb11c350  124  f02289f4   tty04    00:00:00 getty
 20 S      0   757     1  0  75 20 fb11c4a8  124  f0228a5c   tty05    00:00:00 getty
 20 S     28   758   722  0  76 24 fb11c600  136  f0240ce4       ?    00:00:00 ftpget
 20 S      0   759     1  0  75 20 fb11c758  124  f0228ac4   tty06    00:00:00 getty
 20 S      0   760     1  0  75 20 fb11c8b0  124  f0228b94   tty08    00:00:00 getty
 20 S      0   761     1  0  75 20 fb11ca08  124  f0228bfc   tty09    00:00:00 getty
 20 S      0   762     1  0  75 20 fb11cb60  124  f0228c64   tty10    00:00:00 getty
 20 S      0   763     1  0  75 20 fb11ccb8  124  f0228ccc   tty11    00:00:00 getty
 20 S      0   764     1  0  75 20 fb11ce10  124  f0228d34   tty12    00:00:00 getty
 20 S     28   765   722  0  76 24 fb11cf68   48  f083d640       ?    00:00:00 unlinkd
 20 S      0   766     1  0  81 20 fb11d0c0  104  fc20a4e0       ?    00:00:00 sdd
 20 S      0   792   752  0  73 20 fb11d218   68  fb11d218   tty01    00:00:00 sh
 20 S      0   799   792  0  73 20 fb11d370  128  fb11d370   tty01    00:00:00 ksh
 20 S    201  1076  1063  0  73 20 fb11d4c8   68  fb11d4c8   tty02    00:00:00 sh
 20 S    201  1080  1076  0  73 20 fb11d620  212  fb11d620   tty02    00:00:00 xinit
 20 S    201  1081  1080  0  76  0 fb11d778 11788  f0240ce4   tty02    00:03:24 Xsco
 20 S    201  1083  1081  0  76  0 fb11d8d0 1112  f09056c0   tty02    00:00:01 vbiosd
 20 S    201  1145  1088  0  76 20 fb11da28 2736  f0240ce4   tty02    00:00:15 xdt3_binary
 20 S    201  1205  1088  0  76 20 fb11db80  712  f0240ce4   tty02    00:00:01 pmwm
 20 S    201  1088  1080  0  76 20 fb11dcd8  504  f0240ce4   tty02    00:00:00 scosession
 20 S    201  1206  1145  2  76 20 fb11de30 13056  f0240ce4   tty02    00:02:00 netscape-expor
 30 S      0  1179  1138  1  81 20 fb11df88  612  f031fd64   tty03    00:00:31 edge
 20 S    201  1212  1206  0  76 20 fb11e0e0 3940  f0240ce4   tty02    00:00:00 netscape-expor
 20 S     28  1663  1662  0  75 20 fb11e238  468  fce2cd10       ?    00:00:00 httpd
 20 S     28  1664  1662  0  75 20 fb11e390  468  fce2cd10       ?    00:00:00 httpd
 20 S     28  1665  1662  0  75 20 fb11e4e8  468  fce2cd10       ?    00:00:00 httpd
 20 S     28  1666  1662  0  75 20 fb11e640  468  fce2cd10       ?    00:00:00 httpd
 20 S      0  1662     1  0  76 20 fb11e798  452  f0240ce4       ?    00:00:00 httpd

 20 S      0  1668  1618  3  73 20 fb11ea48   60  fb11ea48   tty01    00:00:00 sh
 20 O      0  1669  1668  8  48 20 fb11eba0  148         -   tty01    00:00:00 ps
 

Notice that there's an X session with Netscape running on tty02 and an Edge backup going on on tty03, plus this editing on tty01, plus all the system stuff including Squid and Apache! Another interesting thing to note is the sum of the "size" column:

# ps -e -o size| awk '{ sum += $1
}
END { print "Sum", sum }'
Sum 62256
 

There's over 62 megabytes worth of programs being run right now; "sar -r" confirms that:

# sar -r 1 1
SCO_SV scobox 3.2v5.0.4 Pentium    09/30/99

12:14:56 freemem freeswp (-r)
12:14:57   13942    2000
 

Memory is expressed in 4k pages here and swap is 512 byte blocks, so that's 57,106,432 bytes of free memory and 1 MB of swap.
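If you get tired of doing those conversions by hand, a couple of one-line shell functions will do it. The function names here are my own invention; the 4096 and 512 unit sizes are the ones just mentioned.

```shell
#!/bin/sh
# Unit sizes taken from the text: 4 KB memory pages, 512-byte swap blocks.
PAGE_SIZE=4096
BLOCK_SIZE=512

pages_to_bytes()  { echo $(( $1 * PAGE_SIZE )); }
blocks_to_bytes() { echo $(( $1 * BLOCK_SIZE )); }

pages_to_bytes 13942    # freemem from sar -r, prints 57106432
blocks_to_bytes 2000    # freeswp from sar -r, prints 1024000
```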

This happens to be a 5.0.4 machine. On 5.0.5, sar -r shows two other columns- we'll get to those shortly.

There are 13942 free pages, which means that (13942 * 4096) 57,106,432 bytes are free. The system started out (after loading the kernel and allocating its buffers and variables) with 27,589 pages.

You get that from availrmem- on 5.0.5 that shows up in sar -r, on older systems you can do:

# echo "od -d availrmem" | crash
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e120:  0000027589   

 

The figure doesn't remain completely constant, but will remain close to the same amount most of the time.
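For scripting, it helps to strip that crash output down to a bare number. A minimal sketch, assuming the output always looks like the sample above (an address, a colon, then a zero-padded decimal value); crash_value is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: pull the decimal value out of a line such as
#   f020e120:  0000027589
# as printed by "od -d" inside crash, dropping the leading zeros.
crash_value() {
    awk '/^[0-9a-f]+:/ { print $2 + 0 }'
}

# Sample input copied from the session above
printf '%s\n' \
  'dumpfile = /dev/mem, namelist = /unix, outfile = stdout' \
  'f020e120:  0000027589   ' | crash_value
```

That prints 27589. On a real OSR5 box the equivalent would be something like `echo "od -d availrmem" | crash | crash_value`.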

As 27589 - 13942 = 13647, and that times 4096 is 55,898,112, obviously some memory usage changed between the times I did these samples (the total available user memory minus the currently free pages should be the memory in use). This script tries to get closer to making everything agree:

echo "od -d freemem" | crash &
ps -e -o size| awk '{ sum += $1
}
END { print "Sum", sum }'
 

That "freemem" is the same thing sar -r reports. I put the "crash" session in background so that it has a chance of being included in the ps output; the results are:

Sum 66476
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e118:  0000012961   
 

That comes out closer, but you'll never get exact on a system that's working while you are measuring. The important point is that here we have a system with 60 megabytes or so of memory in use, and it is running quite happily with 1 MB of swap. Why?

The confusion of the "common knowledge" is due to the fact that virtual memory is what is really important, and virtual memory is the sum of available memory and available swap. With nothing running, that would be availrmem plus swap. Obviously you never have nothing running, so the sum of available ram and available swap is kept track of in a kernel variable "availsmem" (note "availSmem" vs. "availRmem"). Prior to R5.0.5, the only way to find out what the value of it was at any time was to run

echo "od -d availsmem" | crash
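To put a number on "available memory plus available swap", here is a simplified sketch of the ceiling with nothing running. It assumes both figures end up in 4 KB pages, so swap's 512-byte blocks are divided by 8; the function name is hypothetical, and the real kernel accounting is surely more involved than one addition.

```shell
#!/bin/sh
# Sketch: upper bound on virtual memory in 4 KB pages, following the text:
# availrmem (pages) plus swap (512-byte blocks, 8 blocks per page).
vm_ceiling_pages() {
    availrmem_pages=$1
    swap_blocks=$2
    echo $(( availrmem_pages + swap_blocks / 8 ))
}

vm_ceiling_pages 27589 2000     # this machine with 1 MB of swap, prints 27839
```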
 

Starting with 5.0.5, "sar -r" lists availsmem and availrmem (amount of ram not being used by the kernel). So let's do some testing to see what happens here when we ask for even more memory. First a little shell script:

# cat once
#!/bin/sh
# "once"
echo availsmem freeswap freemem
echo "od -d availsmem
od -d freeswap
od -d freemem" | crash
swap -l
ps -l
# ./once
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000012163   
f020e11c:  0000000250   
f020e118:  0000012795   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   752     1  0  73 20 fb117968  160  fb117968   tty01    00:00:00 login
 20 S      0  1618   799  1  76 20 fb11b080  240  f0f20300   tty01    00:00:04 vi
 20 S      0   792   752  0  73 20 fb11d218   68  fb11d218   tty01    00:00:00 sh
 20 S      0   799   792  0  73 20 fb11d370  128  fb11d370   tty01    00:00:00 ksh
 20 S      0  2009  1618  1  73 20 fb11f258   60  fb11f258   tty01    00:00:00 sh
 20 S      0  2010  2009 29  73 20 fb11f3b0   60  fb11f3b0   tty01    00:00:00 sh
 20 O      0  2014  2010  6  48 20 fb11f508  148         -   tty01    00:00:00 ps
 

That will give us a quick snapshot of what's happening. Now let's write some C programs to use some memory. The first allocates a 2 MB buffer on its stack, the second uses a static array. Both of them call the "once" script three times while running:

# cat stackarray.c
/* stackarray.c */
#include <stdlib.h>
main()
{
system("./once");
memfunc();
outfunc();
exit(0);
}
outfunc() {
        system("./once");
}
memfunc()
{
char array[2 * 1024 * 1024];
        outfunc();
}

# cat staticarray.c

/* staticarray.c */
#include <stdlib.h>
main()
{
system("./once");
memfunc();
outfunc();
exit(0);
}
outfunc() {
        system("./once");
}
memfunc()
{
static char array[2 * 1024 * 1024];
        outfunc();
}

# cc -o staticarray staticarray.c
# cc -o stackarray stackarray.c
# size stackarray staticarray
stackarray: 26396 + 4312 + 440 = 31148
staticarray: 26392 + 4312 + 2097592 = 2128296
 

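Note the difference between these in the last (.bss) column. That's because in stackarray the 2 MB array lives on the stack and won't be set up until the program runs and calls memfunc(), while the static array is counted in .bss at link time.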
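You can check that the .bss difference really is the array. This is a small sketch that assumes the "name: text + data + bss = total" layout of the size output above; bss_of is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `size` output lines of the form
#   name: text + data + bss = total
# and print each name followed by its bss size.
bss_of() {
    awk '{ sub(/:$/, "", $1); print $1, $6 }'
}

# Sample input copied from the size output above
printf '%s\n' \
  'stackarray: 26396 + 4312 + 440 = 31148' \
  'staticarray: 26392 + 4312 + 2097592 = 2128296' | bss_of

# The difference between the two bss figures is exactly the 2 MB array
echo $(( 2097592 - 440 ))       # prints 2097152
echo $(( 2 * 1024 * 1024 ))     # prints 2097152
```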

# ./stackarray
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014629   
f020e11c:  0000000250   
f020e118:  0000015313   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2067  2041  0  73 20 fb11b080   16  fb11b080   ttyp1    00:00:00 stackarray
 20 S      0  2068  2067  0  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 S      0  2069  2068  9  73 20 fb11f7b8   60  fb11f7b8   ttyp1    00:00:00 sh
 20 O      0  2073  2069  8  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
 

Here we see that stackarray has only used 16K of memory (size) when it first loads.


availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2067  2041 10  73 20 fb11b080 2064  fb11b080   ttyp1    00:00:00 stackarray
 20 S      0  2074  2067  1  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 S      0  2075  2074 21  73 20 fb11f7b8   60  fb11f7b8   ttyp1    00:00:00 sh
 20 O      0  2079  2075  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
 

After the function is called, memory usage goes up to 2064K, and notice that availsmem goes down accordingly (14117 vs 14629). But "freemem" stays about the same, because we really haven't done anything with those pages yet- they are allocated, which affects availsmem, but no physical RAM has been assigned to them, and won't be unless and until we write something into them- which we don't in this test.
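The drop in availsmem matches the array size exactly, which you can confirm with a line of shell arithmetic. A sketch, with a hypothetical function name and the 4 KB page size assumed throughout this article:

```shell
#!/bin/sh
# Sketch: size of a reservation in KB, from two availsmem samples
# taken before and after, both in 4 KB pages.
reserved_kb() {
    before=$1
    after=$2
    echo $(( (before - after) * 4 ))
}

reserved_kb 14629 14117    # prints 2048, the 2 MB array
```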

availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2067  2041 20  43 20 fb11b080 2064         -   ttyp1    00:00:00 stackarray
 20 S      0  2080  2067  2  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2081  2080 18  44 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2085  2081  6  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
 

After the function returns, the space is still being shown as used, but of course it all comes back when the program exits:

# ./once
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014657   
f020e11c:  0000000250   
f020e118:  0000015345   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2086  2041 30  39 20 fb11b080   60         -   ttyp1    00:00:00 sh
 20 O      0  2090  2086  7  48 20 fb11f258  148         -   ttyp1    00:00:00 ps
 20 S      0  2041  2040  4  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 

Now let's try the static array:

# ./staticarray
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0  2091  2041  2  73 20 fb11b080 2064  fb11b080   ttyp1    00:00:00 staticarray
 20 S      0  2092  2091  2  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  1  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2093  2092 31  39 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2097  2093  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
 

The immediate difference is that the memory use shows up right away as we'd expect. Still no usage of real RAM, and for the same reason.

availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2091  2041 22  42 20 fb11b080 2064         -   ttyp1    00:00:00 staticarray
 20 S      0  2098  2091  1  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2099  2098 17  45 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2103  2099  7  48 20 fb11f910  148         -   ttyp1    00:00:00 ps
availsmem freeswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000014117   
f020e11c:  0000000250   
f020e118:  0000015312   
path               dev  swaplo blocks   free
/dev/swap          1,41      0   2000   2000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0  2091  2041 50  32 20 fb11b080 2064         -   ttyp1    00:00:00 staticarray
 20 S      0  2104  2091  3  73 20 fb11f258   60  fb11f258   ttyp1    00:00:00 sh
 20 S      0  2041  2040  0  73 20 fb11f660   60  fb11f660   ttyp1    00:00:00 sh
 20 R      0  2105  2104 28  40 20 fb11f7b8   60         -   ttyp1    00:00:00 sh
 20 O      0  2109  2105  5  49 20 fb11f910  148         -   ttyp1    00:00:00 ps
# 
 

Note in all of this, we still only had 1 MB of swap to work with. In these programs alone we allocated more than 2 MB of space, not even counting the 50 or 60 megabytes being used for other programs. This proves that you do not need swap for virtual memory if you have sufficient real memory. Also note that "swap -l" never changes, because no swap has been used (swap -l wouldn't show you vm usage anyway).

What happens if we turn things upside down? To find out, I put swap back at 128 MB, and forced memory to 48 MB by typing

mem=0k-639k,1m-16m,16m-48m/s/n
 

at the boot prompt before booting. The "once" program shows this before starting up anything other than the single login:

# ./once
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000036463   
f020e11c:  0000032000   
f020e118:  0000005219   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   817   809 33  73 20 fb11de30   60  fb11de30   ttyp0    00:00:00 sh
 20 O      0   821   817  5  49 20 fb11e238  148         -   ttyp0    00:00:00 ps
# 
 

No swapping, 20 meg or so free. Now start up X and Netscape:

# ./once
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030911   
f020e11c:  0000031946   
f020e118:  0000000039   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255576
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   940   809 17  73 20 fb119f08   60  fb119f08   ttyp0    00:00:00 sh
 20 S      0   809   808  0  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 O      0   944   940  9  48 20 fb11ea48  148         -   ttyp0    00:00:00 ps
 

It had to use a little swap to get Netscape up.

# ./stackarray
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030873   
f020e11c:  0000031944   
f020e118:  0000000044   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   946   809  3  73 20 fb119f08   16  fb119f08   ttyp0    00:00:00 stackarray
 20 S      0   809   808  2  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   947   946  3  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 S      0   948   947 31  73 20 fb11ecf8   60  fb11ecf8   ttyp0    00:00:00 sh
 20 O      0   952   948  6  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030361   
f020e11c:  0000031944   
f020e118:  0000000043   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   946   809 22  73 20 fb119f08 2064  fb119f08   ttyp0    00:00:00 stackarray
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   953   946  1  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 R      0   954   953 26  41 20 fb11ecf8   60         -   ttyp0    00:00:00 sh
 20 O      0   958   954  7  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000030361   
f020e11c:  0000031944   
f020e118:  0000000043   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 255552
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 R      0   946   809 58  29 20 fb119f08 2064         -   ttyp0    00:00:00 stackarray
 20 S      0   809   808  1  73 20 fb11dcd8   60  fb11dcd8   ttyp0    00:00:00 sh
 20 S      0   959   946  3  73 20 fb11ea48   60  fb11ea48   ttyp0    00:00:00 sh
 20 R      0   960   959 30  39 20 fb11ecf8   60         -   ttyp0    00:00:00 sh
 20 O      0   964   960  6  48 20 fb11ee50  148         -   ttyp0    00:00:00 ps
 

Notice that free swap changes a little bit at first (255576 to 255552 blocks), but then stays the same while availsmem goes down (30873 to 30361). This shows that "swap -l" means nothing with regard to availsmem- these are entirely separate statistics.

But what if you use the memory?

Now let's try something else. We'll modify the "stackarray" code so that it actually uses the memory:

/* stackarray.c with actual use of array */
#include <stdlib.h>
main()
{
system("./once"); 
memfunc();
outfunc();
}
outfunc() {
        system("./once");
}
memfunc()
{
int x;
char array[2 * 1024 * 1024];
        outfunc();
        for (x=0; x < 2 * 1024 * 1024; x+= 4096) {
                array[x]=x;
        }
}


When we run it, there's an interesting difference: "availsmem" drops as soon as the array is allocated (the second snapshot) and then stays put, while "freemem" only goes down after the memory is actually written to (the third snapshot). That's because until we actually put something in the array, it's just pointers to virtual memory- no real memory gets allocated until we really need it. This run is with 128 MB of memory and 128 MB of swap, but it shows what actually happens (there is no difference when run with 1 MB of swap- only the "swap" figures change):

availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047704   
f020e11c:  0000032000   
f020e118:  0000016512   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 S      0  2190  1351  1  73 20 fb11e798   16  fb11e798   tty03    00:00:00 stackarray
 20 S      0  2191  2190  1  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2192  2191 21  43 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2196  2192  6  48 20 fb11f660  148         -   tty03    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047192   
f020e11c:  0000032000   
f020e118:  0000016511   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 R      0  2190  1351 31  39 20 fb11e798 2064         -   tty03    00:00:00 stackarray
 20 S      0  2197  2190  4  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2198  2197 30  39 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2202  2198  6  48 20 fb11f660  148         -   tty03    00:00:00 ps
availsmem freswap freemem
dumpfile = /dev/mem, namelist = /unix, outfile = stdout
f020e124:  0000047192   
f020e11c:  0000032000   
f020e118:  0000016000   
path               dev  swaplo blocks   free
/dev/swap          1,41      0 256000 256000
  F S    UID   PID  PPID  C PRI NI     ADDR   SZ     WCHAN     TTY        TIME CMD
 20 S      0   777     1  0  73 20 fb118988  160  fb118988   tty03    00:00:00 login
 20 S      0  1344   777  0  73 20 fb11b080   68  fb11b080   tty03    00:00:00 sh
 20 S      0  1351  1344  1  73 20 fb11e390  128  fb11e390   tty03    00:00:01 ksh
 20 R      0  2190  1351 79  21 20 fb11e798 2064         -   tty03    00:00:00 stackarray
 20 S      0  2203  2190  3  73 20 fb11f3b0   60  fb11f3b0   tty03    00:00:00 sh
 20 R      0  2204  2203 29  40 20 fb11f508   60         -   tty03    00:00:00 sh
 20 O      0  2208  2204  6  48 20 fb11f660  148         -   tty03    00:00:00 ps
 

Note that this also means that if a program requests (allocates) but does not actually use memory (as was the case in the previous tests), you can have the strange circumstance where you have free memory (because no physical pages have been allocated) but you can't run any more programs because you have run out of virtual memory (availsmem). That doesn't specifically mean you need more swap; it means you need more availsmem, and adding either more swap or more real RAM will fix the problem.

Of course, if you want to run that right now, adding swap is easier than adding memory. To add swap, you could just do:

touch /mynewswapfile
swap -a /mynewswapfile 256000

 
which instantly and magically adds roughly another 128 MB of swap (256000 blocks of 512 bytes each) to your system. Note that this is not permanent; you'd need to redo it at every boot.
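As a rough check on sizes like that one, here is a small sketch of the conversion. It assumes that swap sizes are counted in 512-byte blocks and that pages are 4K, which is consistent with the earlier listing (256000 blocks of swap against a freswap of 32000 pages). The function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: swap sizes are given in 512-byte blocks; pages are 4096 bytes.
BLOCK_BYTES=512
PAGE_BYTES=4096

# Convert a block count to 4K pages.
blocks_to_pages() {
    echo $(( $1 * BLOCK_BYTES / PAGE_BYTES ))
}

# Convert a block count to whole MiB (rounded down).
blocks_to_mib() {
    echo $(( $1 * BLOCK_BYTES / 1048576 ))
}

# Block count needed for a given number of MiB.
mib_to_blocks() {
    echo $(( $1 * 1048576 / BLOCK_BYTES ))
}

echo "256000 blocks = $(blocks_to_pages 256000) pages = $(blocks_to_mib 256000) MiB"
echo "128 MiB = $(mib_to_blocks 128) blocks"
```

So 256000 blocks is strictly 125 MiB (about 131 million bytes); an exact 128 MiB would be 262144 blocks.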

So how much swap do you need? Who knows? What kind of programs are you running? How much data and stack do they need? How much dump space do you think you'll need? That's the only way you could try to really figure it out; most folks just add 50% to memory and hope for the best. But if the only use is for dump, does that make any sense? Will you just be adding 50% more memory?

With today's large hard drives, it doesn't cost much to configure more swap than you think you'll ever need. When does it become ridiculous though? If you currently run 128 MB of memory, should you size for potentially having 256 MB? 512? A gigabyte?

If you have a separate dump device, and lots of real memory, you may not need much swap at all. I think I'd always configure some, just in case there's some kernel code somewhere that expects it, but that may not even be necessary- in fact, it's darn unlikely. You should certainly understand (despite the "common knowledge" to the contrary) that virtual memory doesn't NEED swap- swap can and will be used for vm, but it isn't REQUIRED. As for dump space, it's needed if you ever need it AND you expect to be analyzing the results. Otherwise, it truly is wasted space, and while today's hard drives are inexpensive, you might have better use for that space.


Bela Lubkin was kind enough to make some comments and suggestions on this article, which caused me to rewrite a few sections of it to try to make things clearer. Whether I succeeded or not, I thought it would be good to add his actual comments here also, and he agreed to let me publish his email. What follows are extracts from those emails, with explanatory comments from me in italics.

Here I had said that I hadn't stressed "freemem" in the original article because it didn't seem important to me in the context of writing about swap:

(Bela)

It is definitely important. freemem measures the amount of actual RAM that isn't currently in use, while availsmem measures the amount of virtual space that hasn't yet been promised to someone. availsmem is the upper bound on how much [measured in terms of memory usage] you can run at all. freemem is the upper bound on how much you can run without actually performing swap I/O, which is rather costly in performance. If you were to graph performance vs. memory usage, you would see something like this:


100% |====================================
     |                                   /=====
     |                                 (1)     =====
     |                                              =====
     |                                                  /==
     | (1) memory getting tight(*); kernel starts to  (2)  ==
     |     page non-dirty pages out of executable            ==
     |     binaries and other such read-only sources           ==
     |                                                           ==
     | (2) freemem approaching 0 (crosses GPGSHI), kernel         /=
     |     starts paging dirty pages out to swap                (3)=
     |                                                             =
     | (3) all inactive pages have been written out to swap(@);     =
     |     active pages start getting written out to swap;          =
     |     system starts to thrash                                  =
     |
  0% +-----------------------------------------------------------------------------

       (*) I'm not sure what the exact technical threshold is here.  It
           isn't GPGSHI, it isn't MINASMEM...

       (@) This isn't a technical threshold; more of a user tolerance
           threshold.  As you run more stuff, and as that stuff touches
           more of its memory more frequently, and as a higher
           percentage of that memory gets pushed out to swap,
           performance is going to degrade rapidly until the user finds
           it intolerable.

           Also, there's no real distinction between "active" and "not
           active" pages, in this context.  The question is, on average,
           how long will it be before this page that's being written out
           to swap will be needed in RAM again?  The kernel has
           strategies which make this average quite high when things
           aren't too tight.  As memory tightens, the average
           necessarily goes down.  When a significant portion of memory
           accesses actually become disk accesses, performance is
           extremely degraded; time to add more RAM.

freemem is important for performance; availsmem is important for being
able to run things at all -- quickly or not.

=============================================================================
 

The independence of these variables is also quite confusing. For instance, availsmem can approach 0 while freemem is still a large number. It's easy -- just run a lot of programs (like the examples in this question) which *allocate* a lot of memory, but never touch it. Suppose a system had 64MB RAM and 256MB swap. It would start out with availsmem around 80000 (*4K) and freemem around 15000. Now run 10 instances of a simple program that allocates 30MB of RAM, but doesn't touch it. These will decrement availsmem by about 300MB == 75000, leaving about 5000. But they won't take up an appreciable amount of real RAM, so freemem's still around 15000. Now try to run one more copy. There is still plenty of freemem; `sar -r` "freemem" looks fine. But you get EAGAIN because the program can't allocate another 7500 pages of availsmem.

What good is this mechanism doing? Well, what if all those programs *did* suddenly start touching their memory. The kernel would have to find actual backing store for those pages -- either RAM or swap. availsmem tells it how many pages of that backing store are not yet claimed. Thus, it can prevent a process from starting, which will require memory that might eventually not be available. The mechanism has an underlying assumption that processes will *not*, in general, allocate huge amounts of memory that they won't actually use. When that assumption is broken, the mechanism is over-protective -- it prevents you from using your RAM just because someone is hogging (and not using) address space.
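The accounting described above can be sketched as a toy model in shell. This is only an illustration using the round numbers from the example (80000 pages of availsmem, 15000 pages of freemem, programs of about 7500 pages each); it is not how the kernel is implemented, and the reserve function is a made-up name.

```shell
#!/bin/sh
# Toy model only: the numbers come from the 64MB RAM + 256MB swap example
# in the text, not from a real kernel.
availsmem=80000   # virtual pages (4K each) not yet promised to anyone
freemem=15000     # physical pages not currently in use

# Promise pages to a process. Only availsmem changes here; freemem would
# change only if the process actually touched the pages.
reserve() {
    if [ "$1" -gt "$availsmem" ]; then
        echo "EAGAIN: want $1 pages, availsmem=$availsmem freemem=$freemem"
        return 1
    fi
    availsmem=$(( availsmem - $1 ))
}

# Ten programs each allocate about 30MB and never touch it.
i=0
while [ "$i" -lt 10 ]; do
    reserve 7500
    i=$(( i + 1 ))
done
echo "after 10 programs: availsmem=$availsmem freemem=$freemem"

# The eleventh is refused even though freemem has not moved.
reserve 7500 || echo "11th program refused"
```

The point of the model is that the refusal depends only on availsmem; freemem never enters the test.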

I had responded with:

> And there's a common misunderstanding: most folks seem to
> believe that it has to be swap, and it doesn't.  If it did,
> I'd never be able to run 60 MB of programs when swap was 1
> MB.    I could be wrong, but I believe that the source of
> this is that older Unices (like Sun 4.x releases) actually
> DID require swap space, and couldn't use ram- but I don't
> have one of those anymore to mess with, so I can't be sure..

 

(Bela)

Some versions of Unix use static mappings of virtual space to swap space. The act of allocating virtual space (whether through malloc() (== [s]brk()), growing your stack, fork(), or initial mapping of a process's .data and .bss) also allocates matching pages of swap. In such a system, the total weight of processes that you can run at once equals your swap space. OSR5 doesn't bind virtual space to specific swap pages; as a result, it can use *all* of the potential backing store -- both RAM and swap -- to hold the combined weight of processes.
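The difference between the two schemes can be put in a few lines of shell. This is a toy comparison only: the 64 MB of RAM is a made-up figure, kernel overhead is ignored, and the function names are hypothetical.

```shell
#!/bin/sh
# Toy comparison; sizes in MB. ram_mb=64 is a made-up figure and kernel
# overhead is ignored.
ram_mb=64
swap_mb=1

# Static mapping: every virtual page needs a matching swap page, so the
# limit is the swap size alone (RAM, the first argument, is ignored).
static_limit() {
    echo "$2"
}

# Unbound mapping (as described for OSR5): any backing store, RAM or
# swap, can hold a page, so the limit is the sum of both.
pooled_limit() {
    echo $(( $1 + $2 ))
}

echo "static mapping: $(static_limit "$ram_mb" "$swap_mb") MB of processes"
echo "pooled store:   $(pooled_limit "$ram_mb" "$swap_mb") MB of processes"
```

Under these assumptions, 60 MB of programs fits in the pooled case but not in the static one, which matches the 1 MB swap experience mentioned in the question.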

Then I asked:

> One more thing: suppose you actually had 4 gig of RAM.  I'd
> assume, given what I think I know about OSR5, that there
> would be no point in having swap (assuming separate dump) at
> all- that availsmem couldn't exceed 4 gig anyway?  Not
> anything  I can test with my cheap hardware!

 

(Bela)

availsmem is counted in 4K pages. 4GiB == 1048576 pages. So availsmem itself doesn't limit (RAM + swap) to 4GiB.
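To make the page arithmetic concrete, here is a short sketch. It assumes, purely for illustration, that the counter is a signed 32-bit integer; the text does not say what type availsmem actually is, and the function names are made up.

```shell
#!/bin/sh
# Assumption: 4K pages, and (for illustration only) a signed 32-bit counter.
PAGE_KIB=4

# Number of 4K pages in a given number of GiB.
gib_to_pages() {
    echo $(( $1 * 1024 * 1024 / PAGE_KIB ))
}

# Whole GiB represented by a given number of 4K pages (rounded down).
pages_to_gib() {
    echo $(( $1 / (1024 * 1024 / PAGE_KIB) ))
}

echo "4 GiB = $(gib_to_pages 4) pages"
echo "2147483647 pages = about $(pages_to_gib 2147483647) GiB"
```

Under that assumption, a page count of 1048576 is nowhere near the largest value the counter could hold, which is why counting in pages does not by itself cap RAM plus swap at 4GiB.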

I don't know the answer to your broader question. In principle, I see no reason you couldn't have multiple large swap spaces. No one swap area can be larger than 4GiB, but the total size of swap can be much larger. As long as the swap page number can be stored in a 32-bit integer, it should be usable. Of course, you could never have more than 4GiB worth of it in memory at once -- so if you were really using all that swap, the ratio of slow-access to fast-access pages would be bad -- performance would bite.

As long as I had him on the hook, I figured I might as well ask all my questions:

> Sun used to give a swap sizing guide for Solaris 2.x that
> went down as memory increased, and swap disappeared entirely
> eventually.  Is there anything in OSR5 that would assume the
> existence of a swap device and cause a problem if there
> wasn't any?

 

(Bela)

I don't know, but I've deliberately run test systems with no swap at all (boot keyword "swap=none") for long enough to feel reasonably safe about it.


© September 1999 A.P. Lawrence. All rights reserved




