From: Bela Lubkin <email@example.com>
Subject: Re: Dying processes (inetd, cron, syslogd, sshd)
Date: 8 Aug 2005 06:10:34 -0400
Message-ID: <firstname.lastname@example.org>
References: <email@example.com>
    <firstname.lastname@example.org>

email@example.com wrote:

> Anyone have any ideas on this problem?

I posted on August 1st, but never saw it come back to me. This time I'm
Bcc'ing you so you'll see it even if USENET swallows it again...

firstname.lastname@example.org wrote:

> We are having problems on various SMP machines (5.0.6a + rs506a
> installed) where at times of large load most of the running processes
> just seem to stop (e.g. inetd, cron, syslogd, sshd,....) This always
> seems to occur at times of large stress to the disks, but we have never
> managed to put our fingers on exactly what is causing it. When it does
> happen not only does the inetd process die, but also cron and syslog
> which makes it very tricky for us to put anything in place to try and
> catch what is happening.
>
> We are able to ping the machine when it does happen and also login at
> the console and over a modem but not over a telnet or ssh connection.
>
> We have had an issue open with SCO before who advised us to install
> scodb and set it to trigger when the inetd process stops - and when it
> does to get a sysdump. We have tried this, but the sysdump created was
> too big for swap - do you know of any way from within scodb to reduce
> the size of the sysdump created?
>
> This machine (which has had the problem once a day for the last three,)
> is used as a backup server in our office. All that runs on it is two
> rsyncs of our main machine - one for mail/uucp spools, and one for the
> main data. The problem always has occurred during these rsyncs, normally
> when transferring a large file.

scodb can't reduce the size of a crash dump, but you can force the dumps
to fit by limiting the amount of memory seen by the kernel. To do this,
append " mem=1m-100m" to DEFBOOTSTR in /etc/default/boot (substituting a
bit less than the actual size of your dump area in place of "100m"). The
load you describe would probably run in 12MB of RAM, but don't limit
memory more than you have to.
The problem might be memory size-related. You want to keep as much as
you can of the machine's normal memory size.

[new material begins]

> What would be the outcome if you had one process that kept on wanting
> more and more resource?

There are some problem scenarios like that. A common one is a process
spinning out of control, allocating more and more memory. It will
eventually use all available memory; its next allocation attempt will
fail, and in most cases it will then die. Unless you have changed the
defaults, such a process usually writes a core dump.

On OSR5, during the dumping of a process's core, the process continues
to own all of its memory until the dump is complete. This means that
the machine remains critically out of memory for a long time. The
process may have grown nearly as large as your combined RAM + swap. To
dump it, not only does the kernel have to write that much data, it also
may have to page a large portion of it in from swap. This can take many
minutes with large memory and a slow disk...

During that period, other processes that try to allocate memory will
usually fail. Their subsequent behavior depends on their error
handling. Some will dump core, some will exit gracefully, some may even
stay up. And some may get into weird catatonic states.

> Do other processes hold onto the resource they have or will they
> eventually get 'bullied' out of the resource they are using and
> essentially stop (which theoretically would give the results I am
> seeing.)

For memory, a "hog" process will cause others to get written out to
swap, but those processes still "own" their memory (it will get paged
back in if they need to access it). The troubles happen when a process
tries to allocate more memory while the system is strapped.

There are probably other resources where similar things could happen.

> Any ideas? or does anyone have any ideas as to how I would track down
> what was causing this to happen.
If you had a process spin out and dump, it would leave a huge core file
that you would be able to find.

If a process spins out and dies _without_ leaving a dump, a more subtle
trace is left. Normally, OSR5 doesn't use any swap at all; `swap -l`
will have identical values in the "blocks" and "free" columns.
("Normal" modern systems have enough RAM that they never need to invoke
the tremendous performance loss of swapping.) After such an incident,
`swap -l` will show quite a bit of swap in use. This represents pages
that got pushed out, and whose processes have never actually needed to
access them since the incident.

What does `crash` "p" show in the "EVENT" column for the hung processes?

>Bela<
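
The `swap -l` check described above (identical "blocks" and "free" values
mean no swap is in use) could be scripted roughly as follows. This is a
minimal sketch and not part of the original post. It assumes the output
contains a header row naming the "blocks" and "free" columns and that the
first field of each data row identifies the swap area; `swap_in_use` is a
hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: report swap blocks in use per swap area.
# Reads `swap -l` style text on stdin. The "blocks" and "free" columns
# are located by header name rather than by position, because the exact
# column layout is an assumption here.
swap_in_use() {
    awk '
        # Until both column names have been seen, treat lines as header.
        !(b && f) {
            for (i = 1; i <= NF; i++) {
                if ($i == "blocks") b = i
                if ($i == "free")   f = i
            }
            next
        }
        # Data rows: first field, then blocks minus free.
        NF >= b && NF >= f { print $1, $b - $f }
    '
}

# Intended use on the affected machine (not executed here):
#   swap -l | swap_in_use
```

A result of 0 for every swap area matches the normal state described in
the post. A large nonzero value after an incident would suggest that
pages were pushed out and have not been touched since.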