


Best of the Newsgroups: prngd killed all processes



Path: border1.nntp.dca.giganews.com!nntp.giganews.com!newscon06.news.prodigy.com!prodigy.net!newsfeed-00.mathworks.com!enigma.xenitec.ca!jpradley.jpr.com!via-email
From: Bela Lubkin 
Newsgroups: comp.unix.sco.misc
Subject: SOLUTION, Re: Dying processes (inetd, cron, syslogd, sshd)
Date: 7 Nov 2005 21:04:23 -0500
Lines: 214
Sender: nouser@jpradley.jpr.com
Message-ID: <200511071804.aa21657@deepthought.armory.com>
References: <1122912804.902870.308720@g44g2000cwa.googlegroups.com> <1124115144.362415.307700@o13g2000cwo.googlegroups.com>
NNTP-Posting-Host: jpradley.jpr.com
X-Trace: jpradley.jpr.com 1131415463 24377 (None) 198.207.210.2
X-Complaints-To: news@jpradley.jpr.com
X-Mailer: Mail User's Shell (7.2.6 beta(5) 1998-10-07 + patches)
Xref: number1.nntp.dca.giganews.com comp.unix.sco.misc:167712

A few months ago, Keith Crymble of Actual Systems posted a puzzle.
Multiple systems running SCO OpenServer Release 5.0.6 + rs506a were
having an intermittent problem.  The symptom was that most of the
processes on the system would suddenly die without warning.

After some public discussion, Keith and I agreed that the most likely
way to solve the problem was for me to access the systems directly.  We
arranged to do this under the auspices of the company I am now working
for.  We also agreed that when the problem was solved, I would post the
story so that others might avoid the problem in the future.



So...

In brief, the problem was being caused by an old, buggy version of the
pseudo-random number generator, `prngd`.  The eventual solution was
simply to upgrade the machines to a more recent version of `prngd`.

Now for some details.  I will describe the discovery process, the actual
cause of the problem, the important details of the symptoms, and the
solution.

We arranged for me to have console access to two live backup machines
which were experiencing the problem.  Keith configured his firewall to
allow me to `ssh` in to a master machine.  From that system I could `cu`
to COM1 of each of the test machines.  Keith configured scodb into their
kernels, and booted them with COM1 as their consoles.  Now I had remote
live kernel debugger access.

I installed GNU `screen` on the master machine so that I could connect
and disconnect at will without losing console output from the test
machines.

I established kernel breakpoints to enter the debugger whenever certain
system processes (like `cron`, `inetd` etc.) died.  Then it was a matter
of waiting for an "event".



In the first event, the process that tripped the breakpoint was being
killed by signal 9 (SIGKILL).  Many other processes had pending SIGKILL
signals.  The timing made me
think that the problem was being caused by a single multi-process kill
rather than a series of individual kills.  The kill(S) system call
accepts special arguments of "-PGID" (negative process group ID) to kill
all processes in a particular group; or -1 to kill all processes in the
system.  Unfortunately, the process responsible for the call had
finished and gone on to other business; I didn't catch it in the act.
This might have been different on a single-CPU system, but Keith's test
systems were MP.  While I was examining the dying process on one CPU
(probably in the few milliseconds after the debugger prompt came up),
the culprit was finishing his dirty work on the other CPU.
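
The way kill() interprets its first argument can be summarized in a few
lines of C.  This is only an illustrative sketch: `kill_scope_of` and the
enum names are made up for this example and are not part of any system
header.  The `pid == 0` case (the caller's own process group) comes from
the standard kill() semantics rather than from the discussion above.

```c
#include <sys/types.h>

/* Hypothetical names, for illustration only. */
enum kill_scope {
    KILL_ONE_PROCESS,   /* pid > 0:   exactly that process               */
    KILL_OWN_GROUP,     /* pid == 0:  the caller's own process group     */
    KILL_EVERYTHING,    /* pid == -1: every process the caller may signal */
    KILL_GROUP          /* pid < -1:  the process group whose ID is -pid */
};

/* Classify what kill(pid, sig) would be aimed at, without sending anything. */
enum kill_scope kill_scope_of(pid_t pid)
{
    if (pid > 0)
        return KILL_ONE_PROCESS;
    if (pid == 0)
        return KILL_OWN_GROUP;
    if (pid == -1)
        return KILL_EVERYTHING;
    return KILL_GROUP;
}
```

The point to notice is how close together the cases sit: a variable that
is supposed to hold one process's ID needs only to become -1 for the same
call to mean "everything".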

Now that I thought it was a multi-process SIGKILL, I put a breakpoint
into kill(S) itself, essentially:

  if (signal is SIGKILL and process to kill is < 0 [multiple procs])
    breakpoint

This eventually triggered as well.  Previous evidence had led me to
suspect the remote-command program, `rcmd` (`rsh` on other *ix systems).
`rcmd` was already implicated because, according to Keith, the events
always happened while large files were being copied across the network.
Also, I had run `truss` on it and observed that it used SIGKILL to kill
off its own child process.  I suspected a race condition where `rcmd`
was getting confused about its child process's ID and mistakenly doing
`kill(-1, 9)`.

So I was fairly surprised when the actual caller was `prngd`!  Sure
enough, it was calling `kill(-1, 9)`.  But why??

Keith's systems were running `prngd` version 0.9.6.  I found the
development site for `prngd`,

  http://www.aet.tu-cottbus.de/personen/jaenicke/postfix_tls/prngd.html

and its FTP repository,

  ftp://ftp.aet.tu-cottbus.de/pub/postfix_tls/related/prngd/

Fortunately there was an "old" subdirectory with 40 previous versions of
the `prngd` source.  This allowed me to see how this bug appeared and
later disappeared.

`prngd` is the pseudo-random number generator; on OpenServer, this is
primarily used by `ssh`.  (Interestingly, Keith had earlier mentioned
that the problem seemed to get worse when they used `ssh` instead of
`rcmd` for their large file copies.  Now that we've identified `prngd`
as the culprit, this makes sense...)  Cryptographic protocols tend to
use random numbers to make it harder to crack the encryption.  `prngd`
creates "pseudo" random numbers by running what it calls "entropy
gathering commands".  On OSR5, these are commands like `ps -efl`,
`netstat -in`, `df`, and `tail -200 /var/adm/syslog`.  The command list
is stored in /etc/prngd.conf.

Since `prngd` is portable to many operating systems, it has to deal with
all sorts of unusual conditions.  One of these is this: some time in the
past, it had run into entropy gathering commands which would sometimes
hang.  You can imagine that a command like `netstat -in`, which dives
into kernel networking structures, might have some obscure bugs.  In
order to protect itself from possible hangs, `prngd` monitors the
entropy gathering commands and kills them off if they run for too long.
(Since this appears in the first public release of `prngd`, it looks
like the author anticipated the hanging problem without necessarily
having seen it.)
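
A watchdog of this general kind might look something like the sketch
below.  It is not taken from `prngd`: `run_with_timeout` is a hypothetical
helper, and polling with waitpid() every 10 ms is an assumption made to
keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv[0] with arguments argv; give it timeout_seconds to finish.
   Returns the command's exit status if it finished in time, or -1 if it
   could not be forked, died from a signal, or had to be killed. */
int run_with_timeout(char *const argv[], int timeout_seconds)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }

    const struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    time_t deadline = time(NULL) + timeout_seconds;
    int status;

    for (;;) {
        pid_t done = waitpid(child, &status, WNOHANG);
        if (done == child)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (done == -1)
            return -1;
        if (time(NULL) >= deadline)
            break;                  /* ran too long: treat it as hung */
        nanosleep(&tick, NULL);
    }

    /* 'child' is a local variable and the child has not been reaped yet,
       so the PID passed here can be neither -1 nor a recycled PID. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return -1;
}
```

Keeping the PID in a local variable that no signal handler touches is a
deliberate choice in this sketch; it sidesteps the kind of trouble
described next.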

The code for this in `prngd` 0.9.6 looked sort of like this:

  pid = (start the entropy gathering command, report its PID)
  ...
  if (too much time has gone by: entropy gathering command seems hung)
    if (pid != -1)
      kill(pid, SIGKILL)

It also set up a signal handler to receive SIGCHLD, which notifies a
parent process when its subprocess dies.  Some of the code in that
handler looked like this:

  pid = -1   /* note no entropy gathering command currently running */

This led to a race condition.  The fatal sequence went like this:

  pid = (start the entropy gathering command, report its PID)
  ...
  if (too much time has gone by: entropy gathering command seems hung)
    if (pid != -1)

      /* at this moment, pid isn't -1 */

      /* also at this moment, the entropy gathering command
         finishes, so we enter the SIGCHLD handler */

      ...
      pid = -1
      ...

      /* back in the main code */

      kill(pid, SIGKILL)

The main code intended to kill off the one entropy gathering process,
but it got tricked into sending SIGKILL to PID -1.  Which means "kill
every process on the system".  Whoops.
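
The same interleaving can be replayed deterministically in a small C
simulation.  Nothing here is `prngd` source: `gatherer_pid`,
`on_child_exit` and `racy_kill_target` are hypothetical names, the
"signal handler" is an ordinary function called at the critical moment,
and no real kill() is ever issued.

```c
#include <stddef.h>

/* -1 means "no entropy gathering command is running". */
int gatherer_pid = -1;

/* Pretend a gatherer with this PID has just been started. */
void start_gatherer(int pid)
{
    gatherer_pid = pid;
}

/* Stand-in for the SIGCHLD handler: the child is gone, forget its PID. */
void on_child_exit(void)
{
    gatherer_pid = -1;
}

/* Replays the buggy check-then-kill sequence.  'interrupt' (may be NULL)
   is invoked between the check and the use, modelling a signal that
   arrives inside the window.  Returns the value that would have been
   passed to kill(), or 0 if kill() would not have been called. */
int racy_kill_target(void (*interrupt)(void))
{
    if (gatherer_pid != -1) {
        if (interrupt != NULL)
            interrupt();            /* handler runs inside the window */
        return gatherer_pid;        /* re-read after the check */
    }
    return 0;
}
```

With no interruption, `racy_kill_target(NULL)` returns the PID that was
started.  With `on_child_exit` passed in, it returns -1: the check passed
on the old value, but the "kill" uses the new one.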

The timing window between `if (pid != -1)` and `kill(pid, SIGKILL)` is
small.  You might think "that will never happen!"  People who program
in multithreaded / multitasking environments soon learn that this sort
of "race condition" _always_ eventually shows up.  The CPU is running
millions or billions of instructions per second and this window is only
a few instructions wide, but you can be sure it will eventually get hit.
If the consequences were less catastrophic, it might never be noticed.
For instance if the only consequence was that, once every few million
runs, an "entropy gathering process" would hang and never finish (but
`prngd` kept running), it might never have been fixed.

I started checking other versions of `prngd`.  The first public release
of `prngd` was 0.1.0, on 2000-07-03, and it already had this bug.  0.9.6
was published on 2001-02-19.  One short week later, 2001-02-26, version
0.9.8 was released -- including the fix for this bug.  (The current
version is 0.9.29, released 2004-07-12...)  The bug lasted about 7 1/2
months.

So this was an already known bug in a system daemon.  The tricky part
was tracking back from the symptoms to their cause.

Now, a word about the symptoms.  Keith had reported that "most of the
running processes just seem to stop".  Here's what was actually
happening.  When `prngd` called `kill(-1, 9)`, this sent SIGKILL to
every _eligible_ process.  The kill(C) man page says:

"  If the effective user ID of the sender is root, send the
"  signal to all processes (except processes 0 and 1).

This is an imperfect description, because certain other processes are
also exempt.  The signal is not sent to any kernel processes.  On OSR5,
these include "sched", "vhand", "bdflush", "CPUn idle process",
"kmdaemon", "vddaemon", "strd", "htepi_daemon", "dtdaemon", and possibly
others.

Perhaps even more importantly, process 1 -- `init` -- is exempt.  One of
init's jobs is to restart certain processes if they die.  After an
"event" on one of Keith's systems, `init` restarted all the processes it
was responsible for: about a dozen `getty` processes (for the console
multiscreens), and the daemon starting daemon, `sdd`.

By the time a human got a look at the system, after an event which
theoretically killed "every" process on the system, there were about 20
processes running.

Now you too can recognize the symptoms of a "kill(-1, 9)" call...

The solution, of course, was very simple.  Keith installed a newer
version of `prngd`, and the problem hasn't happened once in the last
month.  It used to happen several times a week.

I recommend checking the `prngd` (sometimes called `in.prngd`) binaries
on all of your systems running any sort of *ix OS -- OpenServer,
UnixWare, Linux, Mac OS X, BSD, whatever.  If you find a version earlier
than 0.9.8, upgrade to a more recent version.  (Compare the version
components numerically: `prngd` 0.9.10 came after 0.9.9, so it is newer
than 0.9.8.)  All *ix systems are vulnerable to some amount of
trouble from this bug.  Even if `prngd` is run under its own separate
user ID, it would still succeed in killing _itself_, thus shutting down
the pseudo random number service.
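
If you script the check, compare the version components as numbers, not
as strings.  A minimal sketch in C follows; `compare_versions` and
`prngd_needs_upgrade` are hypothetical helpers, and how you obtain the
installed version string is left out because it depends on the system.

```c
#include <stdlib.h>

/* Compare two dotted version strings component by component, as numbers.
   Returns -1, 0 or 1 when a is older than, equal to, or newer than b.
   A missing component counts as 0, so "0.9" equals "0.9.0". */
int compare_versions(const char *a, const char *b)
{
    while (*a != '\0' || *b != '\0') {
        char *end_a, *end_b;
        long na = strtol(a, &end_a, 10);    /* 0 if nothing left to parse */
        long nb = strtol(b, &end_b, 10);

        if (na != nb)
            return (na < nb) ? -1 : 1;
        if (end_a == a && end_b == b)
            break;                          /* non-numeric text: stop */

        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
    return 0;
}

/* Non-zero if the given prngd version predates the fix in 0.9.8. */
int prngd_needs_upgrade(const char *installed)
{
    return compare_versions(installed, "0.9.8") < 0;
}
```

A plain string comparison would rank "0.9.10" below "0.9.8"; the
numeric comparison ranks it above, which is the correct order.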

Given the age of the problematic versions, you won't find many of them
running around out there.  But it's still worth checking.

Likewise, if you ever have "nearly all" processes on a system die,
consider whether it might have been a global kill.  As we saw here, this
doesn't necessarily leave the process table completely blank.

I am available for Unix problem solving and security consultation
through IS-Data, LLC, a Santa Cruz consultancy specializing in network
security.  www.is-data.net.

>Bela<



 







Wed Nov 9 15:57:59 2005: Subject:   anonymous


I guess I have been lucky! We are running 0.9.6, and have not had any issues. I will upgrade anyway, just in case. Very nice detective work here, and very well written. Thanks!

- Bruce





/Bofcusm/2628.html copyright 1997-2004 (various authors) All Rights Reserved
