fork and wait

Wed Jan 17 15:40:15 CET 2007

Some code calls rad_fork(1) (for example, rad_check_ts()).

This means that no entry is stored in thread_pool.waiters

rad_waitpid() calls reap_children(), whose waitpid(0,...,...) reaps the 
child. Since there is no thread_pool.waiters entry for the pid, 
subsequent calls to waitpid() for that pid return ECHILD, which leads 
rad_check_ts() to print this message:

radlog(L_ERR, "Check-TS: unknown error in waitpid()");

So basically rad_fork(1) and rad_waitpid() and waitpid() are now unreliable.
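
To make the failure concrete, here is a minimal sketch in plain C. It is not FreeRADIUS code: errno_after_reaper() is a hypothetical name, and the blocking waitpid(0, ...) only stands in for what reap_children() does, so that the demo is deterministic.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits at once, let a "reaper" collect it with
 * waitpid(0, ...), then wait for the same pid again by number.
 * Returns the errno of the second waitpid(), 0 if it unexpectedly
 * succeeded, or -1 if fork() failed.
 */
int errno_after_reaper(void)
{
    int status;
    pid_t reaped;
    pid_t pid = fork();

    if (pid < 0) return -1;
    if (pid == 0) _exit(0);            /* child: exit immediately */

    /* Stand-in for reap_children(): collects any child in our group. */
    reaped = waitpid(0, &status, 0);
    assert(reaped == pid);             /* demo assumes this is the only child */

    /* Stand-in for the caller's later waitpid(pid, ...). */
    if (waitpid(pid, &status, 0) == -1) return errno;
    return 0;
}
```

The second wait has nothing left to collect, so the caller sees ECHILD even though its child ran and exited normally.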

Furthermore, using rad_fork(0) without rad_waitpid(pid,...,...) will 
eventually result in a full thread_pool.waiters hashtable, and calls to 
rad_fork(0) will fail.

So waitpid(0,...,...) in reap_children() buys nothing except elimination 
of zombie processes by trading it in for causing calls of rad_fork(0) to 
fail.

(Why 0 and not -1? From man waitpid:

        -1     meaning wait for any child process.

        0      meaning wait for any child process whose process group ID
               is equal to that of the calling process.

)
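
The difference between the two values can be sketched as follows. This is an illustration only; wait_group_vs_any() is a hypothetical helper, and it assumes the calling process has no other children.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that moves into its own process group and then exits.
 * *by_group receives the result of waitpid(0, ...), *by_any the result
 * of waitpid(-1, ...). Returns the child's pid, or -1 if fork() failed.
 */
pid_t wait_group_vs_any(pid_t *by_group, pid_t *by_any)
{
    int status;
    pid_t pid;

    assert(by_group != NULL && by_any != NULL);

    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        setpgid(0, 0);                 /* child: leave the parent's group */
        _exit(0);
    }

    /*
     * Make the same change from the parent, so the new group is in effect
     * before we wait no matter which process runs first. A failure here is
     * ignored: it means the child already did it and has exited.
     */
    (void) setpgid(pid, pid);

    *by_group = waitpid(0, &status, 0);   /* only children in our group */
    *by_any   = waitpid(-1, &status, 0);  /* any child at all */
    return pid;
}
```

With no other children around, waitpid(0, ...) fails with ECHILD because the only child is in a different process group, while waitpid(-1, ...) collects it.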

I am getting errors forking my scripts off with rlm_exec -- apparently 
thread_pool.waiters is getting full somehow, since this is logged.

Error: Couldn't fork /usr/local/sbin/radius-dns-update.sh: No child processes

Which isn't a documented errno for fork().

grep "No child" /usr/include/ -r
/usr/include/asm-generic/errno-base.h:#define   ECHILD          10      /* No child processes */

The fix? Here are some ideas.

1)

reap_children() should only call waitpid() for entries in 
thread_pool.waiters

-- this was the old way, should be fairly easy to revert to.

This was changed due to zombies being created by creative forking in 
rlm_perl.

I think that problem should have properly been resolved by using 
rad_fork() or by building its own long lived fork/wait mechanism or by 
properly daemonizing the child process.

-- rad_fork(1)s or fork()s that are not followed by waitpid()s will 
result in zombies (which is a bug owned by the code doing the call to fork)

IMHO this makes the most sense -- freeradius should not try to fix other 
code's bugs by introducing all the breakage described above.
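
A minimal sketch of idea 1, assuming the tracked pids live in a plain array rather than the real thread_pool.waiters hash table. reap_tracked() is a hypothetical name, not an existing FreeRADIUS function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap only the pids listed in tracked[]. Each slot that was reaped is
 * set to -1. Children that are not listed are never touched, so their
 * owner can still call waitpid(pid, ...) on them.
 * Returns the number of children reaped in this pass.
 */
size_t reap_tracked(pid_t *tracked, size_t n)
{
    size_t reaped = 0;

    assert(tracked != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int status;

        if (tracked[i] <= 0) continue;          /* empty slot */

        /* WNOHANG: skip children that are still running. */
        if (waitpid(tracked[i], &status, WNOHANG) == tracked[i]) {
            tracked[i] = -1;
            reaped++;
        }
    }
    return reaped;
}
```

Because the reaper waits by explicit pid, a child forked outside the table keeps its exit status until its own caller waits for it.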

2)

reap_children() should create a thread_pool.waiters entry for pids that 
have exited for subsequent retrieval of status by rad_waitpid

-- rad_fork(1)s or fork()s that are not followed by waitpid()s will 
result in thread_pool.waiters running out of elements.

-- waitpid() is still unreliable

-- rad_fork(1) is still (nearly) meaningless

This can be handled by rad_fork() behaving differently, such as deleting 
all thread_pool.waiters elements that were created by reap_children() if 
thread_pool.waiters reaches its limit.

This is a series of nasty kludges and workarounds for a problem that 
should not exist in the first place.
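
For illustration only (this is not the proof-of-concept patch), idea 2 could look roughly like this. struct exit_cache, reap_and_remember(), cached_waitpid() and the limit of 16 are all hypothetical; the real code would use the thread_pool.waiters hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define EXIT_CACHE_MAX 16          /* hypothetical limit */

struct exit_cache {
    pid_t  pid[EXIT_CACHE_MAX];
    int    status[EXIT_CACHE_MAX];
    size_t used;
};

/*
 * Reap every child that has already exited and remember its status.
 * Stops when the cache is full, which is the "running out of elements"
 * problem noted above. Returns the number of statuses stored.
 */
size_t reap_and_remember(struct exit_cache *c)
{
    size_t stored = 0;
    int status;
    pid_t pid;

    assert(c != NULL);

    while (c->used < EXIT_CACHE_MAX &&
           (pid = waitpid(0, &status, WNOHANG)) > 0) {
        c->pid[c->used] = pid;
        c->status[c->used] = status;
        c->used++;
        stored++;
    }
    return stored;
}

/*
 * Fetch and remove a remembered status. Returns 0 and fills *status if
 * the pid was cached, -1 otherwise.
 */
int cached_waitpid(struct exit_cache *c, pid_t pid, int *status)
{
    assert(c != NULL && status != NULL);

    for (size_t i = 0; i < c->used; i++) {
        if (c->pid[i] == pid) {
            *status = c->status[i];
            c->used--;                       /* move last entry into the gap */
            c->pid[i] = c->pid[c->used];
            c->status[i] = c->status[c->used];
            return 0;
        }
    }
    return -1;
}
```

Entries whose owner never asks for the status stay in the cache forever, which is why this idea needs the extra cleanup described above.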

A proof-of-concept patch is attached.

3)

calls to rad_fork(1) should also result in creation of 
thread_pool.waiters element

-- This doesn't completely solve anything and 2 is still needed.

-- rad_fork(1) is meaningless

4)

rad_fork() should mark elements in the hash table as deletable when the 
table is full -- after all, this is what exec_wait == 0 means, right?

Perhaps to make it as correct as possible, it should only delete a 
percentage of the table, sorted by oldest entry. That might be hard.
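
One way this might be less hard than it sounds: if the table is kept in insertion order, the oldest entries are simply the ones at the front. A rough sketch under that assumption follows. struct waiters_fifo, the capacity of 8 and both function names are hypothetical, and a real hash table would need a separate age list to do the same thing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define WAITERS_CAP 8              /* hypothetical capacity */

struct waiters_fifo {
    pid_t  pids[WAITERS_CAP];      /* oldest entry first */
    size_t used;
};

/* Drop the oldest `percent` of the entries (at least one if possible). */
size_t evict_oldest(struct waiters_fifo *t, unsigned percent)
{
    size_t k;

    assert(t != NULL && percent <= 100);

    k = t->used * percent / 100;
    if (k == 0 && t->used > 0 && percent > 0) k = 1;

    /* Shift the surviving (newer) entries to the front. */
    memmove(t->pids, t->pids + k, (t->used - k) * sizeof t->pids[0]);
    t->used -= k;
    return k;
}

/* Add a pid, evicting old entries first if the table is full. */
int add_with_eviction(struct waiters_fifo *t, pid_t pid, unsigned percent)
{
    assert(t != NULL);

    if (t->used == WAITERS_CAP && evict_oldest(t, percent) == 0)
        return -1;                 /* full and nothing could be evicted */

    t->pids[t->used++] = pid;
    return 0;
}
```

With a capacity of 8 and 25 percent, adding a ninth pid drops the two oldest entries and then stores the new one.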

Why have a waiters pool anyway when calling waitpid(0,...,...)?

exec_wait == 1 processes get waited upon and their status discarded
exec_wait == 0 processes get waited upon and their status kept until an 
explicit call to rad_waitpid() with the pid.

-------------- next part --------------
Name: 360-rad_fork_reap_children_unknown_pids.dpatch
URL: <http://lists.freeradius.org/pipermail/freeradius-devel/attachments/20070117/693c652d/attachment.ksh>