fork and wait
Joe Maimon
jmaimon at ttec.com
Wed Jan 17 15:40:15 CET 2007
code calls rad_fork(1) (such as rad_check_ts())
This means that no entry is stored in thread_pool.waiters
This means that rad_waitpid() calls reap_children() which means that
subsequent calls to waitpid() -- since there is no thread_pool.waiters
entry for the pid -- return an ECHILD, which leads rad_check_ts() to
print this message
radlog(L_ERR, "Check-TS: unknown error in waitpid()");
So basically rad_fork(1) and rad_waitpid() and waitpid() are now unreliable.
Furthermore, using rad_fork(0) without rad_waitpid(pid,...,...) will
eventualy result in a full thread_pool.waiters hashtable and calls to
rad_fork(0) will fail.
So waitpid(0,...,...) in reap_children() buys nothing except elimination
of zombie processes by trading it in for causing calls of rad_fork(0) to
fail.
(why 0 and not -1
main waitpid.....
-1 meaning wait for any child process.
0 meaning wait for any child process whose process group ID
is equal to that of the calling process.
)
I am getting errors forking my scripts off with rlm_exec -- apparently
thread_pool.waiters is getting full somehow, since this is logged.
Error: Couldn't fork /usr/local/sbin/radius-dns-update.sh: No child
processes
Which isnt a documented errno for fork()
grep "No child" /usr/include/ -r
/usr/include/asm-generic/errno-base.h:#define ECHILD 10
/* No child processes */
The fix? Here are some ideas.
1)
reap_children() should only call waitpid() for entries in
thread_pool.waiters
-- this was the old way, should be fairly easy to revert to.
This was changed due to zombies being created by creative forking in
rlm_perl.
I think that problem should have properly been resolved by using
rad_fork() or by building its own long lived fork/wait mechanism or by
properly daemonizing the child process.
-- rad_fork(1)s or fork()s that are not followed by waitpid()s will
result in zombies (which is a bug owned by the code doing the call to fork)
IMHO this makes the most sense -- freeradius should not try to fix other
code bugs by introducing all the breakage as descibed above.
2)
reap_children() should create a thread_pool.waiters entry for pids that
have exited for subsequent retrieval of status by rad_waitpid
-- rad_fork(1)s or fork()s that are not followed by waitpid()s will
result in thread_pool.waiters running out of elements.
-- waitpid() is still unreliable
-- rad_fork(1) is still (nearly) meaningless
This can be handled by rad_fork() behaving differently, such as deleting
all thread_pool.waiters elements that created by reap_children() if the
thread_pool.waiters reaches its limit.
This is a series of nasty kludges and workarounds for a problem that
should exist in the first place.
Attached find proof-of-concept patch
3)
calls to rad_fork(1) should also result in creation of
thread_pool.waiters element
-- This doesnt completely solve anything and 2 is still needed.
-- rad_fork(1) is meaningless
4)
rad_fork() should mark elements in the hash table as deletable when the
table is full -- after all, this is what exec_wait == 0 means, right?
Perhaps to make it as correct as possible, it should only delete a
percentage of the table, sorted by oldest entry. That might be hard.
Why have a waiters pool anyway when calling waitpid(0 ?
exec_wait == 1 processes get waited upon and their status discarded
exec_wait == 0 processes get waited upon and their status kept until an
explicit call to rad_waitpid() with the pid.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 360-rad_fork_reap_children_unknown_pids.dpatch
URL: <http://lists.freeradius.org/pipermail/freeradius-devel/attachments/20070117/693c652d/attachment.ksh>
More information about the Freeradius-Devel
mailing list