home server debugging issues

Fri Nov 27 10:29:54 CET 2009

On Thu, Nov 26, 2009 at 06:17:29PM +0100, Alan DeKok wrote:
> Josip Rodin wrote:
> > I upgraded one of our proxy servers from 2.0.4 to 2.1.7, and noticed that
> > the proxying changed in a way that "status_check = request" logic started
> > being critical, so this kind of stuff:
> > 
> > Sun Nov 22 09:25:56 2009 : Error: Rejecting request 70011 due to lack of any response from home server X port 1812
> > 
> > ...was replaced, without a change in home server configuration, with:
> 
>   It wasn't replaced, it just happens less often.
> 
> > It was unclear to me why didn't FreeRADIUS notice this as soon as it first
> > happened, and when it eventually happened, why didn't it explicate the
> > rationale. So I looked and found these in src/main/event.c:
> 
>   Odds are your config handles the "no response" packets.  So the above
> message happens less often.

Returning to the original problem, in my pool of two fail-over home servers
I now have both of them set up with "status_check = none".

My upstream proxy maintainers refuse to implement decent status checks,
so I'm forced to do this for now. I can do a status check with an entry
from a particular HL RADIUS that I happen to control, but that just creates
a daisy-chain of SPoFs. :/ They insist that I not do anything like this,
but that I set up my server so that it stubbornly tries their first server,
then if that fails their second server, for each request.

Now, when a request comes through that gets discarded by the first proxy
(because it itself times out on a random HL RADIUS), the first proxy gets
marked as a zombie. Strangely enough, my server keeps it marked as a zombie
even after several minutes (long past the zombie_period and revive_interval
values I have in the configuration). My server keeps talking only with the
second server, which is in the 'alive' state, and ignores the zombie.
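
For reference, a minimal sketch of the kind of setup I'm describing. The
server, pool and realm names, the addresses, the secret and the timer values
are all placeholders for illustration, not my actual configuration:

        #  First upstream proxy (hypothetical name and address)
        home_server upstream1 {
                type = auth
                ipaddr = 192.0.2.1
                port = 1812
                secret = changeme
                response_window = 20
                zombie_period = 40
                revive_interval = 120
                status_check = none
        }

        #  Second upstream proxy, same settings, different address
        home_server upstream2 {
                type = auth
                ipaddr = 192.0.2.2
                port = 1812
                secret = changeme
                response_window = 20
                zombie_period = 40
                revive_interval = 120
                status_check = none
        }

        #  Try upstream1 first, fall back to upstream2
        home_server_pool upstream_failover {
                type = fail-over
                home_server = upstream1
                home_server = upstream2
        }

        realm example.org {
                auth_pool = upstream_failover
        }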

After re-reading the proxy.conf comments, this actually looks logical: with
status_check = none there is no status check of any kind that would unmark
it as a zombie. revive_interval can resurrect it from the 'dead' state, but
not from the zombie state. Also, this part of the revive_interval comment is
a bit confusing:

        #  As a result, we recommend enabling status checks, and
        #  we do NOT recommend using "revive_interval".
        #
        #  The "revive_interval" is used ONLY if the "status_check"
        #  entry below is not "none".  Otherwise, it will not be used,
        #  and should be deleted.

So it's supposed to be a crutch only for people who *have* status checks,
but not a crutch for those of us who do *not* have status checks.

What is the crutch for this situation, then? A cron job that keeps running
radmin -e 'set home_server state X Y alive'? :)

-- 
     2. That which causes joy or happiness.