home server debugging issues
Alan DeKok
aland at deployingradius.com
Fri Nov 27 11:30:02 CET 2009
Josip Rodin wrote:
> Returning to the original problem, in my pool of two fail-over home servers
> I now have both of them set up with "status_check = none".
2.1.7 has some changes in proxy fail-over. The *first* packet that
discovers a home server is dead is no longer rejected. Instead, it is
failed over to the second home server.
This makes proxying more robust.
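For reference, that fail-over path assumes a pool roughly like the
following in proxy.conf. The names, addresses, and secrets below are
only placeholders, not anything from your setup:

home_server upstream1 {
        type = auth
        ipaddr = 192.0.2.10
        port = 1812
        secret = changeme
}
home_server upstream2 {
        type = auth
        ipaddr = 192.0.2.11
        port = 1812
        secret = changeme
}
home_server_pool upstream_failover {
        type = fail-over
        home_server = upstream1
        home_server = upstream2
}
realm example.com {
        auth_pool = upstream_failover
}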
> My upstream proxy maintainers refuse to implement decent status checks,
> so I'm forced to do this for now. I can do a status check with an entry
> from a particular HL RADIUS that I happen to control, but that just creates
> a daisy-chain of SPoFs. :/ They insist that I not do anything like this,
> but that I set up my server so that it stubbornly tries their first server,
> then if that fails their second server, for each request.
That's stupid. It increases latency and bandwidth use, and decreases
reliability.
The Status-Server draft says that using Status-Server is preferable to
the alternatives. Maybe they'll follow it once it becomes an RFC.
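For what it's worth, enabling Status-Server checks is only a few lines
per home server. Something like this (the numbers are illustrative, and
the name and address are placeholders):

home_server upstream1 {
        type = auth
        ipaddr = 192.0.2.10
        port = 1812
        secret = changeme
        status_check = status-server
        check_interval = 30        # send Status-Server this often while checking
        check_timeout = 4          # seconds to wait for a reply to each check
        num_answers_to_alive = 3   # consecutive replies needed to mark it alive
}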
> Now, when a request comes through that gets discarded by the first proxy
> (because it itself times out on a random HL RADIUS), that one gets marked as
> a zombie. Strangely enough, my server keeps it marked as a zombie even after
> several minutes (long past any of the zombie_period and revive_interval
> periods I've kept in the configuration). My server keeps talking only with
> the second server which is in the 'alive' state, and ignores the zombie.
Hmm... the "zombie_period" timers depend on continued packet streams.
If the NAS doesn't re-transmit packets, then the home server could stay
in the zombie state for a while. I'll have to take a look at that.
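Those timers live in the home_server definition in proxy.conf. Roughly,
and with example values only: a home server becomes "zombie" when it
stops replying within response_window, and the zombie-to-dead decision
then depends on further packets being proxied to it:

home_server upstream1 {
        # ... type/ipaddr/port/secret as above ...
        response_window = 20   # no reply within this many seconds -> server is suspect
        zombie_period = 40     # this long with no replies -> marked dead
}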
> After re-reading proxy.conf comments, this actually looks logical - there is
> no kind of a status check that would unmark it as a zombie. revive_interval
> can resurrect it from the 'dead' state, but not from the zombie state. Also
> this part of the revive_interval comment is a bit confusing:
>
> # As a result, we recommend enabling status checks, and
> # we do NOT recommend using "revive_interval".
> #
> # The "revive_interval" is used ONLY if the "status_check"
> # entry below is not "none". Otherwise, it will not be used,
> # and should be deleted.
>
> So it's supposed to be a crutch only for people who *have* status checks,
> but not a crutch for those of us who do *not* have status checks.
Huh? That's not what it says. It says "revive_interval" is ONLY for
people who have "status_check = none", i.e. no status checks.
> What is a crutch for this situation? A cron job that keeps doing
> radmin -e 'set home_server state X Y alive'? :)
If you don't have status-checks, then the "revive_interval" should
apply. If it's not being applied, that should be fixed.
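i.e. with status checks disabled, something like the following (the
number is only an example) should bring the server back on its own once
it has been marked dead:

home_server upstream1 {
        # ... type/ipaddr/port/secret as above ...
        status_check = none
        revive_interval = 120   # assume it is alive again this many seconds after being marked dead
}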
Alan DeKok.