Proxy realms and home_server_pool fallback not working

Peter Lambrechtsen peter at crypt.co.nz
Tue Mar 8 22:54:29 CET 2016


On Wed, Mar 9, 2016 at 5:19 AM, Alan DeKok <aland at deployingradius.com>
wrote:

> On Mar 8, 2016, at 4:24 AM, Peter Lambrechtsen <peter at crypt.co.nz> wrote:
> > This doesn't seem to work in 3.0.x head, I will test it on 3.1.x
> tomorrow.
>
>   I've pushed a fix.
>

That's fixed it... Brilliant :)


>
> > I think this must work in 3.1 as it doesn't work for me in 3.0.x head
> from
> > last week, as I just tried this and fallback didn't seem to get applied
> at
> > all.
>
>   v3.0.x head worked for me yesterday when I tried that.
>
>   The "fallback" code for home_server_pools is independent of the type of
> the home_server_pool.
>
>   You may be running into a timer issue... i.e. if the timers are short,
> the home_server is marked alive, and the fallback is never used.
>
>   I used "radmin" to forcibly set the home_server state to "dead".  That
> avoids the timer issues, and the fallback works correctly.
>
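
(For reference, forcing the state from radmin looks something like the following; the address and port here are placeholders for the actual home server:)

        radmin -e "set home_server state 192.0.2.10 1812 dead"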

I think that was my issue: I was using a second VM on the network as the
proxy destination, shutting down the destination server, and not waiting
for the zombie period to expire.

(9)   } # authorize = updated
Home server pool ProxyDestPool failing over to fallback cacheuser
(9) # Executing section pre-proxy from file ./sites-enabled/default

That was indeed my issue. I've just re-tested with 3.0.x head and found I
had zombie_period set too high. After I wound that number down to match
check_interval, the fallback occurred once the server went zombie.
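
For anyone following along, the pool definition that produces the fail-over
in the debug output above looks roughly like this. The name remote_radius
is my placeholder for the real proxy destination; cacheuser is a
home_server pointing at a local virtual server of the same name:

        home_server cacheuser {
                virtual_server = cacheuser
        }

        home_server_pool ProxyDestPool {
                type = fail-over
                home_server = remote_radius
                fallback = cacheuser
        }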

        zombie_period = 10
        check_interval = 10
        num_answers_to_alive = 2

This way, once the server has been offline for 10 seconds it is marked
zombie and fallback occurs. Then, with check_interval = 10 and
num_answers_to_alive = 2, once the server is back up and has answered two
status checks it is marked alive and requests go back to the remote proxy
server.
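
Those timers live in the home_server definition in proxy.conf. A minimal
sketch, with the address and secret made up, and status_check enabled so
that check_interval / num_answers_to_alive actually apply:

        home_server remote_radius {
                ipaddr = 192.0.2.10
                port = 1812
                secret = testing123
                status_check = status-server
                check_interval = 10
                num_answers_to_alive = 2
                zombie_period = 10
        }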

Granted, I won't have the values set this low in production, since this
will be a high-volume server with some critical services on it. I suspect I
will stick with 30 seconds or 1 minute for the check interval but keep the
zombie period at 20 seconds. That way, if a RADIUS server dies or becomes
unresponsive, we don't wait around long before marking it zombie and start
authenticating everyone locally, and there is a reasonable backoff before
we attempt to proxy again.
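
In config terms that would be roughly:

        zombie_period = 20
        check_interval = 30     # or 60
        num_answers_to_alive = 2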

Many thanks again.

Peter

