v2.x.x redundant-load-balance broken

Brian De Wolf bldewolf at cpp.edu
Wed Mar 25 03:14:01 CET 2015


On Tue, 24 Mar 2015 17:19:40 -0500
Alan DeKok <aland at deployingradius.com> wrote:

> > This causes the other tests to randomly fail, as it sometimes load
> > balances to the second member, which causes it to only try the three
> > fail modules.
> 
>   It should loop around to the beginning if there’s a failure.  The
> code should do that...
> 

The code to loop around is there and works, but because of the
off-by-one error it stops before it gets that far.  Since it only does
N-1 tries, a pick that starts on the 2nd member tries the 2nd, 3rd and
4th, then gives up.  If it had done N tries, it would have wrapped
around to the 1st module and succeeded.
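
To make the arithmetic concrete, here is a minimal sketch of the
behaviour I'm describing.  It is not the actual server code: the
function names, the bool array standing in for the group members, and
the assumed test group (one member that succeeds followed by three that
fail) are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of a redundant-load-balance group: members[i] is
 * true if member i would return ok, false if it would return fail.
 * Starting at 'start', walk the group with wrap-around and make at
 * most 'max_tries' attempts.  Returns the index of the member that
 * succeeded, or -1 if every attempt failed.
 */
static int try_members(const bool *members, size_t count,
		       size_t start, size_t max_tries)
{
	size_t idx = start;
	size_t tries;

	assert(count > 0 && start < count);

	for (tries = 0; tries < max_tries; tries++) {
		if (members[idx]) return (int) idx;
		idx = (idx + 1) % count;	/* loop around to the beginning */
	}

	return -1;
}

/* The off-by-one case: only N-1 attempts are made. */
static int pick_off_by_one(const bool *members, size_t count, size_t start)
{
	return try_members(members, count, start, count - 1);
}

/* N attempts: the walk can get all the way back around the group. */
static int pick_full_loop(const bool *members, size_t count, size_t start)
{
	return try_members(members, count, start, count);
}
```

With a group of { ok, fail, fail, fail } and a pick that lands on the
second member (index 1), pick_off_by_one() returns -1 while
pick_full_loop() returns 0.  When the pick lands on the first member,
both succeed, which would explain why the failures look random.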

> >  What puzzles me is that, when I add this config instead:
> > 
> > redundant-load-balance {
> > 	fail
> > 	ok
> > 	fail
> > 	fail
> > }
> > 
> > I stop getting random failures.  When I add logging to record which
> > one we picked and to identify the module before we call
> > modcall_child, it says:
> > 
> > ++load-balance redundant-load-balance {
> > pick is 3
> > ++redundant-load-balance group redundant-load-balance {
> > trying 0x13a7780
> > +++[fail] = fail
> > +++[fail] = fail
> > trying 0x13a77e0
> > +++[fail] = fail
> > trying 0x13a7550
> > +++[fail] = fail
> > +++[ok] = ok
> > ++} # redundant-load-balance group redundant-load-balance = ok
> > 
> > It's not clear to me why it's listing fail multiple times for some
> > modcalls, or where that last ok comes from.
> 
>   Maybe your instrumentation code is wrong?
> 

Even with the standard debugging, it was printing more than three
"+++[fail] = fail" lines before the "+++[ok] = ok" line on a group with
only 4 modules.  I
added the extra debugging lines to try to clear things up and they just
made me more confused.  I was hoping it was something obvious, like a
quirk from using fail/ok in a redundant-load-balance (because what kind
of silly person would do that?).

> > Anyway, I checked v3.x.x for the off-by-one error and it looks like
> > the loop was re-done to avoid count entirely.  Maybe more of the
> > v3.x.x code needs to be back ported?
> 
>   Try the change.  If it fixes the problem, send a patch, and I’ll
> put it in 5 min later.
> 

I'll try to poke at this some more next week.


