Radiusd hangs on redis cluster failover (sometimes)

Milan Nikolic gen2brain at gmail.com
Fri Aug 9 13:25:05 CEST 2019


>
>   Which still has "Debug:" added to the start of every line.  So you've
> done something *other* than just "radiusd -X".


I did this:

    [root@node2 idbox]# cat etc/raddb/radiusd.conf | grep debug_level
    debug_level = 0
    [root@node2 idbox]# radiusd -f -X -d etc/raddb/ > /tmp/radius.log
    ^C[root@node2 idbox]# cat /tmp/radius.log | grep "^Debug" | wc -l
    1369

The snapshot is from two weeks ago. I also noticed that radclient now prints
Debug lines, and that detail logs for accounting messages now contain the
message from
https://github.com/FreeRADIUS/freeradius-server/blob/master/src/modules/proto_radius/proto_radius_acct.c#L125
i.e. "No accounting section found...". This is probably related to some
recent changes.


> If you power off the Redis node correctly, then the kernel closes all
> active TCP connections.  Which means that FreeRADIUS gets a notice that the
> connection is gone, and can handle it.
> If you just power off the Redis node, then the kernel thinks that the TCP
> connections are still active.  And then tries to connect until a timeout is
> reached.
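
To make the difference concrete, here is a minimal sketch in Python (not
FreeRADIUS or rlm_redis code). It assumes that a peer which stays open but
never sends anything is a fair stand-in for a hard-powered-off node, since in
both cases the client receives no packet at all. The helper names are
hypothetical.

```python
import socket


def open_loopback_pair():
    """Return (client, server_side) TCP sockets connected over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    server_side, _ = listener.accept()
    listener.close()
    return client, server_side


def read_outcome(client, timeout=0.5):
    """Describe what a blocking read sees: data, a clean close, or silence."""
    client.settimeout(timeout)
    try:
        data = client.recv(1024)
    except socket.timeout:
        return "timeout"  # nothing arrived; the client cannot tell why
    return "closed" if data == b"" else "data"


# Proper shutdown: the peer's kernel sends FIN, so the read returns at once.
client, peer = open_loopback_pair()
peer.close()
clean = read_outcome(client)
client.close()

# Stand-in for a hard power-off: the peer says nothing, so the read only
# returns when the client's own timeout expires.
client, peer = open_loopback_pair()
silent = read_outcome(client)
client.close()
peer.close()

print(clean, silent)
```

In the first case the client learns about the closed connection immediately.
In the second it learns nothing until its own timeout fires, which matches
the hang seen after unplugging the VM.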


This hint about the kernel helps. I will check the state of the TCP
connections and see if I can alter the behavior with some Redis options, like
tcp-keepalive.
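
For reference, a minimal sketch of what TCP keepalive looks like at the socket
level, in Python. This is not how FreeRADIUS or Redis configure it. The
function name and the timing values are assumptions chosen for illustration,
and the per-socket tuning options are only applied where the platform exposes
them (they are the Linux names).

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive so a silent, dead peer is eventually detected.

    idle/interval/probes are illustrative values, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning; skipped on platforms that lack these constants.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("keepalive enabled:", keepalive_on)
```

With these values the kernel would give up on an idle connection to a dead
peer after roughly idle + interval * probes = 60 seconds. Note that this is
set on the client side of the connection. As far as I understand, the Redis
tcp-keepalive option controls probes sent by the Redis server towards its
clients, so on its own it may not help the client notice a dead server.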

Thanks,
Milan


On Thu, Aug 8, 2019 at 8:13 PM Alan DeKok <aland at deployingradius.com> wrote:

> On Aug 8, 2019, at 12:05 PM, Milan Nikolic <gen2brain at gmail.com> wrote:
> > Nothing is different, they use the same build of FreeRADIUS, the same
> > config, everything should be the same, install is done from my custom RPM
> > repository. Every node connects to Redis via 127.0.0.1, ports 7001-7008.
>
>   Hmm... OK.
>
> > Sorry about that, the log file is now here
> > https://pastebin.com/raw/5TPP9vEh
>
>   Which still has "Debug:" added to the start of every line.  So you've
> done something *other* than just "radiusd -X".
>
> > After the last line, when it tries to contact the node that is down, it
> > hangs and must be killed.
>
>   The short answer is that the rlm_redis code is blocking in v4.  It has a
> timeout, so it should come back at some point.
>
>   But the rlm_redis module needs to be converted to use the async Redis
> API.  Which is a matter of ongoing work.
>
> > What I noticed so far, I use virtual machines for testing, if I power off,
> > i.e. unplug power then radius on active node hangs, but if I do a proper
> > shut down it gets the new cluster topology and continues to work.
> > That doesn't make much sense to me, i.e. how is that related to the
> > problem, but that is what I see.
>
>   If you power off the Redis node correctly, then the kernel closes all
> active TCP connections.  Which means that FreeRADIUS gets a notice that the
> connection is gone, and can handle it.
>
>   If you just power off the Redis node, then the kernel thinks that the
> TCP connections are still active.  And then tries to connect until a
> timeout is reached.
>
>   The recommendations are:
>
> a) don't hard-power critical systems
>
> b) if v4 works, great.  If it doesn't, submit patches, or use v3.
>
>   The short answer is "only use v4 if you know what you're doing."
>
>   Alan DeKok.
>
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html

