Radiusd hangs on redis cluster failover (sometimes)
Milan Nikolic
gen2brain at gmail.com
Fri Aug 9 13:25:05 CEST 2019
>
> Which still has "Debug:" added to the start of every line. So you've
> done something *other* than just "radiusd -X".
I did this:
[root@node2 idbox]# cat etc/raddb/radiusd.conf | grep debug_level
debug_level = 0
[root@node2 idbox]# radiusd -f -X -d etc/raddb/ > /tmp/radius.log
^C[root@node2 idbox]# cat /tmp/radius.log | grep "^Debug" | wc -l
1369
The snapshot is from two weeks ago. I also noticed that radclient now prints
Debug lines, and that detail logs for accounting messages now contain the
line from
https://github.com/FreeRADIUS/freeradius-server/blob/master/src/modules/proto_radius/proto_radius_acct.c#L125
i.e. "No accounting section found...", which is probably related to some
recent changes.
> If you power off the Redis node correctly, then the kernel closes all
> active TCP connections. Which means that FreeRADIUS gets a notice that the
> connection is gone, and can handle it.
> If you just power off the Redis node, then the kernel thinks that the TCP
> connections are still active. And then tries to connect until a timeout is
> reached.
This hint about the kernel helps. I will check the state of the TCP
connections and see if I can alter the behavior with some Redis options,
like tcp-keepalive.
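
For reference, a minimal sketch of what keepalive does at the socket level,
written in Python rather than the C that FreeRADIUS uses. The function names
and the default values (60/10/3) are hypothetical and chosen only for
illustration; nothing here is taken from rlm_redis. Note that tcp-keepalive
in redis.conf is, as far as I understand it, a server-side setting (the Redis
server probes its clients), so the client side would need its own keepalive
or timeout to notice a peer that vanished without closing the connection.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Enable TCP keepalive probes on an existing socket (hypothetical helper).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are platform-specific (present on
    # Linux), so each one is only set if this platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


def worst_case_detection_seconds(idle, interval, count):
    """Rough upper bound on how long an idle dead peer goes unnoticed."""
    return idle + interval * count


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    print("worst case (s):", worst_case_detection_seconds(60, 10, 3))
    s.close()
```

Keepalive probes are only sent on an otherwise idle connection, so this is
a complement to, not a replacement for, the module's own timeout.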
Thanks,
Milan
On Thu, Aug 8, 2019 at 8:13 PM Alan DeKok <aland at deployingradius.com> wrote:
> On Aug 8, 2019, at 12:05 PM, Milan Nikolic <gen2brain at gmail.com> wrote:
> > Nothing is different, they use the same build of FreeRADIUS, the same
> > config, everything should be the same, install is done from my custom RPM
> > repository. Every node connects to Redis via 127.0.0.1, ports 7001-7008.
>
> Hmm... OK.
>
> > Sorry about that, the log file is now here
> > https://pastebin.com/raw/5TPP9vEh
>
> Which still has "Debug:" added to the start of every line. So you've
> done something *other* than just "radiusd -X".
>
> > After the last line, when it tries to contact the node that is down, it
> > hangs and must be killed.
>
> The short answer is that the rlm_redis code is blocking in v4. It has a
> timeout, so it should come back at some point.
>
> But the rlm_redis module needs to be converted to use the async Redis
> API. Which is a matter of ongoing work.
>
> > What I noticed so far, I use virtual machines for testing, if I power off,
> > i.e. unplug power then radius on active node hangs, but if I do a proper
> > shut down it gets the new cluster topology and continues to work.
> > That doesn't make much sense to me, i.e. how is that related to the
> > problem, but that is what I see.
>
> If you power off the Redis node correctly, then the kernel closes all
> active TCP connections. Which means that FreeRADIUS gets a notice that the
> connection is gone, and can handle it.
>
> If you just power off the Redis node, then the kernel thinks that the
> TCP connections are still active. And then tries to connect until a
> timeout is reached.
>
> The recommendations are:
>
> a) don't hard-power critical systems
>
> b) if v4 works, great. If it doesn't, submit patches, or use v3.
>
> The short answer is "only use v4 if you know what you're doing."
>
> Alan DeKok.
>
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html