Radiusd hangs on redis cluster failover (sometimes)
aland at deployingradius.com
Thu Aug 8 20:12:13 CEST 2019
On Aug 8, 2019, at 12:05 PM, Milan Nikolic <gen2brain at gmail.com> wrote:
> Nothing is different, they use the same build of FreeRADIUS, the same
> config, everything should be the same, install is done from my custom RPM
> repository. Every node connects to Redis via 127.0.0.1, ports 7001-7008.
> Sorry about that, the log file is now here https://pastebin.com/raw/5TPP9vEh
Which still has "Debug:" added to the start of every line. So you've done something *other* than just "radiusd -X".
> After the last line, when it tries to contact the node that is down, it
> hangs and must be killed.
The short answer is that the rlm_redis code is blocking in v4. It has a timeout, so it should come back at some point.
But the rlm_redis module needs to be converted to use the async Redis API. Which is a matter of ongoing work.
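To illustrate the difference between a blocking call and an async one, here is a minimal Python sketch. It is not FreeRADIUS or rlm_redis code; the names (slow_lookup, lookup_with_timeout, handle_other_requests) and the timings are made up for the example. An unanswered lookup is bounded by a timeout, and because the wait is asynchronous, other work keeps running in the meantime.

```python
import asyncio
import time


async def slow_lookup(delay):
    # Hypothetical stand-in for a Redis request to a node that is not answering.
    await asyncio.sleep(delay)
    return "reply"


async def lookup_with_timeout(delay, timeout):
    # Give up after `timeout` seconds instead of waiting forever.
    try:
        return await asyncio.wait_for(slow_lookup(delay), timeout)
    except asyncio.TimeoutError:
        return None


async def handle_other_requests(count, interval):
    # Hypothetical stand-in for unrelated requests the server should still serve.
    handled = 0
    for _ in range(count):
        await asyncio.sleep(interval)
        handled += 1
    return handled


async def demo():
    start = time.monotonic()
    # Both coroutines run concurrently; the stuck lookup does not stall the others.
    result, handled = await asyncio.gather(
        lookup_with_timeout(delay=5.0, timeout=0.3),
        handle_other_requests(count=5, interval=0.05),
    )
    return result, handled, time.monotonic() - start


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a blocking call, the same stuck lookup would hold up the calling thread for the whole timeout, which is the behaviour described above.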
> What I noticed so far, I use virtual machines for testing, if I power off,
> i.e. unplug power then radius on active node hangs, but if I do a proper
> shut down it gets the new cluster topology and continues to work.
> That doesn't make much sense to me, i.e. how is that related to the
> problem, but that is what I see.
If you shut down the Redis node properly, then its kernel closes all active TCP connections. Which means that FreeRADIUS gets a notice that the connection is gone, and can handle it.
If you just cut power to the Redis node, then nothing tells the other end, and the kernel on the FreeRADIUS machine thinks that the TCP connections are still active. It then keeps trying to reach the node until it hits a timeout.
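The two cases can be approximated on one machine with a small Python sketch. This is an illustration under assumptions, not a reproduction of the real failure: a true power loss cannot be simulated locally, so a peer that accepts the connection and then stays silent stands in for the dead node. The helper names (start_peer, wait_for_reply) are hypothetical. The client only reads here, so it shows the "no notice arrives" side of the problem and not the kernel's retransmission behaviour.

```python
import socket
import threading
import time


def start_peer(close_cleanly, hold_seconds=1.0):
    """Start a one-shot local TCP peer and return the port it listens on."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve():
        conn, _ = listener.accept()
        if not close_cleanly:
            # Send nothing and no FIN for a while, like a node that lost power.
            time.sleep(hold_seconds)
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def wait_for_reply(port, timeout):
    """Return 'closed' if the peer closed the connection, 'timeout' if it stayed silent."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        try:
            data = client.recv(1024)
        except socket.timeout:
            # No data and no close notification arrived within the timeout.
            return "timeout"
        # An empty read means the peer closed the connection cleanly.
        return "closed" if data == b"" else "data"


if __name__ == "__main__":
    print(wait_for_reply(start_peer(close_cleanly=True), timeout=2.0))
    print(wait_for_reply(start_peer(close_cleanly=False), timeout=0.2))
```

In the clean case the client learns right away that the connection is gone. In the silent case it only finds out when its own timeout expires.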
The recommendations are:
a) don't hard-power critical systems
b) if v4 works, great. If it doesn't, submit patches, or use v3.
The short answer is "only use v4 if you know what you're doing."