Radiusd hangs on redis cluster failover (sometimes)

Milan Nikolic gen2brain at gmail.com
Thu Aug 8 18:05:51 CEST 2019


>
> What "other" node?  I know there's a cluster, but what is different
> between the two nodes?


Nothing is different. Both nodes use the same build of FreeRADIUS and the
same config. Everything should be identical, since the install is done from
my custom RPM repository. Every node connects to Redis via 127.0.0.1, ports
7001-7008.

>   Please don't attach log files as zips.  The mailing list deletes them.
> Attach log files in-line.  If they're too large, put them on a pastebin web
> site somewhere.


Sorry about that. The log file is now here:
https://pastebin.com/raw/5TPP9vEh

After the last line, when it tries to contact the node that is down, radiusd
hangs and must be killed.
What I have noticed so far: I use virtual machines for testing. If I power
off a node abruptly (i.e. pull the power), radiusd on the active node hangs,
but if I do a proper shutdown it gets the new cluster topology and continues
to work.
I don't see how that is related to the problem, but that is what I observe.
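
One possible explanation for the difference, offered as an assumption and not
confirmed anywhere in this thread: on a proper shutdown the operating system
closes its TCP connections, so the peer is told that the connection is gone,
while a power cut sends nothing at all, so a blocking read without a timeout
can wait for a very long time. The sketch below is plain Python and not
FreeRADIUS or hiredis code; `wait_for_reply` is a hypothetical helper that
only illustrates how a read timeout separates the two cases.

```python
import socket


def wait_for_reply(sock, timeout=1.0):
    """Do one read on a connected socket and classify the outcome.

    Returns 'reply' if data arrived, 'closed' if the peer closed or reset
    the connection (what a clean shutdown looks like), or 'timeout' if the
    peer stayed silent (what an abrupt power-off looks like to the client).
    """
    sock.settimeout(timeout)  # without this, recv() could block indefinitely
    try:
        data = sock.recv(4096)
    except socket.timeout:
        return "timeout"
    except ConnectionResetError:
        return "closed"
    # An empty read means the peer closed the connection in an orderly way.
    return "reply" if data else "closed"
```

With a timeout in place, a silent peer becomes an error the caller can act
on, for example by marking the node as down and refreshing the topology from
another node, instead of waiting on a reply that never arrives.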

Thanks,
Milan


On Thu, Aug 8, 2019 at 4:37 PM Alan DeKok <aland at deployingradius.com> wrote:

> On Aug 7, 2019, at 1:37 PM, Milan Nikolic <gen2brain at gmail.com> wrote:
> >
> > I have an issue with FreeRADIUS 4.0.x and Redis cluster. When I shut
> > down one of the nodes (all have freeradius and use redis cluster),
> > redis recovers and cluster state is OK but it seems freeradius doesn't
> > refresh cluster topology, and when I send a packet to one of the
> > working nodes it is trying to send command to node that is down and
> > then just hangs and doesn't return response. I cannot stop radiusd
> > after that (i.e. ctrl+c doesn't work) and it must be killed.
>
>   That isn't good.
>
> > The last line in log is this, and nothing is printed after that:
> >
> > Debug : (7)        rediswho - [16] >>> Sending command(s) to
> > 192.168.1.8:7004 (fr_redis_cluster_state_init)
> >
> > Btw. I changed the message in cluster.c just to confirm which function
> > is called (there are two same Sending command(s) msg in that file), it
> > is this line:
> > https://github.com/FreeRADIUS/freeradius-server/blob/master/src/lib/redis/cluster.c#L1784
> > So 192.168.1.8 is the node that I shut down to test high availability,
> > and I send packet after redis is recovered.
> >
> > This doesn't happen when I shut down the other node,
>
>   What "other" node?  I know there's a cluster, but what is different
> between the two nodes?
>
> > I can see in log
> > how radius refreshes cluster topology and everything just continues to
> > work. Before every test, I always make sure cluster state is ok and
> > master/slaves are in balance on all nodes.
> >
> > Attached is a log file I get with `radiusd -X` on the node that fails
> > and hangs after it tries to contact node that is down.
>
>   Please don't attach log files as zips.  The mailing list deletes them.
> Attach log files in-line.  If they're too large, put them on a pastebin web
> site somewhere.
>
>   Alan DeKok.
>
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html

