Handling unreliable proxy partners

Wed May 19 19:20:05 CEST 2021

We proxy a lot of authentication to RADIUS servers that we do not manage or control.

In a number of failure scenarios, it is better for the customer experience to grant access to the network even though we can't authenticate the users. That way customers continue to receive service, even if it means giving some non-customers access. We might restrict the length of these sessions or turn on QoS restrictions.

Using fallback virtual servers and (in non-fallback virtual servers) Post-Proxy-Type Fail-Authentication lets us handle the cases where no RADIUS servers are responding for a partner, or where proxying of a single request fails, by returning an Access-Accept ourselves. This kicks in automatically when one of those situations arises. Any common RADIUS attributes needed to put restrictions on those sessions can be put in a policy and called from each section rather than duplicating code.

This works well.

We'd also like a manual mechanism that our support team can trigger to cover other failure scenarios, e.g. the remote RADIUS server incorrectly returning Access-Reject for all valid users, as well as the scenarios we haven't been able to think of but that will inevitably occur at the most inconvenient of times.

My first attempt was to have the support team use radmin to mark the home servers as dead, so that packets would be routed via the fallback virtual server. I initially thought this worked. However, if FreeRADIUS is doing status checks against the remote servers, it will automatically bring them back into service as long as the status-check requests get responses. That is not what you want if, say, the remote partner is responding with Access-Rejects even to valid users.

I also considered using iptables to drop the packets, which would trigger the use of the fallback virtual server. But if we had decided to go into this accept-everything state because of some intermittent network problem, dropping all packets with iptables would make debugging that problem with radclient/radtest very difficult.

One idea I haven't explored is having two copies of each virtual server in different files, one for the normal situation and one for the failure situation, then using symlinks to select which one is active and radmin to reload the configuration.
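To make the symlink idea a little more concrete, here is a rough Python sketch of the switching step only. It is untested against a real FreeRADIUS install: the function name and file names are hypothetical, it assumes a POSIX filesystem, and the reload via radmin is deliberately not shown.

```python
import os
import tempfile


def switch_symlink(link_path, target):
    """Point link_path at target, replacing any existing link in one rename."""
    tmp_link = link_path + ".tmp"  # hypothetical temporary name next to the link
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # rename() replaces the old symlink itself, so readers never see a missing file.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as conf_dir:
        normal = os.path.join(conf_dir, "partner-a.normal")      # hypothetical file
        fallback = os.path.join(conf_dir, "partner-a.fallback")  # hypothetical file
        for path in (normal, fallback):
            open(path, "w").close()
        active = os.path.join(conf_dir, "partner-a")
        switch_symlink(active, normal)
        switch_symlink(active, fallback)
        print(os.readlink(active))
```

Creating the new link under a temporary name and renaming it over the old one avoids a window in which the active file does not exist, which matters if a reload happens at the wrong moment.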

What I have come up with so far is to use the exec module, within a virtual server's pre-proxy section, to call a simple shell script that checks for the presence of flag files indicating which partners, if any, are in a bad state. The support team is responsible for creating these files. If any flag files are present, the script adds a RADIUS attribute for each one, with the value indicating which partner. In the pre-proxy section I can then check for this attribute and value. If it indicates that the partner the virtual server is handling is in a failure state, I call accept from the always module, which cancels the proxying attempt and sends an Access-Accept. We can also call any policy that would be called in the fallback virtual server or Post-Proxy-Type Fail-Authentication if we want common RADIUS attributes returned in the response to apply some sort of QoS restriction.
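For illustration, the flag-file check itself could look roughly like this in Python. This is only a sketch and not my actual shell script: the ".fail" suffix, the function names and the one-file-per-partner layout are assumptions, and the wiring into FreeRADIUS (returning an attribute to the pre-proxy section) is not shown.

```python
import os
import tempfile

FLAG_SUFFIX = ".fail"  # hypothetical naming convention for flag files


def failed_partners(flag_dir):
    """Return the sorted names of partners that have a flag file in flag_dir."""
    try:
        entries = os.listdir(flag_dir)
    except FileNotFoundError:
        # No flag directory means no partner has been flagged.
        return []
    return sorted(
        name[: -len(FLAG_SUFFIX)]
        for name in entries
        if name.endswith(FLAG_SUFFIX) and len(name) > len(FLAG_SUFFIX)
    )


def should_force_accept(partner, flag_dir):
    """True if the support team has flagged this partner as failing."""
    return partner in failed_partners(flag_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as flag_dir:
        # Simulate the support team creating a flag file for one partner.
        open(os.path.join(flag_dir, "partner-a" + FLAG_SUFFIX), "w").close()
        print(failed_partners(flag_dir))
        print(should_force_accept("partner-a", flag_dir))
        print(should_force_accept("partner-b", flag_dir))
```

Each virtual server would only need the should_force_accept answer for its own partner. Listing every flagged partner mirrors the one-attribute-per-flag-file behaviour described above.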

The rlm_exec documentation states that using exec is very slow and that something like the Perl module would be more appropriate for a live environment. Before I carry on down the path of performance testing this and trying a Perl, Python, REST or custom C module, does anyone have any thoughts, observations or alternative suggestions?

Thanks,

Paul