Handling unreliable proxy partners

Wed May 19 19:34:51 CEST 2021

I wrote code to cache every user of the downstream systems into a local
ldap database during post-auth with each realm as a sub OU in the directory
as we were only using username and password rather than EAP-TLS. That way I
could cache and encrypt the password as well as any additional VSAs the
client was sending as some had a per user static IP or a select range of
VSAs they could add.

Then I set very low retry and timeout values for them and used a fail
virtual server to look up ldap based on the realm, then check the
password and return any VSAs as needed.

Worked well for some very critical systems and all of the complaints about
general flakiness for some clients just went away. I think I had something
like 500 downstream realms but the code just created realms in ldap as
needed using some Perl.

Also added an attribute on the OU to disable caching if the customer wanted
the requests to fail if their server wasn’t responding.
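
A rough sketch of the cache-and-fallback logic described above, written in
Python rather than the Perl mentioned in this post. It is illustrative only:
an in-memory dict stands in for the LDAP directory, a salted PBKDF2 hash
stands in for the encrypted password, and every name (directory, cache_user,
fallback_auth, cache_enabled) is hypothetical.

```python
import hashlib
import hmac
import os

# Hypothetical in-memory stand-in for the LDAP directory: one entry per realm
# (the "OU"), holding a caching flag and the cached users for that realm.
directory = {}


def _hash(password, salt):
    # Salted PBKDF2 digest. This is an assumption made for the sketch; the
    # setup described above stored an encrypted password instead.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def cache_user(realm, username, password, vsas):
    """Post-auth step: remember a user the downstream server just accepted."""
    # Realms are created on first use, as described in the post.
    ou = directory.setdefault(realm, {"cache_enabled": True, "users": {}})
    if not ou["cache_enabled"]:
        return
    salt = os.urandom(16)
    ou["users"][username] = {
        "salt": salt,
        "hash": _hash(password, salt),
        "vsas": dict(vsas),
    }


def fallback_auth(realm, username, password):
    """Fail-path step: answer from the cache when the home server is down.

    Returns the cached VSAs on success, or None so the request fails.
    """
    ou = directory.get(realm)
    if ou is None or not ou["cache_enabled"]:
        return None
    entry = ou["users"].get(username)
    if entry is None:
        return None
    candidate = _hash(password, entry["salt"])
    if not hmac.compare_digest(entry["hash"], candidate):
        return None
    return entry["vsas"]
```

Setting cache_enabled to False on a realm corresponds to the per-OU attribute
for customers who would rather have requests fail than be answered from the
cache.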

Hardest part was having notifications on up/down of the virtual server as
it just handled it in the background.

On Thu, 20 May 2021 at 05:20, Paul Moser via Freeradius-Users <
freeradius-users at lists.freeradius.org> wrote:

>
> We proxy a lot of authentication to radius servers that we do not
> manage/control.
>
> In a number of failure scenarios in order to maintain a good customer
> experience it is better for us to give access to the network even though we
> can't authenticate the users so customers continue to receive service even
> if that means giving some non-customer access. We might restrict the length
> of these sessions or turn on QoS restrictions.
>
> Using fallback virtual servers and (in non fallback virtual servers)
> Post-Proxy-Type Fail-Authentication allows us to handle the cases where no
> radius servers are responding for a partner or proxying of a single request
> fails, by returning an access-accept ourselves. This kicks in automatically
> when one of those situations presents itself. Any common radius attributes
> that need to be set to put restrictions on those sessions can be put in a
> policy and called from each section rather than duplicating code.
>
> This works well.
>
> We'd also like a manual mechanism that our support team can trigger to
> cover other failure scenarios, e.g. the remote radius server is incorrectly
> returning access-reject for all valid users, and those scenarios that we
> haven't been able to think of but will occur, inevitably at the most
> inconvenient of times.
>
> My first attempt at this was that the support team could use radmin to set
> the home servers to dead, which would mean packets were routed via the
> fallback virtual server. I initially thought this worked as a solution, but
> if FreeRadius is doing status checks against the remote servers then it
> will automatically bring them back into service as long as the status check
> requests are answered. If, say, the remote partner is responding with
> access-rejects to even valid users, that is not what you want.
>
> I also considered using ip tables to drop the packets, which would trigger
> the use of the fallback virtual server, but if you'd decided to go into
> this state of accepting everything because of some intermittent network
> problem then dropping all packets with ip tables would make debugging that
> problem using radclient/radtest very difficult.
>
> One idea I haven't explored is having two copies of each virtual server,
> in different files, one for the normal situation and one for the failure
> situation and switching which one to use using symlinks and radmin to
> reload the configuration.
>
> What I have come up with so far is, within a virtual server's pre-proxy
> section, to use the exec module to call a simple shell script that checks
> for the presence of flag files indicating which, if any, partners are in a
> bad state. The support team are responsible for creating these files. If
> any flag files are present, the script adds a radius attribute for each,
> the value indicating which partner. In the pre-proxy section I can then
> check for this attribute and value; if it indicates that the partner the
> virtual server is handling is in a failure state, I call accept from the
> always module, which will cancel the proxying attempt and send an
> access-accept. We can also call any policy that would get called in the
> fallback virtual server or Post-Proxy-Type Fail-Authentication if we want
> common radius attributes to be returned in the response to apply some sort
> of QoS restriction.
>
> The rlm_exec documentation states using exec is very slow and something
> like the perl module would be more appropriate for a live environment.
> Before I carry on down the path of performance testing this and trying
> perl/python/rest/custom C module does anyone have any thoughts/observations
> or alternative suggestions?
>
>
> Thanks,
>
> Paul
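
The flag-file check described in the quoted message could look roughly like
the following in Python. This is a minimal standalone sketch, not a working
FreeRADIUS module: the directory path, the one-file-per-partner naming
convention, and both function names are assumptions made for illustration.

```python
import os

# Hypothetical layout: support staff create an empty file named after the
# partner in FLAG_DIR to force fallback handling for that partner.
FLAG_DIR = "/var/run/radius-partner-flags"


def partners_in_failure(flag_dir=FLAG_DIR):
    """Return the set of partner names that currently have a flag file."""
    try:
        with os.scandir(flag_dir) as entries:
            return {entry.name for entry in entries if entry.is_file()}
    except FileNotFoundError:
        # No flag directory means no partner has been flagged.
        return set()


def should_bypass_proxy(partner, flag_dir=FLAG_DIR):
    """True when the partner has been manually flagged as failing."""
    # Membership in the listed names avoids building a path from the
    # partner string, so an odd partner name cannot escape flag_dir.
    return partner in partners_in_failure(flag_dir)
```

A caller would check should_bypass_proxy("partner-a") before proxying and,
when it returns True, send the access-accept with the restricted attributes
itself. Running this inside a long-lived embedded interpreter rather than
through exec avoids starting a new process per request, which is the cost the
rlm_exec documentation warns about.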