Robust Authentication Proxying

Sat Jul 11 19:08:19 CEST 2009

On Jul 11, 2009, at 10:15 AM, Alan DeKok wrote:

>  I think there's a fundamental disconnect here.  I'm trying to explain
> that RADIUS is an imperfect protocol.  You're trying to find ways of
> configuring FreeRADIUS to work around those imperfections.
>
>  My suggestions are:
>
> 1) realize that RADIUS is imperfect.  If a home server fails, there
> will *always* be a request that is lost, rejected, timed out, etc.
> The client WILL fail authentication and disconnect the user when this
> happens.  This is how RADIUS works.

You are talking to me like I do not understand how RADIUS works.  I  
understand RADIUS is imperfect.  My goal is to handle as many of the  
imperfections at the proxy level rather than at the end-user level  
(see below for why I want to do that).

>  2) proxy fail-over DOES work in the server.  Maybe not exactly the
> way you want... but I recall asking you for specific suggestions as to
> how to make it better, and getting... not much.

I am not sure how I can be more clear than "when a response isn't
received, retry the request to a different server rather than treat it
as a failure."  But you can read below for an algorithm for achieving
this.  The algorithm is simple (I think) and can be enabled via an
option or control VP, so it will not break any existing work.

>  3) If you want to try source code mods, go to src/main/event.c.  Look
> in the function no_response_to_proxied_request().  Find the line:
>
> 	post_proxy_fail_handler(request);
>
>  Delete it.  Re-compile && re-install radiusd.  Then try the fail over
> tests again.  It SHOULD cause fail-over to backup home servers for one
> request.  Do NOT configure the "do_not_respond" policy.  Try setting
> "response_window = 5" and "zombie_period = 5".

I did try that.  It did not do what I was attempting to do.  I am
trying to patch with the algorithm I describe further down.
FreeRADIUS can offer a robust transparent internal failover OR the
existing failure handling.  It is not an either/or scenario.

>  4) Try setting the home server pool type to "load-balance" (again, as
> I have suggested).  It WILL still fail over from one server to another.
> But the "load-balance" portion will spread the load MUCH more evenly
> across all home servers, and there will be FEWER failed requests when
> a home server dies.

I am not sure how you ever got the impression that the home server  
pool type has been set to anything but "load-balance".  It has been  
set to "load-balance" since the very beginning.

>> I guess I do not see those as negatives.  That is exactly what I want
>> to happen.  RADIUS network traffic is tiny.  The system load created
>> by sending multiple requests to a home server or a bunch of home
>> servers is minimal.  I am not seeing how you are adding any more load
>> when instead, the proxy sends back an Access-Reject, which, in the
>> best case scenario, will result in the end-client re-authenticating,
>> generating yet another request.
>
>  You will be sending packets to TWO home servers, rather than one.
> This might be fine in your situation.  It is definitely not fine in
> other situations.

In the case that one of the home servers fails to respond, yes, that  
is what that means.  I agree that is not ideal for all situations.   
Such behavior can be controlled via options.  Again, this is not an  
either/or scenario.  I am not asking you to break the existing  
implementation.  I am simply looking for a solution to my needs within  
the current framework.

>  FreeRADIUS is designed to work in a wide variety of environments.
> This means that it might NOT work exactly the way you demand.  The
> solution is simple: you have source code.  Fix it.  If we add a fix
> that will make *your* situation work, it is likely to break *other*
> people's networks.
>
>  We can't take that risk.

Again, that is what configuration options are for, so people can  
control their setups for their configurations.  You make it sound like  
I want to ditch your current methods of doing things.  I do not.  I am  
simply trying to find more flexibility in how things are handled.  I  
guarantee you that a solution for this will not break anyone else's  
networks unless they configure it.

>  Which method is working in 100,000 deployments?
>
>  You should note that I asked you for *specific* suggestions for a
> better algorithm.  Your response was "I want it to fail over sooner".
> That is unhelpful.

That is not what I requested.  I requested that it not send back a  
rejection to the NAS and instead internally retry the request with a  
different home server.  That is not "fail over sooner."  That is "fail  
over transparently."

One way of accomplishing this (an algorithm, as you have requested)
starts from the fact that when the NAS does not get back a response, it
will (if configured) retransmit the original request.  When the proxy
does not get back a response from the home server it chose (after
whatever length of time), it can:

1) fail out that home server,
2) not send any response back to the NAS, and
3) forget that it ever saw the original request from the NAS.

When the NAS retransmits its request, the proxy will see it as a new
original request and choose a new home server (the previous one has been
failed out).  Hopefully the proxy will then get a response from this new
home server that it will pass back to the NAS.  If the NAS is configured
with a relatively long authentication timeout and the proxy is
configured with a relatively short response window for the home server,
that should give the proxy enough time to try multiple home servers,
each try being triggered by a retransmit from the NAS, before the NAS
treats the proxy as having timed out and moves on to timeout handling
(see below for a scenario where timeout handling is different from
rejection handling).
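
As a rough illustration of the steps above, here is a minimal Python
sketch.  It is not FreeRADIUS code: the names HomeServer, Proxy and
nas_authenticate are hypothetical, the response window is modelled as an
immediate "no answer", and a live home server is assumed to always
accept.  It only shows the control flow of "fail out, stay silent,
forget, and let the NAS retransmit".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HomeServer:
    name: str
    alive: bool = True         # does the server actually answer?
    failed_out: bool = False   # has the proxy marked it as dead?


@dataclass
class Proxy:
    pool: List[HomeServer]
    # Tracking table: request id -> name of the home server it was sent to.
    in_flight: Dict[int, str] = field(default_factory=dict)

    def handle_request(self, request_id: int) -> Optional[str]:
        """Return a reply for the NAS, or None to stay silent."""
        candidates = [hs for hs in self.pool if not hs.failed_out]
        if not candidates:
            return None                      # nobody left to ask
        server = candidates[0]
        self.in_flight[request_id] = server.name
        if server.alive:
            del self.in_flight[request_id]
            return "Access-Accept"           # assumption: live servers accept
        # Response window expired (modelled as an immediate miss):
        server.failed_out = True             # 1) fail out the home server
        del self.in_flight[request_id]       # 3) forget the original request
        return None                          # 2) send nothing to the NAS


def nas_authenticate(proxy: Proxy, request_id: int,
                     max_transmits: int) -> Tuple[str, int]:
    """Hypothetical NAS: retransmit until a reply arrives or tries run out."""
    for attempt in range(1, max_transmits + 1):
        reply = proxy.handle_request(request_id)
        if reply is not None:
            return reply, attempt
    return "timeout", max_transmits


pool = [HomeServer("hs1", alive=False), HomeServer("hs2")]
print(nas_authenticate(Proxy(pool), request_id=1, max_transmits=3))
# prints ('Access-Accept', 2)
```

The NAS never sees a rejection here: the first transmission is silently
dropped after hs1 is failed out, and the retransmission is answered by
hs2.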

This requires relatively quick fail out of home servers, something
which can already be configured.  It also requires that the tracking
hash of the original request packet and proxy be cleared.  Both of these
can be enabled via an option, probably at the pool level, so that it
does not change the behavior for existing configs, but adds this
capability for people who want this kind of transparent failover.
However, I think the cleanest way to implement this is to add a new
Response-Packet-Type of (suggestion) 'Proxy-Clear-Response' that is
treated like the above for proxied requests and is treated as
'Do-Not-Respond' if not used within a post-proxy failure handler.

>  If you are so set on demanding something better, then offer
> *concrete* suggestions for how to fix it.  Look at the code.  It's
> available, commented, and reasonably clean.
>
>  Come up with a *better* method, and we'll implement it.  The current
> repetition of "it's bad and I want it to work differently" isn't
> useful.

Does that seem like a method that can work?  Again, not to replace
anything, but to supplement it.

>>> If you want the proxy to fail over, send it more than ONE request at
>>> a time (like a normal proxying system), and do NOT configure the "do
>>> not respond" policy.
>>
>> So my NAS now has to send two separate requests for the same
>> authentication, and pick the one that does not come back with an
>> Access-Reject?
>
>  That is not what I meant.  You seemed to be claiming that it NEVER
> failed over.  I pointed out that it does fail over, and gave an
> example of when and how it fails over.

Again .... transparent failover ... the NAS should not have to get an
error to effect a failover on the proxy.

>> To a NAS, there is a big difference between a timeout and a reject.
>> If it does not get a response, a NAS will typically handle the client
>> differently than if it gets an explicit rejection.
>
>  Huh?  How?  Will it accept the user?  Will it let them in?  Will it
> give them some "minimal" service, even if they weren't authenticated?
>
>  That violates all RADIUS specifications, best practices, and network
> security guidelines I'm aware of.

Many NASes can use an internal user cache as a backup to a
non-responding or slowly-responding RADIUS server.  If the proxy returns
an actual Access-Reject, the NAS accepts that and says the request is
invalid.  If the proxy returns nothing, the NAS can say, "Well, my
RADIUS server is down, but I have this record for the same user/pass in
my cache and previously, it received an Access-Accept.  Let me accept
this request."  That does not break any RADIUS specifications I know of,
but it does provide a better experience for the end-client, i.e., a down
RADIUS server does not completely kill your authentication abilities.

Certainly, if the home server returns a reject, that rejection should
get passed back to the end-client by the NAS, regardless of what it may
have cached, but if the home server passes back nothing, the NAS should
not receive a rejection.  The home RADIUS server never rejected
anything.  I would like to let the NAS decide if a timeout is treated as
a rejection or not, not the proxy.  Right now, I can do that with
do_not_respond, BUT it would be even better if the proxy would try
harder to contact more home servers if my NAS is waiting longer for a
RADIUS server to respond.
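
To make the timeout-versus-reject distinction concrete, here is a small
Python sketch of how such a NAS might decide.  The function nas_decide
and the plain dictionary cache are hypothetical and only illustrate the
behavior described above; a real NAS would presumably keep something
safer than a clear-text password in its cache.

```python
from typing import Dict, Optional


def nas_decide(reply: Optional[str], cache: Dict[str, str],
               user: str, password: str) -> str:
    """Hypothetical NAS policy.

    reply is "Access-Accept", "Access-Reject", or None when the RADIUS
    server (or proxy) did not answer before the NAS timed out.
    """
    if reply == "Access-Accept":
        # Remember the last credentials that were explicitly accepted.
        cache[user] = password
        return "accept"
    if reply == "Access-Reject":
        # An explicit rejection always wins, whatever is in the cache.
        return "reject"
    # Timeout: fall back to the cache of previously accepted logins.
    if cache.get(user) == password:
        return "accept"
    return "reject"


cache: Dict[str, str] = {}
print(nas_decide("Access-Accept", cache, "alice", "s3cret"))  # server up
print(nas_decide(None, cache, "alice", "s3cret"))             # server down
print(nas_decide("Access-Reject", cache, "alice", "s3cret"))  # explicit reject
```

The point is that the third branch is only reachable if the proxy stays
silent; an Access-Reject generated by the proxy on a home server timeout
takes that decision away from the NAS.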

>> Right now, a timeout
>> event from the home server results in an explicit rejection (unless I
>> configure it not to send that reject).  It IS possible to get the
>> number down to zero, because I have used RADIUS software that does
>> it.  The only time it should ever be non-zero is if all home servers
>> that can possibly be tried in a given window (which might not be all
>> of them, but is most likely going to be more than one of them) fail
>> to respond.  Like I said, I am trying to migrate to freeradius for
>> some other features.  I have used two other proprietary RADIUS server
>> software packages that implement this behavior.
>
>  Well... offer *specific* suggestions for changes to FreeRADIUS that
> will help implement this.  Try the suggested patches, and see if they
> help.

I will try specific patches.

>> I also understand that Access-Challenge can complicate the proxying,
>> but that is solvable as well with standard state tracking.
>
>  That I disagree with.  EAP makes re-routing proxied Access-Challenges
> pretty much impossible.  (Except in certain rare situations.)

Right.  You cannot reroute them.  You just have to make sure they get
sent to the home server that can handle them, which is usually the home
server that handled the initial Access-Request.  Like I said, you just
make sure that happens by tracking where the original Access-Challenge
response came from, or you ignore it and accept that if a failover
happens in between the Access-Request and the Access-Challenge, the
end-user will receive an error.
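
A minimal Python sketch of that kind of state tracking follows.  It
assumes the proxy can key on the State attribute that a home server puts
in an Access-Challenge and that the NAS echoes in its next
Access-Request.  StickyRouter and its methods are hypothetical names,
not anything in FreeRADIUS.

```python
from typing import Dict, Iterable, List, Optional, Set


class StickyRouter:
    """Pin challenge/response conversations to the server that started them."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.failed: Set[str] = set()
        self.state_owner: Dict[bytes, str] = {}  # State value -> home server

    def record_challenge(self, state: bytes, server: str) -> None:
        # Called when an Access-Challenge carrying this State is relayed.
        self.state_owner[state] = server

    def route(self, state: Optional[bytes] = None) -> Optional[str]:
        """Return the home server to use, or None if none is usable."""
        if state is not None:
            owner = self.state_owner.get(state)
            if owner is None or owner in self.failed:
                # The conversation cannot be moved to another server;
                # the end-user gets an error and has to start over.
                return None
            return owner
        # New conversation: any home server that is not failed out will do.
        for server in self.servers:
            if server not in self.failed:
                return server
        return None


router = StickyRouter(["hs1", "hs2"])
first = router.route()                 # new conversation goes to hs1
router.record_challenge(b"abc", first)
print(router.route(b"abc"))            # follow-up stays on hs1
router.failed.add("hs1")
print(router.route(b"abc"))            # hs1 died mid-conversation: None
print(router.route())                  # fresh requests move on to hs2
```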

>> I know that networks are imperfect.  The answer to that imperfection
>> is to retry, not to give up.  When you tell a NAS that the request
>> has been rejected when, in fact, it has not, you are not effectively
>> retrying.  You are saying, "Do not retry.  You actually got this
>> failed result."
>
>  No.  That's *your* NAS behavior.  Most NASes authenticate end users,
> who will hit the "connect network" button again when something fails.

Right, and my users complain when that happens.  I'm looking for a
RADIUS solution that provides a better experience for my customers.  I
am very understanding of the imperfections of networks in general and
RADIUS specifically, but my end-users are much less so.  It is easier
to use software to reduce complaints than it is to educate every user
on the intricacies of an authentication protocol.

> This is another cause of the miscommunication.  It seems that your
> NASes behave *very* differently from standard RADIUS NASes.  They
> treat timeouts as "re-try authentication".

They do not.  See above.

>  But.. if they behave that way, why did they time out in the first
> place?  Why not just set the timeouts to infinity?  That way
> authentication will *never* fail.

Timeouts, from the NAS's standpoint, are not treated as re-try
authentication.  The timeouts on my NASes are long enough for the proxy
the NAS communicates with to try multiple home servers.  The timeout on
the proxy for each home server is shorter, so that it fails over to
another home server more quickly, within the window of the
authentication request on the NAS.  I still want my NASes to time out.
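
The relationship between the two timeouts is simple arithmetic.  The
Python sketch below uses a hypothetical helper, not a FreeRADIUS
setting, and assumes the NAS retransmits at least once per proxy
response window and that network latency is negligible.  It estimates
how many home servers can each be given a full response window before
the NAS gives up.

```python
def max_home_servers_tried(nas_timeout: float, response_window: float,
                           pool_size: int) -> int:
    """Upper bound on home servers that get a full response window.

    nas_timeout     -- total seconds the NAS waits before timing out
    response_window -- seconds the proxy waits on each home server
    pool_size       -- number of home servers in the pool
    """
    if response_window <= 0:
        raise ValueError("response_window must be positive")
    return min(pool_size, int(nas_timeout // response_window))


# A 30 second NAS timeout with a 5 second response window leaves room
# for six sequential tries, so a pool of three can be tried completely.
print(max_home_servers_tried(30, 5, 3))   # prints 3
# A 12 second NAS timeout only leaves room for two full tries.
print(max_home_servers_tried(12, 5, 4))   # prints 2
```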

>> But look, I have gone through the code.  Ivan's right, that there is
>> no way to get the behavior I want in freeradius without either a
>> module (not sure if this is even possible to accomplish via a module
>> because proxying is not handled via a module) or by hacking the code
>> to change how proxy no-responses are handled.  It just frustrates me
>> that you challenge the value of this.
>
>  Nonsense.  I asked *specifically* for suggestions as to a better
> algorithm.  I'm refusing to implement a vague and poorly defined
> suggestion.  That shouldn't be a surprise.

I hope my description above of the proxy forgetting the original  
request when it fails out a home server is an example of a different  
algorithm.  I will not call it a better algorithm, because I am not  
looking for that algorithm to replace anything, just to supplement it  
as an additional option.  It is definitely a better algorithm for me.

>  Come up with a well-defined algorithm that's better than the current
> one, and we'll implement it.
>
>> For people like me who use freeradius not
>> to serve dial gear but to serve as robust authentication platforms
>> for on-network services, where sending a false rejection to a client
>> is an SLA issue, having a proxy that can robustly and transparently
>> handle transient network failures is very valuable.  With that, we do
>> not have to reprogram or replace NAS software (some of which we
>> cannot control) to handle those kinds of transient network failures
>> for us.
>
>  I understand that.  Please also understand that it doesn't help to
> say "make it better... I don't know HOW, but you guys need to make it
> better".

Please tell me if the example I gave above is a good HOW.  I am trying  
to come up with a code-based solution today, but I am not nearly as  
familiar with the code as you probably are.

Philip