Robust Authentication Proxying
Philip Molter
hrunting at hrunting.org
Sat Jul 11 16:04:06 CEST 2009
On Jul 11, 2009, at 2:14 AM, Alan DeKok wrote:
> Philip Molter wrote:
>> Yes, this is the configuration I'm currently running, and it's not
>> working for me. I have a radclient sending a request, retrying 10
>> times on a 5-second timer, and after 10 retries, it still hasn't
>> gotten a response. After the second retry, the proxy has marked the
>> server as at least a zombie and started status-checks, but every
>> retransmit after that is getting a cached result of no response.
>
> Could you possibly try READING my messages?
>
> The default configuration does NOT include the "do_not_respond"
> policy. *YOU* are the one who configured that, as I have said
> multiple times.
>
> If you don't want it to get the cached "do not respond" policy, then
>
> DON'T CONFIGURE IT
>
> It's that easy.
I do not want to get ANY cached response. I do not want to get any
Access-Reject. If I do not configure a 'do not respond' response, I
get an Access-Reject, which is even worse, because my end-client gets
an error when he should not. What I want is for a no-response from a
home server to be treated as a no-response to the NAS, so that the
subsequent retransmit from the NAS can be proxied to a different home
server.
>> This is what I want to happen
>>
>> client req -> proxy
>> proxy req -> home server #1
>> client ret -> proxy
>> proxy ret -> home server #1
>> [proxy fails home server #1 for lack of response]
>> client ret -> proxy
>> proxy req -> home server #2
>> proxy <- resp home server #2
>> client <- resp proxy
>
> It does that (mostly). But only if you don't break the server.
No, it does not do that at all. I have yet to see a retransmit from a
client actually get tried on a different server than the one used for
the original request. Once the proxy fails to receive a response from
the originally chosen home server, it handles the packet as a
failure. If it sends back an Access-Reject packet, the request is
rejected by the NAS to the client and the NAS stops retrying THAT
REQUEST (i.e. the end-client gets an error). If I add a configuration
to not send back anything, then the NAS will retransmit, but as you
have made abundantly clear, the proxy remembers that it sent back no
response to the original request and skips all further processing of
the retransmit.
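For reference, the 'do not respond' configuration under discussion is
roughly this (a sketch of the 2.x-era syntax; the do_not_respond
policy is, as I understand it, the one shipped in policy.conf, and the
Post-Proxy-Type section goes in the virtual server):

    # sites-enabled/default
    Post-Proxy-Type Fail {
            do_not_respond
    }

    # policy.conf (shipped definition, approximately)
    do_not_respond {
            update control {
                    Response-Packet-Type := Do-Not-Respond
            }
            handled
    }
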
I have set response_window and zombie_period to their minimums. I
have set them to their maximums. For any given single request, only
one home server is ever tried, and if that home server is down, the
request and any retransmits of that request will not succeed. Yes, if
the NAS sends another separate request with a different ID, it will be
proxied to a different home server, but that does not help the poor
guy whose request had the hard luck of hitting the bad home server.
He will get an error message. He will have to retry or call support
or whatever.
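For concreteness, these are the knobs I have been turning, in
proxy.conf (addresses, secret, and timer values here are illustrative,
not my real ones):

    home_server home1 {
            type = auth
            ipaddr = 192.0.2.10
            port = 1812
            secret = testing123
            response_window = 5      # seconds before the server is suspect
            zombie_period = 10       # seconds before it is marked zombie
            status_check = status-server
            check_interval = 30      # how often to status-check a zombie
            revive_interval = 120
    }
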
>> This is what happens without a post-proxy config:
>>
>> client req -> proxy
>> proxy req -> home server #1
>> client ret -> proxy
>> proxy ret -> home server #1
>> [proxy fails home server #1 for lack of response]
>> client <- rej proxy
>
> That happens for the most part because you played with the
> configuration to make the proxy timeouts super-short. As I said,
> don't do that.
It does not matter whether the timeouts are short or long. This
always happens. See my note above.
In fact, no matter what I set the timeouts to, it always seems to fail
the server and reject the request after the first retransmit to the
proxy (2 packets, about 10 seconds, regardless of the response_window
or zombie_period settings). Yes, a subsequent, different request will
go to a different home server, but, again, I want to use the proxy to
provide smarter resiliency across a pool of servers. If you know of
settings for response_window and zombie_period that will produce the
behavior in my "this is what I want to happen" example, could you
please provide them? All of the settings I have tried result in the
same behavior.
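For what it is worth, the pool I am testing against is a plain
fail-over pool, along these lines (names are illustrative):

    home_server_pool my_auth_pool {
            type = fail-over         # home2 is used only once home1 is marked dead
            home_server = home1
            home_server = home2
    }

    realm example.com {
            auth_pool = my_auth_pool
    }
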
>> Okay, so I obviously do not understand how I can tweak
>> response_window and zombie_period to make sure that requests that
>> can be serviced by many possible RADIUS home servers do not return
>> an Access-Reject when one of those home servers does not respond.
>
> i.e. you want NO request to fail processing when a home server fails.
>
> This is extremely difficult to do. Any naive approach that has quick
> failover can have other negative side-effects. (Additional network
> traffic, system load, duplicate processing of requests, etc.)
I guess I do not see those as negatives. That is exactly what I want
to happen. RADIUS network traffic is tiny. The system load created
by sending multiple requests to a home server or a bunch of home
servers is minimal. I do not see how that adds any more load than
having the proxy send back an Access-Reject instead, which, in the
best-case scenario, will result in the end-client re-authenticating,
generating yet another request. In the worst-case scenario, the
client accepts the reject as validation that their account cannot be
authorized and presents the wrong result to the end-user (whether that
be a guy sitting on the end of a dial-up line or a piece of system
software trying to determine whether an account is valid). All you
are doing is pushing the retry logic from a machine that knows that
there are multiple possible home servers onto a machine that does
not, via a response that says, effectively, "Do not retry. Your
request is invalid."
Your argument that the RADIUS server cannot handle a retry does not
hold water to me, but regardless, I can envision configurations where
you would want to minimize all processing by the RADIUS proxy itself
(most machines now have far more processing power than a simple RADIUS
proxy can consume, so that is not a common need anymore). I wish the
option were available. There seem to be knobs for a lot of other
things.
>> The client sends a request to the proxy. If a home server does not
>> respond within a short period of time to the request, a second home
>> server is chosen. If the second home server does not respond to the
>> same request, then a third is chosen. This continues until all
>> possible home servers are exhausted. At that point, an
>> Access-Reject packet is sent back to the client. Otherwise, the
>> response from the home server is sent back to the client.
>
> Doing that requires source code mods, because that quick fail-over
> can have negative side-effects. i.e. The server does NOT support
> configurations that can negatively affect its performance.
See my note above for why the work to be done by the server is no
more and no less than just returning a reject once the timeout is
hit. You are either going to process more retries to the home server
or more retries from the NAS. Either way, you are going to increase
your load.
> On top of that, the "try all possible home servers" approach is
> impossible. There is ALSO a 30-second lifetime for the request.
> After 30 seconds, the NAS has given up, so failing over to another
> home server is useless.
>
> On top of that, the NAS will only retry 3-6 times. So if you have 19
> home servers, at *best* it would fail over to 3-6 of them before the
> request is marked "timeout".
Okay, AT BEST you get 3-6 different home servers in a 30-second
period. Right now, AT BEST I get 1. Which method is more resilient?
Which method results in no false rejections being returned to the
NAS? The worst that can happen is that the NAS gets no response,
which is exactly what would happen if the NAS queried that one home
server directly. The proxy can even be smart about it and only retry
a different home server when the NAS retransmits (which I believe it
already does), so if the NAS stops retransmitting because it has
given up, so does the proxy. But please, let the NAS give up first.
The proxy does not know how many times the NAS will retry. I have my
NASes configured to retry for up to 60 seconds, once every 2 seconds.
They will retry 30 times. It is more important to me that
authentication requests succeed, even if they succeed slowly. It
sounds to me like FreeRADIUS is making assumptions about how NASes
should work and, as a result, reducing the flexibility it provides.
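My test client emulates that NAS behavior with radclient, roughly
like this (the secret and attributes are placeholders):

    echo "User-Name = test, User-Password = test" | \
            radclient -r 30 -t 2 proxy.example.com:1812 auth testing123
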
> I sincerely hope you see now that the situation is rather more
> complicated than the simple "try all home servers" statement.
>
>> How do I configure that? It doesn't seem to matter what I set
>> response_window or zombie_period to, once the first home server
>> fails to respond, an Access-Reject (or nothing if I configure a
>> post-proxy handler) is returned to the client. My client's not
>> going to retry the request if he gets an Access-Reject, so I need
>> the proxy to retry it.
>
> That last sentence is nonsense. Once the client gets an
> Access-Reject for *any* reason, it is impossible for the proxy to
> "retry" that request.
*sigh* Exactly. Once the client gets an Access-Reject, the NAS has
told the client that the request is invalid. An end-user querying the
NAS gets an error message. A piece of system software querying the
NAS gets notified that the account is not valid. The implication is
that a retry is futile, even though the account is not actually
invalid. The account is perfectly valid. The proxy just gave up too
soon (and by too soon, I mean "before it tried more than one of its
home servers"). I want the proxy to retry the request to a different
home server precisely to prevent the NAS (and thus the client) from
getting an Access-Reject when it does not have to. This is typically
how load-balancers with failover capability work. They try their best
to make sure individual requests succeed when they can.
> If you want the proxy to fail over, send it more than ONE request
> at a time (like a normal proxying system), and do NOT configure the
> "do not respond" policy.
So my NAS now has to send two separate requests for the same
authentication, and pick the one that does not come back with an
Access-Reject? Which NAS does that? Or are you saying that my
end-client has to refuse to accept the fact that he was rejected and
keep retrying until he either a) gets an accept or b) gets rejected
so many times he accepts it as gospel? Either way, it makes no sense.
Either way, the proxy is creating a retry loop.
Again, I am not arguing that the proxy will not fail over. It will
for subsequent requests. What a fail-over solution will typically do,
though, is fail over even for a given single request, so that all
requests are handled as resiliently as possible. In other words, a
NAS does not need to see a single failed request from the proxy for
the proxy to trigger a failover.
> The proxy WILL fail over, but due to the imperfect nature of the
> universe, some requests MAY time out and get rejected. With a better
> detection algorithm, the number of failures might get smaller than it
> is today, but it is IMPOSSIBLE to get the number down to zero.
To a NAS, there is a big difference between a timeout and a reject.
If it does not get a response, a NAS will typically handle the client
differently than if it gets an explicit rejection. Right now, a
timeout event from the home server results in an explicit rejection
(unless I configure it not to send that reject). It IS possible to
get the number down to zero, because I have used RADIUS software that
does it. The only time it should ever be non-zero is if all home
servers that can possibly be tried in a given window (which might not
be all of them, but is most likely going to be more than one of them)
fail to respond. Like I said, I am trying to migrate to FreeRADIUS
for some other features. I have used two other proprietary RADIUS
server packages that implement this behavior.
> No. RADIUS doesn't work like that. No amount of magic on the proxy
> will cause the NAS to retry forever (which is the only way to have
> the proxy cycle through all home servers for one request). If you
> configure the NAS to retry forever, then all you will do is push
> network failures off to some other part of the network.
Right. Precisely. I want to push the network failure handling to the
proxy, which has the knowledge that there are multiple points of
failure. The NAS does not know that there are 20 possible servers to
respond to it. All it knows is that there is 1 RADIUS server it can
talk to (the proxy) and if the proxy says the request was rejected,
the request is considered rejected. The end-client certainly does not
know what can fail. The proxy knows that there are 20 servers. When
it decides to fail one server out, it KNOWS a) that the proxied
request was not rejected, it just was not responded to by the home
server, and b) that it can try that request on another home server
before it tells the NAS that the request is rejected (the request has
not been rejected, of course, since no home server has responded one
way or the other, and until the proxy responds, the NAS will not know
either way).
I also understand that Access-Challenge can complicate the proxying,
but that is solvable as well with standard state tracking.
> This is how IP connectivity works: Networks are imperfect. There is
> absolutely nothing you can do about that.
I know that networks are imperfect. The answer to that imperfection
is to retry, not to give up. When you tell a NAS that the request has
been rejected when, in fact, it has not, you are not retrying; you
are saying, "Do not retry. You actually got this failed result."
But look, I have gone through the code. Ivan's right: there is no
way to get the behavior I want in FreeRADIUS without either a module
(and I am not sure this is even possible to accomplish via a module,
because proxying is not handled via a module) or hacking the code to
change how proxy no-responses are handled. It just frustrates me that
you challenge the value of this. For people like me who use
FreeRADIUS not to serve dial gear but as a robust authentication
platform for on-network services, where sending a false rejection to
a client is an SLA issue, having a proxy that can robustly and
transparently handle transient network failures is very valuable.
With that, we do not have to reprogram or replace NAS software (some
of which we cannot control) to handle those kinds of transient
network failures for us.
Philip