Robust Authentication Proxying
Philip Molter
hrunting at hrunting.org
Sat Jul 11 16:04:06 CEST 2009
On Jul 11, 2009, at 2:14 AM, Alan DeKok wrote:
> Philip Molter wrote:
>> Yes, this is the configuration I'm currently running, and it's not
>> working for me. I have a radclient sending a request, retrying 10
>> times on a 5-second timer, and after 10 retries, it still hasn't
>> gotten a response. After the second retry, the proxy has marked the
>> server as at least a zombie and started status-checks, but every
>> retransmit after that is getting a cached result of no response.
>
> Could you possibly try READING my messages?
>
> The default configuration does NOT include the "do_not_respond"
> policy. *YOU* are the one who configured that, as I have said
> multiple times.
>
> If you don't want it to get the cached "do not respond" policy, then
>
> DON'T CONFIGURE IT
>
> It's that easy.
I do not want to get ANY cached response. I do not want to get any
Access-Reject. If I do not configure a 'do not respond' response, I
get an Access-Reject, which is even worse, because my end-client gets
an error when he should not. What I want is for a no-response from a
home server to be treated as a no-response to the NAS, so that the
subsequent retransmit from the NAS can be proxied to a different home
server.
>> This is what I want to happen
>>
>> client req -> proxy
>> proxy req -> home server #1
>> client ret -> proxy
>> proxy ret -> home server #1
>> [proxy fails home server #1 for lack of response]
>> client ret -> proxy
>> proxy req -> home server #2
>> proxy <- resp home server #2
>> client <- resp proxy
>
> It does that (mostly). But only if you don't break the server.
No, it does not do that at all. I have yet to see a retransmit from a
client actually get tried on a different server than the one used for
the original request. Once the proxy fails to receive a response from
the originally chosen home server, it handles the packet as a
failure. If it sends back an Access-Reject packet, the request is
rejected by the NAS to the client and the NAS stops retrying THAT
REQUEST (i.e. the end-client gets an error). If I add a configuration
to not send back anything, then the NAS will retransmit, but as you
have made abundantly clear, the proxy remembers that it sent back no
response to the original request and skips all further processing of
the retransmit.
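For reference, the 'do not respond' configuration under discussion is
roughly this (a sketch of the 2.x-era syntax; the do_not_respond
policy is, as I understand it, the one shipped in policy.conf, and the
Post-Proxy-Type section goes in the virtual server):

    # sites-enabled/default
    Post-Proxy-Type Fail {
            do_not_respond
    }

    # policy.conf (shipped definition, approximately)
    do_not_respond {
            update control {
                    Response-Packet-Type := Do-Not-Respond
            }
            handled
    }
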
I have set response_window and zombie_period to their minimums. I
have set them to their maximums. For any given single request, only
one home server is ever tried, and if that home server is down, the
request and any retransmits of that request will not succeed. Yes, if
the NAS sends another separate request with a different ID, it will be
proxied to a different home server, but that does not help the poor
guy whose request had the hard luck of hitting the bad home server.
He will get an error message. He will have to retry or call support
or whatever.
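For concreteness, these are the knobs I have been turning, in
proxy.conf (addresses, secret, and timer values here are illustrative,
not my real ones):

    home_server home1 {
            type = auth
            ipaddr = 192.0.2.10
            port = 1812
            secret = testing123
            response_window = 5      # seconds before the server is suspect
            zombie_period = 10       # seconds before it is marked zombie
            status_check = status-server
            check_interval = 30      # how often to status-check a zombie
            revive_interval = 120
    }
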
>> This is what happens without a post-proxy config:
>>
>> client req -> proxy
>> proxy req -> home server #1
>> client ret -> proxy
>> proxy ret -> home server #1
>> [proxy fails home server #1 for lack of response]
>> client <- rej proxy
>
> That happens for the most part because you played with the
> configuration to make the proxy timeouts super-short. As I said,
> don't do that.
It does not matter whether the timeouts are short or long. This
always happens. See my note above.
In fact, no matter what I set the timeouts to, it always seems to fail
the server and reject the request after the first retransmit to the
proxy (2 packets, about 10 seconds, regardless of the response_window
or zombie_period settings). Yes, a subsequent, different request will
go to a different home server, but, again, I want to use the proxy to
provide smarter resiliency across a pool of servers. If you know of
settings for response_window and zombie_period that will produce the
behavior in my "this is what I want to happen" example, could you
please provide them? All of the settings I have tried result in the
same behavior.
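For what it is worth, the pool I am testing against is a plain
fail-over pool, along these lines (names are illustrative):

    home_server_pool my_auth_pool {
            type = fail-over         # home2 is used only once home1 is marked dead
            home_server = home1
            home_server = home2
    }

    realm example.com {
            auth_pool = my_auth_pool
    }
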
>> Okay, so I obviously do not understand how I can tweak
>> response_window and zombie_period to make sure that requests that
>> can be serviced by many possible RADIUS home servers do not return
>> an Access-Reject when one of those home servers does not respond.
>
> i.e. you want NO request to fail processing when a home server fails.
>
> This is extremely difficult to do. Any naive approach that has quick
> failover can have other negative side-effects. (Additional network
> traffic, system load, duplicate processing of requests, etc.)
I guess I do not see those as negatives. That is exactly what I want
to happen. RADIUS network traffic is tiny. The system load created
by sending multiple requests to a home server or a bunch of home
servers is minimal. I do not see how that adds any more load than
having the proxy send back an Access-Reject instead, which, in the
best-case scenario, will result in the end-client re-authenticating,
generating yet another request. In the worst-case scenario, the
client accepts the reject as validation that their account cannot be
authorized and presents the wrong result to the end-user (whether that
be a guy sitting on the end of a dial-up line or a piece of system
software trying to determine whether an account is valid). All you
are doing is pushing the retry logic from a machine that knows that
there are multiple possible home servers onto a machine that does
not, via a response that says, effectively, "Do not retry. Your
request is invalid."
Your argument that the RADIUS server cannot handle a retry does not
hold water to me, but regardless, I can envision configurations where
you would want to minimize all processing by the RADIUS proxy itself
(most machines now have far more processing power than a simple RADIUS
proxy can consume, so that is not a common need anymore). I wish the
option were available. There seem to be knobs for a lot of other
things.
>> The client sends a request to the proxy. If a home server does not
>> respond within a short period of time to the request, a second home
>> server is chosen. If the second home server does not respond to the
>> same request, then a third is chosen. This continues until all
>> possible home servers are exhausted. At that point, an
>> Access-Reject packet is sent back to the client. Otherwise, the
>> response from the home server is sent back to the client.
>
> Doing that requires source code mods, because that quick fail-over
> can have negative side-effects. i.e. The server does NOT support
> configurations that can negatively affect its performance.
See my note above for why the work to be done by the server is no
more and no less than just returning a reject once the timeout is
hit. You are either going to process more retries to the home server
or more retries from the NAS. Either way, you are going to increase
your load.
> On top of that, the "try all possible home servers" approach is
> impossible. There is ALSO a 30-second lifetime for the request.
> After 30 seconds, the NAS has given up, so failing over to another
> home server is useless.
>
> On top of that, the NAS will only retry 3-6 times. So if you have 19
> home servers, at *best* it would fail over to 3-6 of them before the
> request is marked "timeout".
Okay, AT BEST you get 3-6 different home servers in a 30-second
period. Right now, AT BEST I get 1. Which method is more resilient?
Which method results in no false rejections being returned to the
NAS? The worst that can happen is that the NAS gets no response,
which is exactly what would happen if the NAS queried that one home
server directly. The proxy can even be smart about it and only retry
a different home server when the NAS retransmits (which I believe it
already does), so if the NAS stops retransmitting because it has
given up, so does the proxy. But please, let the NAS give up first.
The proxy does not know how many times the NAS will retry. I have my
NASes configured to retry for up to 60 seconds, once every 2 seconds.
They will retry 30 times. It is more important to me that
authentication requests succeed, even if they succeed slowly. It
sounds to me like FreeRADIUS is making assumptions about how NASes
should work and, as a result, reducing the flexibility it provides.
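My test client emulates that NAS behavior with radclient, roughly
like this (the secret and attributes are placeholders):

    echo "User-Name = test, User-Password = test" | \
            radclient -r 30 -t 2 proxy.example.com:1812 auth testing123
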
> I sincerely hope you see now that the situation is rather more
> complicated than the simple "try all home servers" statement.
>
>> How do I configure that? It doesn't seem to matter what I set
>> response_window or zombie_period to, once the first home server
>> fails to respond, an Access-Reject (or nothing if I configure a
>> post-proxy handler) is returned to the client. My client's not
>> going to retry the request if he gets an Access-Reject, so I need
>> the proxy to retry it.
>
> That last sentence is nonsense. Once the client gets an
> Access-Reject for *any* reason, it is impossible for the proxy to
> "retry" that request.
*sigh* Exactly. Once the client gets an Access-Reject, the NAS has
told the client that the request is invalid. An end-user querying the
NAS gets an error message. A piece of system software querying the
NAS gets notified that the account is not valid. The implication is
that a retry is futile, even though the account is not actually
invalid. The account is perfectly valid. The proxy just gave up too
soon (and by too soon, I mean "before it tried more than one of its
home servers"). I want the proxy to retry the request to a different
home server precisely to prevent the NAS (and thus the client) from
getting an Access-Reject when it does not have to. This is typically
how load-balancers with failover capability work. They try their best
to make sure individual requests succeed when they can.
> If you want the proxy to fail over, send it more than ONE request
> at a time (like a normal proxying system), and do NOT configure the
> "do not respond" policy.
So my NAS now has to send two separate requests for the same
authentication, and pick the one that does not come back with an
Access-Reject? Which NAS does that? Or are you saying that my
end-client has to refuse to accept the fact that he was rejected and
keep retrying until he either a) gets an accept or b) gets rejected
so many times he accepts it as gospel? Either way, it makes no sense.
Either way, the proxy is creating a retry loop.
Again, I am not arguing that the proxy will not fail over. It will
for subsequent requests. What a fail-over solution will typically do,
though, is fail over even for a given single request, so that all
requests are handled as resiliently as possible. In other words, a
NAS does not need to see a single failed request from the proxy for
the proxy to trigger a failover.
> The proxy WILL fail over, but due to the imperfect nature of the
> universe, some requests MAY time out and get rejected. With a better
> detection algorithm, the number of failures might get smaller than it
> is today, but it is IMPOSSIBLE to get the number down to zero.
To a NAS, there is a big difference between a timeout and a reject.
If it does not get a response, a NAS will typically handle the client
differently than if it gets an explicit rejection. Right now, a
timeout event from the home server results in an explicit rejection
(unless I configure it not to send that reject). It IS possible to
get the number down to zero, because I have used RADIUS software that
does it. The only time it should ever be non-zero is if all home
servers that can possibly be tried in a given window (which might not
be all of them, but is most likely going to be more than one of them)
fail to respond. Like I said, I am trying to migrate to FreeRADIUS
for some other features. I have used two other proprietary RADIUS
server packages that implement this behavior.
> No. RADIUS doesn't work like that. No amount of magic on the proxy
> will cause the NAS to retry forever (which is the only way to have
> the proxy cycle through all home servers for one request). If you
> configure the NAS to retry forever, then all you will do is push
> network failures off to some other part of the network.
Right. Precisely. I want to push the network failure handling to the
proxy, which has the knowledge that there are multiple points of
failure. The NAS does not know that there are 20 possible servers to
respond to it. All it knows is that there is 1 RADIUS server it can
talk to (the proxy) and if the proxy says the request was rejected,
the request is considered rejected. The end-client certainly does not
know what can fail. The proxy knows that there are 20 servers. When
it decides to fail one server out, it KNOWS a) that the proxied
request was not rejected, it just was not responded to by the home
server, and b) that it can try that request on another home server
before it tells the NAS that the request is rejected (the request has
not been rejected, of course, since no home server has responded one
way or the other, and until the proxy responds, the NAS will not know
either way).
I also understand that Access-Challenge can complicate the proxying,
but that is solvable as well with standard state tracking.
> This is how IP connectivity works: Networks are imperfect. There is
> absolutely nothing you can do about that.
I know that networks are imperfect. The answer to that imperfection
is to retry, not to give up. When you tell a NAS that the request has
been rejected when, in fact, it has not, you are not retrying; you
are saying, "Do not retry. You actually got this failed result."
But look, I have gone through the code. Ivan's right: there is no
way to get the behavior I want in FreeRADIUS without either a module
(and I am not sure this is even possible to accomplish via a module,
because proxying is not handled via a module) or hacking the code to
change how proxy no-responses are handled. It just frustrates me that
you challenge the value of this. For people like me who use
FreeRADIUS not to serve dial gear but as a robust authentication
platform for on-network services, where sending a false rejection to
a client is an SLA issue, having a proxy that can robustly and
transparently handle transient network failures is very valuable.
With that, we do not have to reprogram or replace NAS software (some
of which we cannot control) to handle those kinds of transient
network failures for us.
Philip