Robust Authentication Proxying

Sat Jul 11 09:14:18 CEST 2009

Philip Molter wrote:
> Yes, this is the configuration I'm currently running, and it's not
> working for me.  I have a radclient sending a request, retrying 10 times
> on a 5-second timer, and after 10 retries, it still hasn't gotten a
> response.  After the second retry, the proxy has marked the server as at
> least a zombie and started status-checks, but every retransmit after
> that is getting a cached result of no response.

  Could you possibly try READING my messages?

  The default configuration does NOT include the "do_not_respond"
policy.  *YOU* are the one who configured that, as I have said multiple
times.

  If you don't want it to get the cached "do not respond" policy, then

	DON'T CONFIGURE IT

  It's that easy.

> This is what I want to happen
> 
> client req ->  proxy
>                proxy req ->  home server #1
> client ret ->  proxy
>                proxy ret ->  home server #1
>               [proxy fails home server #1 for lack of response]
> client ret ->  proxy
>                proxy req ->  home server #2
>                proxy <- resp home server #2
> client <- resp proxy

  It does that (mostly).  But only if you don't break the server.

> This is what happens without a post-proxy config:
> 
> client req ->  proxy
>                proxy req -> home server #1
> client ret ->  proxy
>                proxy ret -> home server #1
>               [proxy fails home server #1 for lack of response]
> client  <- rej proxy

  That happens for the most part because you played with the
configuration to make the proxy timeouts super-short.  As I said, don't
do that.

  And as I also said, it takes time to determine that a home server is
dead.  During this time, the request MAY time out.  When a request times
out, the NAS has likely given up on it, so failing over to another home
server is useless.

> My config is not marking any request as failed.  If I do not configure
> anything for Post-Proxy-Type, I get back an Access-Reject right when the
> first home server fails.  There is no failover.  The comments in
> proxy.conf make that clear:

  Yes... that's for ONE request.  Most proxies handle more than one
request during any 30-60 second period.  The OTHER requests will fail
over to other home servers, so long as they are still within their
individual lifetime.
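
  The point above can be illustrated with a toy timing model.  This is
only a sketch, not how the server is implemented: the function name, the
single "detect_s" detection time, and the fixed NAS retransmit schedule
are all assumptions made for illustration.  The 30 second lifetime is the
one mentioned in this thread.

```python
def request_outcome(arrival_s, detect_s, nas_attempts, retry_interval_s,
                    lifetime_s=30):
    """Toy model of one request while home server #1 is silently down.

    arrival_s        -- when the NAS first sends the request
    detect_s         -- (assumed) time at which the proxy has decided that
                        home server #1 is not responding
    nas_attempts     -- total transmissions the NAS makes for the request
    retry_interval_s -- seconds between NAS transmissions
    lifetime_s       -- how long the proxy keeps the request alive
    """
    for attempt in range(nas_attempts):
        sent_at = arrival_s + attempt * retry_interval_s
        # Once the request's lifetime is over, retransmits no longer help.
        if sent_at - arrival_s >= lifetime_s:
            break
        # A transmission seen after detection can go to another home server.
        if sent_at >= detect_s:
            return "answered"
    return "timed out"
```

  With a hypothetical detection time of 20 seconds and a NAS that sends 4
times at 5 second intervals, request_outcome(0, 20, 4, 5) gives "timed
out", while a request that arrives 10 seconds later,
request_outcome(10, 20, 4, 5), gives "answered".  The first request is
lost; the later ones fail over.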

> In other words, if the server the load-balance solution happens to
> choose doesn't respond to my request, tough luck.

  And I explained why, and how this happens.  Did you read those
explanations?

>  I might have 19 other
> servers configured that are up, the request I just sent is getting an
> Access-Reject.

  Again... because it takes TIME to determine that the home server is
down... and during that time, the request times out.

>  The Post-Proxy-Type is just a hack to at least not send
> back an Access-Reject which breaks the whole process.

  Yes... because the "do not respond" policy tells it DO NOT RESPOND.
If you want the server to respond, don't configure the DO NOT RESPOND.

  That much should be pretty obvious.

> Okay, so I obviously do not understand how I can tweak response_window
> and zombie_period to make sure that requests that can be serviced by
> many possible RADIUS home servers do not return an Access-Reject when
> one of those home servers does not respond.

  i.e. you want NO request to fail processing when a home server fails.

  This is extremely difficult to do.  Any naive approach that has quick
failover can have other negative side-effects.  (Additional network
traffic, system load, duplicate processing of requests, etc.)

> The client sends a request to the proxy.  If a home server does not
> respond within a short period of time to the request, a second home
> server is chosen.  If the second home server does not respond to the
> same request, then a third is chosen.  This continues until all possible
> home servers are exhausted.  At that point, an Access-Reject packet is
> sent back to the client.  Otherwise, the response from the home server
> is sent back to the client.

  Doing that requires source code mods, because that quick fail-over can
have negative side-effects.  i.e. the server does NOT support
configurations that can negatively affect its performance.

  On top of that, the "try all possible home servers" approach is impossible.
There is ALSO a 30 second lifetime for the request.  After 30 seconds,
the NAS has given up, so failing over to another home server is useless.

  On top of that, the NAS will only retry 3-6 times.  So if you have 19
home servers, at *best* it would fail over to 3-6 of them, before the
request is marked "timeout".

  I sincerely hope you see now that the situation is rather more
complicated than the simple "try all home servers" statement.
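
  The arithmetic behind that limit can be sketched as follows.  This is
a best-case bound under stated assumptions, not server code: the function
and parameter names are hypothetical, and it assumes every transmission
from the NAS that arrives within the request lifetime is sent to a
different home server.

```python
def max_home_servers_tried(num_home_servers, nas_attempts, retry_interval_s,
                           lifetime_s=30):
    """Best-case number of home servers a single request can reach.

    nas_attempts is the total number of times the NAS sends the request;
    transmissions are assumed to happen at t = 0, interval, 2*interval, ...
    """
    # Count only the transmissions sent before the request lifetime expires.
    usable = sum(1 for attempt in range(nas_attempts)
                 if attempt * retry_interval_s < lifetime_s)
    # The request cannot reach more home servers than exist.
    return min(num_home_servers, usable)
```

  For 19 home servers and a NAS that sends 3 to 6 times at 5 second
intervals, max_home_servers_tried(19, 3, 5) is 3 and
max_home_servers_tried(19, 6, 5) is 6.  Sending more often does not help
past the lifetime: max_home_servers_tried(19, 11, 5) is still 6.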

> How do I configure that?  It doesn't seem to matter what I set
> response_window or zombie_period to, once the first home server fails to
> respond, an Access-Reject (or nothing if I configure a post-proxy
> handler) is returned to the client.  My client's not going to retry the
> request if he gets an Access-Reject, so I need the proxy to retry it.

  That last sentence is nonsense.  Once the client gets an Access-Reject
for *any* reason, it is impossible for the proxy to "retry" that request.

  If you want the proxy to fail over, send it more than ONE request at a
time (like a normal proxying system), and do NOT configure the "do not
respond" policy.

  The proxy WILL fail over, but due to the imperfect nature of the
universe, some requests MAY time out and get rejected.  With a better
detection algorithm, the number of failures might get smaller than it is
today, but it is IMPOSSIBLE to get the number down to zero.

> Is that possible?

  No.  RADIUS doesn't work like that.  No amount of magic on the proxy
will cause the NAS to retry forever (which is the only way to have the
proxy cycle through all home servers for one request).  If you configure
the NAS to retry forever, then all you will do is push network failures
off to some other part of the network.

  This is how IP connectivity works: Networks are imperfect.  There is
absolutely nothing you can do about that.

  Alan DeKok.