Robust Authentication Proxying

Sun Jul 12 08:58:52 CEST 2009

Philip Molter wrote:
> I apologize I was not more specific.  The retransmits kept getting sent
> to the same failed home server rather than the failed home server being
> marked dead and the retransmits going to a different home server.  I
> have figured out why.  The minimum zombie_period is 20, hard-coded in
> realms.c.  The zombie_period of 5 you recommended which I tried was not
> taking effect, which led to my 20-second test timeout kicking in before
> the proxy had waited long enough to actually mark the server as dead
> (The 5th retransmit would have triggered the failover, but the proxy
> only got 3 retransmits).

  OK.  So making the zombie period shorter would have made it fail over
sooner.  That's fine.
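
  A rough sketch of the timing described above, in Python.  This is a
simplified model and not the actual logic in realms.c or event.c: it
assumes the home server is marked dead once response_window plus
zombie_period seconds pass without a reply, and that configured values
below the minimums are silently raised.  The function names and the
enforce_minimums flag are hypothetical; the minimums of 5 and 20 are the
ones quoted in this thread.

```python
# Minimums as quoted in the thread (response_window 5, zombie_period 20).
MIN_RESPONSE_WINDOW = 5   # seconds
MIN_ZOMBIE_PERIOD = 20    # seconds


def seconds_until_dead(response_window, zombie_period, enforce_minimums=True):
    """Simplified model: silence for response_window makes the server a
    zombie, and a further zombie_period of silence makes it dead."""
    if enforce_minimums:
        # Assumption: values below the minimum are raised to the minimum.
        response_window = max(response_window, MIN_RESPONSE_WINDOW)
        zombie_period = max(zombie_period, MIN_ZOMBIE_PERIOD)
    return response_window + zombie_period


def fails_over_before(client_timeout, response_window, zombie_period,
                      enforce_minimums=True):
    """True if the server would be marked dead before the client gives up."""
    dead_after = seconds_until_dead(response_window, zombie_period,
                                    enforce_minimums)
    return dead_after < client_timeout


if __name__ == "__main__":
    # zombie_period = 5 as tried in the thread; response_window = 5 is assumed.
    print(fails_over_before(20, 5, 5, enforce_minimums=True))
    print(fails_over_before(20, 5, 5, enforce_minimums=False))
```

  Under these assumptions, a configured zombie_period of 5 is treated as
20 while the minimum is enforced, so the server is not marked dead within
a 20 second client timeout.  With the minimum lifted, it is.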

> And that does exactly what I want for this case.  I can provide a patch
> that does the following things:
> 
> a) allows lower values than 5 for response_window and 20 for
> zombie_period (I will not change recommendations)

  That's fine.  People are free to destroy their own systems if they
don't follow the recommendations.

> b) makes the post_proxy_fail_handler optional on a pool-by-pool basis

  If the early "reject" is wrong, it might be best to just delete it.
Sites with a small number of home servers will still run the
post_proxy_fail_handler, just a little bit later than they do now.

> Does that seem acceptable?  You seem hesitant to accept a solution that
> you do not think could be used for more than a few people.  This
> solution is going to be minimally invasive to the code.

  It seems fine.

> Also, is there a config with which the retransmit proxy failover code
> could actually be triggered without the patch?  I cannot see it. 

  For the original request that started this?  No.  For the other
requests, see other calls to home_server_ldb(), around line 2580 of event.c.

> Failover only happens after the response_window is exceeded, and if the
> response_window is exceeded, the original request is replied to with an
> Access-Reject message, which means any retransmits will never reach
> the REQUEST_PROXIED state in received_retransmits() after the
> response_window is exceeded.  Am I reading that correctly?

  Yes... but the suggested patch *deletes* the code that makes the
request fail once "response_window" is exceeded.  So... that request
should fail over to another home server, if zombie_period is set low enough.
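
  The retransmit behaviour being discussed can be sketched in Python as
follows.  This is an illustration only and is not how home_server_ldb()
is written: the HomeServer class, the state names, and both functions are
hypothetical, and the sketch assumes a fail-over style pool where a
retransmit stays on its current home server unless that server has been
marked dead.

```python
from dataclasses import dataclass

# Hypothetical state labels for this sketch.
ALIVE, ZOMBIE, DEAD = "alive", "zombie", "dead"


@dataclass
class HomeServer:
    name: str
    state: str = ALIVE


def pick_home_server(pool):
    """Return the first server in pool order that is not dead, or None."""
    for server in pool:
        if server.state != DEAD:
            return server
    return None


def handle_retransmit(pool, current):
    """Keep proxying to the current server unless it is dead; if it is
    dead, fail over to the next usable server in the pool."""
    if current.state != DEAD:
        return current
    return pick_home_server(pool)


if __name__ == "__main__":
    pool = [HomeServer("primary", DEAD), HomeServer("secondary")]
    chosen = handle_retransmit(pool, pool[0])
    print(chosen.name if chosen else "no live home server")
```

  In this model the retransmit only moves once the first server is dead,
which is why the time needed to mark it dead has to be shorter than the
time the NAS keeps retransmitting.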

> You had the retransmit failover code already written.  It seems not much
> needs to be done to allow a pool configuration to continue on after the
> response_window has been exceeded.  Let me submit a patch and you see
> what you think.

  OK.

> Well, there's nothing in the RADIUS specification that describes or even
> recommends how a lack of response must be handled by the NAS.

  The RADIUS specifications are missing a whole lot of things.  Like
"interim updates should be sent from the same source IP as the
start/stop packets".  Obvious, right?  Well... there are products that
violate this.

>  You make
> it sound as if the NAS is doing something illegal by using a previous
> cached accept.  It's not.

  I said it's "outside of RADIUS".  It does not follow the RADIUS
operational model, which is that the RADIUS server authenticates users.
 If the NAS caches credentials, it's authenticating users via a
non-RADIUS method.

>  The NAS can implement whatever logic it
> wants, and that particular feature is one that leads to a better user
> experience.  Just because you think a failure-to-contact is the same as
> a denial does not mean that other vendors have not come up with
> solutions that can work around it.

  It's a "fail-safe" security practice.  The alternative is to take the
RADIUS server down, and then what?  Does the NAS let everyone on the
network?  What happens if their credentials have been cached, but they
haven't paid their bills?

  The work-around you're talking about is *very* site-specific.  ISPs
and telcos would go crazy if NAS vendors implemented it.  They could
lose a lot of money...

> RFC 2607 is clear that the proxy should not respond to the client unless
> it receives a reply from the home server.  At the very least, returning
> a rejection is not an accurate portrayal of the state of the
> authentication.  It would be a better representation to just let it
> time out, but I understand returning the rejection so that the NAS can
> short-circuit the transaction more quickly.

  Not returning a response also makes the NAS think that the proxy is
down, when it's not.  See the status-server draft for a discussion of
this issue, and a solution:

http://tools.ietf.org/internet-drafts/draft-ietf-radext-status-server
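
  For illustration, a minimal Python sketch of what a Status-Server probe
looks like on the wire.  It follows the packet layout in RFC 5997, the
published form of that draft (code 12, a random Request Authenticator,
and a Message-Authenticator attribute computed with HMAC-MD5 over the
packet with that attribute zeroed).  The function name and the example
secret are hypothetical, and this is not code from any RADIUS server.

```python
import hashlib
import hmac
import os
import struct

STATUS_SERVER = 12          # RADIUS packet code for Status-Server
MESSAGE_AUTHENTICATOR = 80  # attribute type, always 18 octets long


def build_status_server(secret: bytes, identifier: int) -> bytes:
    """Build a Status-Server packet carrying only Message-Authenticator."""
    request_authenticator = os.urandom(16)
    attr_header = struct.pack("!BB", MESSAGE_AUTHENTICATOR, 18)
    length = 20 + 18  # 20-octet RADIUS header plus one 18-octet attribute
    header = struct.pack("!BBH", STATUS_SERVER, identifier, length)
    header += request_authenticator
    # The HMAC is computed with the attribute value set to 16 zero octets.
    zeroed = header + attr_header + b"\x00" * 16
    digest = hmac.new(secret, zeroed, hashlib.md5).digest()
    return header + attr_header + digest


if __name__ == "__main__":
    packet = build_status_server(b"testing123", identifier=1)
    print(len(packet), packet[0])
```

  A proxy that answers such a probe shows the NAS that it is still alive
even when the home servers behind it are not responding.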

  Yes, I'm not just a random opinionated guy on the net.  I have 4-5
RADIUS specifications either published, or on track to be published.

  Alan DeKok.