Behavior of FreeRADIUS auth when SQL backend becomes inaccessible

Wed Mar 5 11:52:38 CET 2014

Patrick Wagner wrote:
> However, if we start FR while SQL is running and subsequently shut down the SQL
> instance, rlm_sql returns a fail, “SQL query error; rejecting user”, and
> FR subsequently sends a REJECT response to any NAS request it receives,
> which is not at all the behavior we’d like to see as it means that any
> NAS querying this particular FR node will deny all requests instead of
> retrying the request with another node.

  In 2.2.3, you can use the "do_not_respond" policy.

	sql
	if (fail) {
		do_not_respond

	}

> Actually, FR's current behavior is a bit more irritating to us because
> we need to use a custom huntgroup SQL query that we placed in an "update
> request" section right before we (try to) query SQL for auth in the
> "authorize" section, but instead of two times "fail" we get two
> different error codes from the two statements when SQL is unavailable:
>
> - "++[request] returns notfound"
> 
> and later
> 
> - "++[sql_localhost] returns fail"

  That's because the dynamic queries don't have a module failure code.
Fixing that involves serious architectural changes.

> Do we need to suppress / rewrite both of them? Suppressing the first one
> is impossible, I think, because in "update request" apparently FR
> doesn't differentiate between a query that was executed but
> returned an empty result, and a failed query (because SQL was
> unavailable). The invocation of sql_localhost, right below, does
> differentiate, as it returns fail instead of notfound.

  Yes.  It doesn't differentiate, because it can't.

> What is the proper way to allow the NAS clients to fail over to another
> FR node altogether instead of getting misleading and in most cases
> outright wrong information ("Invalid user" is what FR tells the NAS)
> from FR? Can we make FR just not reply to the request at all in these
> cases,

  See above.

  The default configuration of the server assumes that it's
authoritative for the user.  So if the DB is down, so is FreeRADIUS.
That works for probably 95% of the deployments.

  The "do_not_respond" policy is there for the other deployments.  But
it does require manual configuration.
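
  For reference, the policy is defined in raddb/policy.conf.  As a
sketch of what it looks like in 2.2.x (check the copy shipped with your
version rather than relying on this):

	do_not_respond {
		#  Tell the server not to send any reply packet.
		update control {
			Response-Packet-Type := Do-Not-Respond
		}

		#  Stop processing the current section.
		handled
	}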

> And finally, we're forwarding exactly one particular realm to another
> RADIUS server outside of our administrative control, and while any
> information FR needs to be able to identify these requests as
> "to-be-proxied" is configured in plaintext files and thus should
> continue to work if SQL fails, requests for this realm also fail as soon
> as we shut down SQL, because the explicit REJECT from SQL makes FR not
> even proxy the request to the home server before telling the NAS that
> the Login request should be denied.

  I welcome suggestions for a better way to do things.

  Since you're doing local authentication *and* proxying, you should be
aware that they both run in the same RADIUS server.  The requests also
come from one NAS.  So adding a "do_not_respond" policy to the local
auth policy makes the NAS think that the *entire server* is down.  It
may then not send *any* requests to the server.

  That's why FR defaults to sending a reject.  The NAS thinks that the
server is alive, and will continue to send it requests.  Including
requests which need to be proxied.

  There is *no* way around this problem.  There is *no* solution to it.
RADIUS simply isn't capable of the fine-grained level of distinction
you need.  If you expect it to be capable of that, you're wrong.

> Why does FR try to run the query against SQL (i.e. its own authorize
> section) at all if it knows from config that it should simply forward
> the request (unmodified even, we don't use pre-proxy or post-proxy at
> all) and wait for the reply of the home server for this particular realm?

  Because that's what you told it to do.  It processes the "authorize"
section from top to bottom.  Read the debug log; this should ALL be clear.

  If you want it to avoid the SQL query when proxying, configure it to
do that:

	authorize {
		realm
		if (updated) {
			handled
		}

		... everything else ...

	}
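
  Putting that together with the "sql" check from earlier in this
message, a minimal sketch might look like the following.  The names are
assumptions: "realm" stands for whichever realm module instance you use
(the default configuration calls it "suffix"), and "sql_localhost" is
the instance name from your debug output.

	authorize {
		#  Realm lookup; per your description this needs no SQL.
		realm
		if (updated) {
			#  Marked for proxying: skip the local checks.
			handled
		}

		#  Only non-proxied requests get this far.
		sql_localhost
		if (fail) {
			#  SQL is unreachable: send no reply at all.
			do_not_respond
		}
	}

  The ordering is the point: a request that is going to be proxied
leaves the section before the SQL module is ever called.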

> The last issue doesn't occur if we put a redundant {sql_localhost;
> handled} block instead of the single "sql_localhost" statement in the
> auth section, but I don't know WHY it works

  The behavior of the redundant section is documented.  See "man unlang".
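
  As a rough sketch of what your block does (this is my reading of the
documented behavior; "man unlang" is the authority):

	redundant {
		#  Tried first.  Any result other than a failure
		#  ends the block with that result.
		sql_localhost

		#  Reached only if sql_localhost failed.  Ends the
		#  block with "handled" instead of "fail".
		handled
	}

  So with SQL down, the block returns "handled", and the "fail" that
leads to the reject never surfaces.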

> (it probably causes side
> effects we don't want), or rather, I figured that somehow the reseller
> request always gets checked against the local SQL database first 

  Yes, this is obvious from your configuration, and from reading the
debug log.

> (which it shouldn't

  It should, because it's doing what you told it to do.  It's a
computer.  It does what you tell it to do.  It's not magic, where it
somehow does what you *need* rather than what you *say*.

> Obviously we don't want the proxy requests to ever get checked locally -
> this would solve this issue completely.

  Then configure the server to do that.  The entire method for *how* to
do this is documented extensively.  How the server works is documented.
It's also clear from the debug output how the server works.

  That's why we recommend reading the debug output.  It really does help.

  Alan DeKok.