Re: Behavior of FreeRADIUS auth when SQL backend becomes inaccessible

5 Mar 2014

      Patrick Wagner wrote:
...
However, if we start FR while SQL is running and subsequently shut down the SQL
instance, rlm_sql returns a fail, “SQL query error; rejecting user”, and
FR subsequently sends a REJECT response to any NAS request it receives,
which is not at all the behavior we’d like to see as it means that any
NAS querying this particular FR node will deny all requests instead of
retrying the request with another node.
In 2.2.3, you can use the "do_not_respond" policy.

	sql
	if (fail) {
		do_not_respond
	}
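
  A hedged variant, in case the debug log shows the "authorize" section
stopping as soon as "sql" returns fail, so that the "if (fail)" check is
never reached.  "man unlang" describes per-module return code overrides;
giving "fail" a priority instead of its default action lets processing
continue to the next line.  The instance name "sql" is taken from the
example above.  This is a sketch, not a tested configuration.

	sql {
		# treat fail as an ordinary result so the section keeps going
		fail = 1
	}
	if (fail) {
		# policy that makes the server send no reply at all
		do_not_respond
	}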
...
Actually, FR's current behavior is a bit more irritating to us because
we need to use a custom huntgroup SQL query that we placed in an "update
request" section right before we (try to) query SQL for auth in the
"authorize" section, but instead of two times "fail" we get two
different error codes from the two statements when SQL is unavailable:
- "++[request] returns notfound"
and later
- "++[sql_localhost] returns fail"
That's because the dynamic queries don't have a module failure code.
Fixing that involves serious architectural changes.
...
Do we need to suppress / rewrite both of them? Suppressing the first one
is impossible, I think, because in "update request" apparently FR
doesn't differentiate between a query that was executed but
returned an empty result, and a failed query (because SQL was
unavailable). The invocation of sql_localhost, right below, does
differentiate, as it returns fail instead of notfound.
Yes, because it can't.
...
What is the proper way to allow the NAS clients to fail over to another
FR node altogether instead of getting misleading and in most cases
outright wrong information ("Invalid user" is what FR tells the NAS)
from FR? Can we make FR just not reply to the request at all in these
cases,
See above.

  The default configuration of the server assumes that it's
authoritative for the user.  So if the DB is down, so is FreeRADIUS.
That works for probably 95% of the deployments.

  The "do_not_respond" policy is there for the other deployments.  But
it does require manual configuration.
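
  For reference, the policy itself is short.  The following is a sketch
of how it appears in the stock policy.conf, written from memory; check
the copy shipped with your version rather than relying on this one.

	do_not_respond {
		update control {
			# ask the server core not to send a reply
			Response-Packet-Type := Do-Not-Respond
		}
		# stop processing the current section
		handled
	}

  "Manual configuration" here means listing the policy by name at the
point in a section where the server should stay silent.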
...
And finally, we're forwarding exactly one particular realm to another
RADIUS server outside of our administrative control, and while any
information FR needs to be able to identify these requests as
"to-be-proxied" is configured in plaintext files and thus should
continue to work if SQL fails, requests for this realm also fail as soon
as we shut down SQL, because the explicit REJECT from SQL makes FR not
even proxy the request to the home server before telling the NAS that
the Login request should be denied.
I welcome suggestions for a better way to do things.

  Since you're doing local authentication *and* proxying, you should be
aware that they both run in the same RADIUS server.  The requests also
come from one NAS.  So adding a "do_not_respond" policy to the local auth
policy makes the NAS think that the *entire server* is down.  It then
may not send *any* requests to the server.

  That's why FR defaults to sending a reject.  The NAS thinks that the
server is alive, and will continue to send it requests.  Including
requests which need to be proxied.

  There is *no* way around this problem.  There is *no* solution to it.
 RADIUS simply isn't capable of that fine-grained level of distinction
you need.  If you expect it to be capable of that, you're wrong.
...
Why does FR try to run the query against SQL (i.e. its own authorize
section) at all if it knows from config that it should simply forward
the request (unmodified even, we don't use pre-proxy or post-proxy at
all) and wait for the reply of the home server for this particular realm?
Because that's what you told it to do.  It processes the "authorize"
section from top to bottom.  Read the debug log, this should ALL be clear.

  If you want it to avoid the SQL query when proxying, configure it to
do that:

	authorize {
		realm
		if (updated) {
			handled
		}

		... everything else ...

	}
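
  Put together with the "do_not_respond" check from earlier in this
message, the ordering might look like the sketch below.  It is untested.
"sql_localhost" is the instance name from the quoted debug output, and
the "fail = 1" override is an assumption: it is only needed if the
section otherwise stops before "if (fail)" is evaluated.

	authorize {
		# realm instance, as in the example above
		realm
		if (updated) {
			# request will be proxied; skip all local SQL lookups
			handled
		}

		# local users only from here on; the huntgroup
		# "update request" block would also go below this point
		sql_localhost {
			# assumption: keep going after fail (see "man unlang")
			fail = 1
		}
		if (fail) {
			do_not_respond
		}
	}

  Proxied requests then never touch SQL.  Local requests still get no
reply while SQL is down, with the side effect described above: the NAS
may decide the whole server is dead.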
...
The last issue doesn't occur if we put a redundant {sql_localhost;
handled} block instead of the single "sql_localhost" statement in the
auth section, but I don't know WHY it works
The behavior of the redundant section is documented.  See "man unlang".
...
(it probably causes side
effects we don't want), or rather, I figured that somehow the reseller
request always gets checked against the local SQL database first
Yes, this is obvious from your configuration, and from reading the
debug log.
...
(which it shouldn't
It should, because it's doing what you told it to do.  It's a
computer.  It does what you tell it to do.  It's not magic, where it somehow
does what you *need* rather than what you *say*.
...
Obviously we don't want the proxy requests to ever get checked locally -
this would solve this issue completely.
Then configure the server to do that.  The entire method for *how* to
do this is documented extensively.  How the server works is documented.
 It's also clear from the debug output how the server works.

  That's why we recommend reading the debug output.  It really does help.

  Alan DeKok.