Behavior of FreeRADIUS auth when SQL backend becomes inaccessible

Wed Mar 5 11:21:11 CET 2014

> Setup:
>  
> - each of the 3 servers consists of an FR instance and a MySQL instance that contains all auth data and the radacct table to store acct data
> - each FR instance connects to its local SQL instance only; no redundant or load-balance setup is required. Having FR try the SQL instance on one of the other nodes when the local one fails would also solve our problem, but we deem it unnecessary if FR is able to handle a failed SQL connection "properly"

It can.

> - the SQL nodes are configured as Galera multi-master cluster, so any node is operating on the same set of data
> -> this means every node should work perfectly fine on its own, which is why we don’t think we need to implement the redundancy options that FR offers.

Ok, sure.

> If SQL is unavailable before we start FR, FR refuses to start and exits immediately after it finds out it cannot connect to the local SQL instance, with "Instantiation failed for module sql_localhost". This behavior is perfectly fine because it means that any NAS client sending requests to that particular FR node will find that the node does not respond, and the client will retry the request with the other RADIUS servers it knows of and hopefully, at least one of them will answer.
>  
> However, if we start FR while SQL is available and subsequently shut down the SQL instance, rlm_sql returns a fail, “SQL query error; rejecting user”, and FR subsequently sends a REJECT response to any NAS request it receives,

redundant {
	sql
	do_not_respond
}
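
As a minimal sketch of where that block would go, assuming your SQL module instance is named sql_localhost (as in the error message you quoted) and that your policy file defines do_not_respond:

authorize {
	# ... modules that do not depend on SQL ...

	# If sql_localhost returns fail, the next entry in the
	# redundant block is tried; do_not_respond then marks the
	# request so that no reply is sent to the NAS.
	redundant {
		sql_localhost
		do_not_respond
	}
}

The NAS then sees a timeout rather than an Access-Reject and can retry against another server.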

> Actually, FR's current behavior is a bit more irritating to us: we need to use a custom huntgroup SQL query, which we placed in an "update request" section right before we (try to) query SQL for auth in the "authorize" section, and when SQL is unavailable the two statements return two different error codes instead of "fail" twice

Post debug output.

> What is the proper way to allow the NAS clients to fail over to another FR node altogether instead of getting misleading and in most cases outright wrong information ("Invalid user" is what FR tells the NAS) from FR? Can we make FR just not reply to the request at all in these cases, or send a request that signals to the NAS that it should try the FR node next door instead because this FR node is unable to make any definitive statement?

Use the 'do_not_respond' policy, i.e. do not respond at all. You can also use Status-Server messages, if your vendor has sensibly ignored the RFC 2865 advice against 'keep-alives'.
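
For reference, the do_not_respond policy in the stock configuration looks roughly like the following. This is shown as a sketch, so check your own policy.conf or policy.d/ for the exact text; newer versions prefix attribute names with '&'.

do_not_respond {
	# Tell the server core not to send any reply for this request
	update control {
		Response-Packet-Type := Do-Not-Respond
	}

	# Stop processing the current section
	handled
}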

>  
> And finally, we're forwarding exactly one particular realm to another RADIUS server outside of our administrative control, and while any information FR needs to be able to identify these requests as "to-be-proxied" is configured in plaintext files and thus should continue to work if SQL fails, requests for this realm also fail as soon as we shut down SQL, because the explicit REJECT from SQL makes FR not even proxy the request to the home server before telling the NAS that the Login request should be denied.
> Why does FR try to run the query against SQL (i.e. its own authorize section) at all if it knows from config that it should simply forward the request (unmodified even, we don't use pre-proxy or post-proxy at all) and wait for the reply of the home server for this particular realm?

Because policies that run later may cancel proxying or rewrite the proxy destination.

> The last issue doesn't occur if we put a redundant {sql_localhost; handled} block instead of the single "sql_localhost" statement in the auth section, but I don't know WHY it works

man unlang.

> (it probably causes side effects we don't want), or rather, I figured that somehow the reseller request always gets checked against the local SQL database first (which it shouldn't or at least doesn't need to waste any CPU cycles on as it will never find anything in there about the reseller's customers), no matter whether the SQL connection works or not, but somehow a "notfound" from SQL leads FR to finally proxy the request to the reseller RADIUS server and get a proper answer, while a "fail" from SQL somehow skips the proxying step and outright denies the request.

Because the server's ability to make correct policy decisions has been compromised by the failure of the SQL database. If nothing has rewritten the failure code by the time the request leaves the current section, FreeRADIUS will skip subsequent sections (other than perhaps post-auth) and just return an Access-Reject.
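
One way to rewrite the failure code yourself, sketched here with the sql_localhost instance name from your mail, is to give 'fail' a priority for that one module call and then test for it explicitly (see man unlang):

authorize {
	# By default 'fail' makes the section return immediately.
	# Giving it a priority instead lets processing continue.
	sql_localhost {
		fail = 1
	}

	# True if the previous module returned fail
	if (fail) {
		do_not_respond
	}
}

A 'notfound' from SQL is unaffected by this and is handled as before.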

-Arran

Arran Cudbard-Bell <a.cudbardb at freeradius.org>
FreeRADIUS Development Team

FD31 3077 42EC 7FCD 32FE 5EE2 56CF 27F9 30A8 CAA2
