Behavior of FreeRADIUS auth when SQL backend becomes inaccessible

Wed Mar 5 10:39:27 CET 2014

Hello,

we've got a problem with FreeRADIUS / rlm_sql behavior that prevents us from
enjoying a fully redundant RADIUS service, consisting of 3 FreeRADIUS
servers, probably because of compounded mistakes / misunderstandings in our
configuration that I've attached at the bottom of this post.

Setup:

- each of the 3 servers consists of a FR instance and a MySQL instance that
contains all auth data and the radacct table to store acct data

- each FR instance connects to its local SQL instance only; there is no
redundant or load-balance setup. Going that route and having FR try
to connect to the SQL instance on one of the other nodes in case of errors
with the local one would also solve our problem, but we deem it unnecessary
if FR were able to handle a failed SQL connection "properly"

- the SQL nodes are configured as Galera multi-master cluster, so any node
is operating on the same set of data

-> this means every node should work perfectly fine on its own, which is why
we don't think we need to implement the redundancy options that FR offers.

If SQL is unavailable before we start FR, FR refuses to start and exits
immediately after it finds out it cannot connect to the local SQL instance,
with "Instantiation failed for module sql_localhost". This behavior is
perfectly fine because it means that any NAS client sending requests to that
particular FR node will find that the node does not respond, and the client
will retry the request with the other RADIUS servers it knows of and
hopefully, at least one of them will answer.

However, if we start FR while SQL is available and subsequently shut down the
SQL instance, rlm_sql returns fail ("SQL query error; rejecting user") and FR
sends a REJECT response to any NAS request it receives. This is not at all
the behavior we'd like to see, as it means that any NAS querying this
particular FR node will deny all requests instead of retrying the request
with another node. I've seen a post on this list by Alan DeKok suggesting
that "fail = ok" (or = invalid or whatever we should use in this case) is
proper unlang, or at least was proper syntax back in 2007. But with FR 2.1
(the respective version packaged with Ubuntu 10.04 LTS and CentOS 6.5) I was
unable to start FR with such a statement added in either the "authorize" or
"post-auth" section; it fails with "Unknown action 'invalid'" when parsing
the config.
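
For illustration, here is a sketch (not copied from our running config) of the
per-module override form that this parser does accept. It mirrors the
"fail = 1" we already use for detail in the accounting section: as far as I
understand, the right-hand side has to be "return", "reject" or a priority
number rather than another return code. Whether priority 1 is a sensible value
in authorize is an assumption on my part, and I would not expect this on its
own to stop the REJECT.

        sql_localhost {
                fail = 1
        }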

Respective thread which I hoped would give me some pointers:

http://lists.freeradius.org/pipermail/freeradius-users/2007-December/024055.html

Reply by Alan DeKok which I went on to try, unsuccessfully, with the error
message quoted above:

http://lists.freeradius.org/pipermail/freeradius-users/2007-December/024059.html

Actually, FR's current behavior is a bit more irritating to us because we
need to use a custom huntgroup SQL query that we placed in an "update
request" section right before we (try to) query SQL for auth in the
"authorize" section. Instead of "fail" twice, we get two different
return codes from the two statements when SQL is unavailable:

- "++[request] returns notfound"

and later

- "++[sql_localhost] returns fail"

Do we need to suppress / rewrite both of them? Suppressing the first one is
impossible, I think, because in "update request" FR apparently doesn't
differentiate between a query that was executed but returned an
empty result and a query that failed (because SQL was unavailable). The
invocation of sql_localhost, right below, does differentiate, as it returns
fail instead of notfound.

What is the proper way to allow the NAS clients to fail over to another FR
node altogether instead of getting misleading and in most cases outright
wrong information ("Invalid user" is what FR tells the NAS) from FR? Can we
make FR just not reply to the request at all in these cases, or send a
response that signals to the NAS that it should try the FR node next door
instead because this FR node is unable to make any definitive statement?
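
One untested idea for the "don't reply at all" case, sketched under the
assumption that the packaged 2.1 builds understand
Response-Packet-Type := Do-Not-Respond in the control list and accept
"if (fail)" as a test on the previous module's return code (I have not
verified either on our versions):

        sql_localhost {
                fail = 1        # assumed: keep processing instead of returning
        }
        if (fail) {
                update control {
                        Response-Packet-Type := Do-Not-Respond
                }
                handled
        }

The intent is that a failed SQL lookup leaves the NAS without any answer, so
it times out and moves on to its next RADIUS server, while an ordinary
"notfound" is still treated as before.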

And finally, we're forwarding exactly one particular realm to another RADIUS
server outside of our administrative control, and while any information FR
needs to be able to identify these requests as "to-be-proxied" is configured in
plaintext files and thus should continue to work if SQL fails, requests for
this realm also fail as soon as we shut down SQL, because the explicit
REJECT from SQL makes FR not even proxy the request to the home server
before telling the NAS that the Login request should be denied.

Why does FR try to run the query against SQL (i.e. its own authorize
section) at all if it knows from config that it should simply forward the
request (unmodified even, we don't use pre-proxy or post-proxy at all) and
wait for the reply of the home server for this particular realm?

The last issue doesn't occur if we put a redundant {sql_localhost; handled}
block instead of the single "sql_localhost" statement in the authorize
section, but I don't know WHY it works (and it probably causes side effects
we don't want). What I have figured out is that the reseller request always
gets checked against the local SQL database first, whether or not the SQL
connection works. It shouldn't be, or at least it doesn't need to waste any
CPU cycles on that, as it will never find anything in there about the
reseller's customers. A "notfound" from SQL somehow leads FR to finally
proxy the request to the reseller RADIUS server and get a proper answer,
while a "fail" from SQL somehow skips the proxying step and outright denies
the request.

Obviously we don't want the proxy requests to ever get checked locally -
this would solve this issue completely.
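
What we have in mind is roughly the following untested sketch. It assumes
that realmraute (our instance of the realm module) adds Proxy-To-Realm to the
control list when it matches the reseller realm, and that unlang in 2.1
accepts an attribute-existence check written this way. The other modules of
our authorize section are left out for brevity:

        realmraute
        if (!control:Proxy-To-Realm) {
                sql_localhost
        }

With this, requests already marked for proxying would skip the local SQL
lookup, and everything else would be looked up as it is today.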

I've attached the configuration we use currently below. If you need
radius -X output for a particular scenario, let me know.

- Patrick Wagner

__________________________

Configuration of the only sites-enabled/ site document:

authorize {

        update request {

                Huntgroup-Name := "%{sql_localhost:select groupname from radhuntgroup where nasipaddress=\"%{NAS-IP-Address}\"}"

        }

        preprocess

        chap

        mschap

        realmraute # realms are differentiated by #suffix

        eap {

                ok = return

        }

        sql_localhost

# the following allows reseller requests to get proxied to the correct home
# server, but with unknown side effects

#        redundant {

#                sql_localhost

#                handled

#       }

        expiration

        logintime

        pap

}

authenticate {

        Auth-Type PAP {

                pap

        }

        Auth-Type CHAP {

                chap

        }

        Auth-Type MS-CHAP {

                mschap

        }

        unix

        eap

}

preacct {

        preprocess

        acct_unique

        realmraute

}

accounting {

        detail {

                fail = 1

        }

        unix

        radutmp

        sql_localhost

        attr_filter.accounting_response

}

session {

        sql_localhost

}

post-auth {

        exec

        Post-Auth-Type REJECT {

                attr_filter.access_reject

        }

}

pre-proxy {

}

post-proxy {

}

______________________

Configuration of radiusd.conf:

prefix = /usr

exec_prefix = /usr

sysconfdir = /etc

localstatedir = /var

sbindir = ${exec_prefix}/sbin

logdir = /var/log/freeradius

raddbdir = /etc/freeradius

radacctdir = ${logdir}/radacct

name = freeradius

confdir = ${raddbdir}

run_dir = ${localstatedir}/run/${name}

db_dir = ${raddbdir}

libdir = /usr/lib/freeradius

pidfile = ${run_dir}/${name}.pid

user = freerad

group = freerad

max_request_time = 30

cleanup_delay = 5

max_requests = 1024

listen {

        type = auth

        ipaddr = 2.2.2.2

        port = 1645

}

listen {

        ipaddr = 2.2.2.2

        port = 1646

        type = acct

}

listen {

        ipaddr = 192.168.5.2

        port = 1645

        type = auth

}

listen {

        ipaddr = 192.168.5.2

        port = 1646

        type = acct

}

hostname_lookups = no

allow_core_dumps = no

regular_expressions     = yes

extended_expressions    = yes

log {

        destination = files

        file = ${logdir}/radius.log

        syslog_facility = daemon

        stripped_names = no

        auth = no

        auth_badpass = no

        auth_goodpass = no

}

checkrad = ${sbindir}/checkrad

security {

        max_attributes = 200

        reject_delay = 1

        status_server = yes

}

proxy_requests  = yes

$INCLUDE proxy.conf

thread pool {

        start_servers = 5

        max_servers = 50

        min_spare_servers = 3

        max_spare_servers = 10

        max_requests_per_server = 0

}

modules {

        $INCLUDE ${confdir}/modules/

        $INCLUDE eap.conf

        $INCLUDE sql_localhost.conf

        $INCLUDE sql/mysql/counter.conf

}

instantiate {

        exec

        expr

        expiration

        logintime

}

$INCLUDE policy.conf

$INCLUDE sites-enabled/

_________________

Configuration of proxy.conf:

proxy server {

        synchronous = no

        retry_delay = 5

        retry_count = 3

        dead_time = 120

        default_fallback = no

        post_proxy_authorize = no

}

realm resrealm@domain.tld {

       type            = radius

       authhost        = 10.10.10.10:1645

       accthost        = 10.10.10.10:1646

       secret          = respw

       nostrip

}

realm LOCAL {

        type            = radius

        authhost        = LOCAL

        accthost        = LOCAL

}
