Access-Reject (code 3) on an otherwise valid Accept packet / delayed reject when SQL module fails in post-auth

Fri Oct 28 16:52:20 CEST 2011

Jason Antman wrote:

  Ah... a long, detailed message with useful information.  Very nice.

> I'm have a FreeRADIUS 2.1.10 server on CentOS... with a backend MySQL
> database (logging) which is, as Alan would say, Broken - at peak usage
> times it's horribly overloaded. Up until yesterday, our post-auth
> section had a "sql" call not wrapped in a "redundant { sql / ok }".  It
> seems that when the MySQL server gets overloaded, and the sql module
> fails/times out (can't find a configuration value for timeout, or a note
> on what the timeout actually is), the post-auth section stops and
> returns a Reject packet. Since it stops before attr_filter, it's
> returning a non-RFC-compliant REJECT..

  Hmm... you mean an Access-Reject with a bunch of attributes?

> For the short-term, I wrapped the sql calls in post-auth and
> post-auth-type reject in redundant, followed by "ok", so that users will
> not get rejected because of failed logging (administrative/managerial
> decision, not mine) and gave up SQL post-auth logging in favor of not
> locking out users when the DB is struggling.

  Yeah... it might be better to use something like rsyslog, which will
put syslog messages into a DB.  If the DB is busy, it just buffers the
messages to disk.

> Using the test described
> below, this appears to solve the problem of the improperly formed REJECT
> packets, but it still ends up with lots of "There are no DB handles to
> use! skipped 0, tried to connect 0" and then lots of "Info: WARNING
> Child is hung for request XXXX in component post-auth module sql."
> lines... I can only assume that these are a non-critical issue (i.e. the
> child is hung but will recover), but 1) I would like to set a much lower
> database timeout, i.e. have the sql module fail sooner and move on, and
> 2) as the "Login (OK|incorrect)" messages don't have the request number,
> it's difficult to associate these with a given client.

  OK.  There is no "database timeout" config.  Or if there is one...
it's buried in the MySQL documentation where I can't find it.

> So a few questions:
> 1) The logging for this is flat-out wrong. When the SQL module fails in
> this way in post-auth (because of timeout), I'm still seeing "Auth:
> Login OK" in radius.log. I guess this could be "a matter of opinion",
> but the fact that I'm seeing a log message for OK but getting a reject
> out on the wire, seems like a bug.

  It's a timing bug.  The server core says "Login OK" because the
authenticate section succeeded.  If something fails after that... it's a
separate issue.

  The solution is to move the "Login OK" message to a module, and put it
as the last thing in post-auth.  I don't think it's hard, it's just that
no one has done it.

> IMHO, the logging of ok/incorrect
> should either a) be delayed until JUST before the packet goes out,
> beyond where an Accept could turn into a Reject, or b) if/when an Accept
> packet is turned into a Reject and sent out on the wire, there should be
> an explicit log message stating such (as opposed to just the current
> "Sending delayed reject" which appears only in debug logs).

  Well... feel free to submit a patch.

> 2) Is it the expected behavior that if a module in post-auth fails, the
> packet will be changed to a reject without passing through
> Post-Auth-Type REJECT, therefore without passing through attr_filter,
> therefore sending out a non-RFC-compliant Access-Reject packet?

  For now... yes.  The server doesn't re-process a section based on
problems.

> 3) As far as I can tell, this entire issue does *not* happen when
> running "radiusd -fXx"... in that case, radiusd will simply hang
> indefinitely until the DB recovers (or the table is unlocked, in
> testing). Does this sound right?

  Yup.  It's single threaded.  If the DB never returns, the server never
returns.

> Does anyone know if rlm_sql_mysql will
> behave differently in this way in debug mode as opposed to normal daemon
> mode?

  It won't.

> And does it seem correct that in debug mode, there doesn't seem to
> be a timeout for mysql

  There's no timeout for MySQL *ever*.

  In threaded mode, the server core notices that a thread is blocked and
kills it.  But that has nothing to do with MySQL.  The same check is
done for "exec wait" programs that never return.

> 4) As an example of log output, see the lines below. I know that the
> 15:19:30 "Login OK" for client mc4k-121 port 8195 actually left as an
> Access-Reject packet (code 3), which still had User-Name, Callback-Id,
> Session-Timeout, Termination-Action, and two VSAs set, as the proper
> Accept for that user should. In certain instances with this problem, the
> SQL logging was queued somewhere (presumably on the MySQL side, waiting
> for the unlock) and actually made it to the DB, listed as an Accept
> packet and all, after radiusd sent out the incompliant Reject.

  Yup....

> However, of concern in the logs below is the fact that 3 of those Login
> OK lines (outer tunnel) were actually the invalid Access-Reject packets
> on the wire, but there's no indication of that here. While there is a
> vague "There are no DB handles to use" message, that doesn't specify
> which request it relates to (hence no way to tell which Login OK was
> actually rejected), nor does it explicitly say that the request will be
> rejected.

  Configure a "linelog" module to write "login OK" as the last entry of
post-auth.  Make it so that the server core *doesn't* do authlog.

> So to summarize what I'm asking for/about:
> 1) In instances of a "delayed reject" caused by a failure in post-auth,
> would it be possible to have this clearly logged to radius.log, either
> instead of "Login OK", or something along the lines of "WARNING -
> delayed reject of previous request (request id, username, Calling
> Station Id, client, port)"?

  Maybe.  It will likely need source code changes.

> 2) Is there any way to maintain attr_filter.access_reject even with a
> delayed reject because of module failure in post-auth, so that these
> packets are still RFC-compliant?

  Sure... source code changes.  :(

> 3) Is there any way to configure the rlm_sql_mysql timeout, or whatever
> is causing this module to fail when the table is locked?

  Nope.  Or, look at the MySQL client API documentation.  If there's a
timeout, we can poke rlm_sql_mysql to use it.

> 4) Can I get some clarification on the "There are no DB handles to use!
> skipped 0, tried to connect 0" and "Info: WARNING Child is hung for
> request XXXX in component post-auth module sql" radius.log messages, and
> the severity of these (i.e. does either of them definitively mean that a
> request was rejected?)

  Nope.  They mean what they say.  A slow child might suddenly come back
&& send an Access-Accept.  A system might have fail-over configured, so
that a DB problem fails over to another DB, or to another module.

  Alan DeKok.