Intermittent failures of mod_krb5

Alan DeKok aland at deployingradius.com
Fri Feb 23 14:44:48 CET 2018


On Feb 23, 2018, at 7:43 AM, Brian Candler <b.candler at pobox.com> wrote:
> I have been setting up freeradius-3.0.15 (from PPA) under Ubuntu 16.04, and using mod_krb5 to authenticate against Samba4.  There are four freeradius servers, each with a Samba replica behind it.
> 
> Once every few days, I get alerts from nagios saying one or more RADIUS servers has stopped working for a few minutes, and then they recover. There are two nagios servers making RADIUS checks across all four RADIUS servers, and both nagios servers see the problem simultaneously.
> 
> I now have authentication logging turn on, and at the time of the last event I see:
> 
> ... all OK
> Thu Feb 22 17:21:51 2018 : Auth: (273) Login OK: [nagiostest] (from client ix_nagios port 0)
> Thu Feb 22 17:22:38 2018 : Auth: (274) Login OK: [nagiostest] (from client wrn_mon2 port 0)
> Thu Feb 22 17:23:00 2018 : Error: Unresponsive child for request 275, in component authenticate module krb5
> Thu Feb 22 17:23:10 2018 : WARNING: (275) WARNING: Module rlm_krb5 became unblocked

  That means that the krb5 module was blocked in an IO call.  i.e. rlm_krb5 made a call to a kerberos library, which took a LONG time to respond.

> It seems to me that the module has unexpectedly become blocked (not unblocked!)

  The code catches the case where the request is running, but notices that something else set the "stop processing" flag.  This means that something else determined that the module was blocked, and told the module to stop.  Then at a *later* time, the module was unblocked, and was able to continue.

>  But does this simply mean that the module is taking longer to respond than expected?

  Yes.  Where "longer" means 30s or more...

  If you want to narrow it down, you'll have to instrument the krb5 module.  Poke at it so that it records the current time before doing a krb5 API call, and then looks at the time after it does that call.  If the difference is more than a second, that's the problem.

  This is one of the situations where we can't really do much.  We hand program execution off to a third-party library, and hope for the best.  If that library blocks the request for 30s, there is very little we can do in the server to make *that library* run faster.

  As for your scripts running fine, well... they must be doing something different.  That's all I can guess.

  Alan DeKok.




More information about the Freeradius-Users mailing list