Question about how to find unfinished requests

Tue Nov 12 21:17:30 CET 2013

Phil Mayers wrote:
> It seems to be related to "peak movement" times - start/end of lectures
> and so forth - and to a massive spike in auth load as clients roam
> between APs. The problem is basically that above a certain load, auths
> suddenly aren't answered quickly enough, and the whole system goes into
> this spiral of doom where retransmits and reauths dominate.

  The issue is that when ntlm_auth is blocked, the entire server melts
down.  This is the same as when an SQL DB blocks the server, too.

> In our case, it was fork/exec of ntlm_auth being slow, and moving to a
> faster box helped a *lot*.

  fork/exec shouldn't be that bad... the # of outstanding ntlm_auth
programs is limited, but only by the total # of threads.  So maybe the
issue is past fork/exec, and into ntlm_auth / winbind?

> In our case, we were seeing the spike on radius and ntlm_auth times, but
> not MSRPC, which told us it was local. A closer examination of vmstat
> output strongly suggested kernel load of fork/exec. A faster box with a
> newer Linux kernel does not have the same issues, even under
> significantly higher load.

  Weird, but OK.

> A few tricks needed here to process this data.
> 
> To get the ntlm_auth timings, at the moment you'll need a wrapper. I
> wrote a quick one in C which basically does clock_gettime,
> fork/exec/wait, clock_gettime again, then logs the result and returns
> the child exit status. I'll try to post the code later today.

  Maybe the changes in 2.2.3 will help here.  If the child takes more
than 1 second, you're better off giving up on the request.

  Alan DeKok.