Question about how to find unfinished requests

Phil Mayers p.mayers at imperial.ac.uk
Wed Nov 13 09:35:42 CET 2013


On 12/11/13 20:17, Alan DeKok wrote:

>> In our case, it was fork/exec of ntlm_auth being slow, and moving to a
>> faster box helped a *lot*.
>
>    fork/exec shouldn't be that bad... the # of outstanding ntlm_programs
> is limited, but only by the total # of threads.  So maybe the issue is
> past fork/exec, and into ntlm_auth / winbind?

Sure; I am inferring that it was fork/exec, based on the massive surge 
in user/sys/iowait and run-queue/cswitch during "events", but it's 
entirely possible all that CPU was being burnt inside the Samba stack, 
and that in Samba 3.6 it's not.

It's on my TODO list to go back to the old servers with a bunch of test 
traffic then reproduce it, and *then* change one thing at a time.

Unfortunately we were a week into massive instability on our wireless at 
that point, and extremely pointed questions were being asked, so I moved 
to our new servers - therefore, new RHEL version & kernel, new Samba, 
faster disk, more RAM, more and faster CPUs - any or a combination could 
be the cause.

One thing whilst I think about it - people should note that Windows 
processes these RPCs differently to LDAP/Kerberos traffic, specifically 
inside a small (10 on win2012, 2 elsewhere) thread pool. Google 
"MaxConcurrentApi" for details. In our case, the RPC timings proved it 
wasn't an issue, but people should check it, and resize the AD thread 
pool if needed.

>    Maybe the changes in 2.2.3 will help here.  If the child takes more
> than 1 second, you're better off giving up on the request.

Yes, I should have mentioned that. It's worth noting that people without 
2.2.3 could, in the meantime, use the "timeout" utility from coreutils as 
a wrapper.
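
For reference, the coreutils form is "timeout DURATION COMMAND [ARG]...". 
The same give-up-after-a-deadline idea can be sketched in Python. This is 
only an illustration: run_with_deadline is a made-up name, and the slow 
child here is a stand-in Python one-liner rather than ntlm_auth.

```python
import subprocess
import sys


def run_with_deadline(argv, deadline_s=1.0):
    """Run argv; return (returncode, stdout), or (None, "") if it overran."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=deadline_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so no stray process is left behind.
        return None, ""
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in for a slow helper program (assumption, not ntlm_auth itself).
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(slow, deadline_s=0.5))
```

The caller treats a None return code as "give up on this request" rather 
than waiting for a result nobody will use.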

This reminds me of *another* issue that I failed to mention, and someone 
pointed out to me off-list - the Cisco WLC/WISM apparently use a single 
UDP socket for all RADIUS requests to a single server - auth and acct - 
and thus there's a 255-packet limit for in-progress requests. If the WLC 
reaches that limit, it just starts re-using IDs aggressively, instead of 
opening another socket, which is nice - if you're in the middle of 
processing a conflicted request, you still burn the work you're 
currently doing, and the result is never used.
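
To make the ID-reuse problem concrete, here is a toy model. It assumes 
the one-octet RADIUS Identifier field (256 values per socket) and a 
client that hands out IDs round-robin without waiting for replies; the 
real WLC allocation policy is not known to me, and IdPool is a made-up 
name.

```python
class IdPool:
    """In-flight RADIUS Identifiers for one client socket (toy model)."""

    SPACE = 256  # one-octet Identifier field

    def __init__(self):
        self.in_flight = {}  # id -> label of the request using it
        self.next_id = 0

    def send(self, label):
        """Allocate the next ID; return (id, label of clobbered request)."""
        rid = self.next_id
        self.next_id = (self.next_id + 1) % self.SPACE
        clobbered = self.in_flight.get(rid)  # None if the ID was free
        self.in_flight[rid] = label
        return rid, clobbered

    def reply(self, rid):
        """A reply arrived, so the ID is free again."""
        self.in_flight.pop(rid, None)


if __name__ == "__main__":
    pool = IdPool()
    # 300 requests sent with no replies yet: everything past 256 conflicts.
    lost = [pool.send(f"req{i}")[1] for i in range(300)]
    print(sum(label is not None for label in lost), "requests clobbered")
```

Every clobbered request is one the server may still be working on, with 
a result that can no longer be matched to anything.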

I can't prove it, but I suspect that was an issue during the largest 
spikes. It is certainly a problem if you run an eduroam server, where 
proxied traffic can have very large RTTs. They're apparently going to 
"improve" this in 7.6 - there will be a separate UDP socket for 
auth/acct! Wow, thanks Cisco!!
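
A rough way to see why large RTTs hurt: by Little's law, requests in 
flight = request rate x latency, so a fixed ID space caps the rate one 
socket can sustain. The sketch below assumes 256 IDs per socket (one-octet 
Identifier); the RTT values are illustrative, not measurements, and 
max_request_rate is a made-up name.

```python
def max_request_rate(rtt_seconds, id_space=256):
    """Highest request rate one socket can sustain without reusing IDs."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    # Little's law: in_flight = rate * latency  =>  rate <= id_space / latency
    return id_space / rtt_seconds


if __name__ == "__main__":
    for rtt in (0.05, 0.5, 2.0):  # illustrative RTTs in seconds
        rate = max_request_rate(rtt)
        print(f"RTT {rtt}s -> at most {rate:.0f} requests/s per socket")
```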

