Question about how to find unfinished requests
Phil Mayers
p.mayers at imperial.ac.uk
Wed Nov 13 09:35:42 CET 2013
On 12/11/13 20:17, Alan DeKok wrote:
>> In our case, it was fork/exec of ntlm_auth being slow, and moving to a
>> faster box helped a *lot*.
>
> fork/exec shouldn't be that bad... the # of outstanding ntlm_programs
> is limited, but only by the total # of threads. So maybe the issue is
> past fork/exec, and into ntlm_auth / winbind?
Sure; I am inferring that it was fork/exec, based on the massive surge
in user/sys/iowait and run-queue/cswitch during "events", but it's
entirely possible all that CPU was being burnt inside the Samba stack,
and that in Samba 3.6 it's not.
It's on my TODO list to go back to the old servers with a bunch of test
traffic then reproduce it, and *then* change one thing at a time.
Unfortunately we were a week into massive instability on our wireless at
that point, and extremely pointed questions were being asked, so I moved
to our new servers - therefore, new RHEL version & kernel, new Samba,
faster disk, more RAM, more and faster CPUs - any or a combination could
be the cause.
One thing whilst I think about it - people should note that Windows
processes these RPCs differently to LDAP/Kerberos traffic, specifically
inside a small (10 on win2012, 2 elsewhere) thread pool. Google
"MaxConcurrentApi" for details. In our case, the RPC timings proved it
wasn't an issue, but people should check it, and resize the AD thread
pool if needed.
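As a rough way to see whether a small pool could be the bottleneck, here is
a back-of-the-envelope sketch. It assumes each RPC holds one pool slot for
its whole duration, so the ceiling is pool size divided by RPC time. The
function name and the 50 ms figure are made up for illustration; plug in
your own measured timings.

```python
def max_auths_per_second(pool_size: int, rpc_seconds: float) -> float:
    """Upper bound on RPCs/sec if each call occupies one pool slot throughout."""
    if pool_size <= 0 or rpc_seconds <= 0:
        raise ValueError("pool_size and rpc_seconds must be positive")
    return pool_size / rpc_seconds


if __name__ == "__main__":
    # 0.05 s per RPC is an assumed figure, not a measurement from this thread.
    for pool in (2, 10):
        print(pool, "slots:", max_auths_per_second(pool, 0.05), "auths/sec at most")
```

If your peak authentication rate is well under that ceiling, the pool size is
probably not what's hurting you.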
> Maybe the changes in 2.2.3 will help here. If the child takes more
> than 1 second, you're better off giving up on the request.
Yes, I should have mentioned that. It's worth noting that people without
2.2.3 could, in the meantime, use the "timeout" utility from coreutils as
a wrapper.
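The coreutils syntax is "timeout DURATION COMMAND [ARG]...", so the wrapper
just goes in front of whatever command line you already run. To show the
give-up-after-a-deadline behaviour itself, here is a minimal Python sketch;
run_with_deadline is a hypothetical helper, not anything in FreeRADIUS, and
the slow child is simulated with a sleep.

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds=1.0):
    """Run cmd; return (returncode, stdout), or None if it overran the deadline.

    subprocess.run() kills the child itself when the timeout expires.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None  # give up on this request rather than keep a thread tied up
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in for a slow helper program: sleeps far longer than the deadline.
    slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(slow_child, seconds=0.5))
```

The point is the same as with "timeout": a stuck child costs you at most the
deadline, not an open-ended wait.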
This reminds me of *another* issue that I failed to mention, and someone
pointed out to me off-list - the Cisco WLC/WISM apparently use a single
UDP socket for all radius requests to a single server - auth and acct -
and thus there's a 255-packet limit for in-progress requests. If the WLC
reaches that limit, it just starts re-using IDs aggressively, instead of
opening another socket, which is nice - if the server is in the middle of
processing a request whose ID gets re-used, it still burns the work it's
currently doing, and the result is never used.
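To make the ID re-use problem concrete, here is a toy model. It assumes only
that the RADIUS Identifier field is one octet (RFC 2865), so a single source
socket has at most 256 distinct IDs, which is roughly the limit described
above. The class and its behaviour are a hypothetical simplification, not how
the WLC is actually implemented.

```python
ID_SPACE = 256  # one-octet RADIUS Identifier: values 0..255


class SingleSocketClient:
    """Toy NAS that sends every request from one UDP socket (hypothetical)."""

    def __init__(self):
        self.in_flight = {}  # packet ID -> label of the request still awaiting a reply
        self.next_id = 0
        self.wasted = []     # requests whose ID was re-used before they were answered

    def send(self, label):
        pkt_id = self.next_id
        self.next_id = (self.next_id + 1) % ID_SPACE
        if pkt_id in self.in_flight:
            # ID still outstanding: the earlier request's result can no longer
            # be matched, so any server work on it is thrown away.
            self.wasted.append(self.in_flight[pkt_id])
        self.in_flight[pkt_id] = label
        return pkt_id

    def reply(self, pkt_id):
        """Return the request this reply is matched to, or None if unknown."""
        return self.in_flight.pop(pkt_id, None)


if __name__ == "__main__":
    nas = SingleSocketClient()
    for i in range(ID_SPACE + 3):  # three more requests than there are IDs
        nas.send(f"req{i}")
    print("wasted:", nas.wasted)
```

With slow upstreams (e.g. proxied eduroam traffic) requests stay in flight
longer, so the ID space fills at a much lower request rate.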
I can't prove it, but I suspect that was an issue during the largest
spikes. It is certainly a problem if you run an eduroam server, where
proxied traffic can have very large RTTs. They're apparently going to
"improve" this in 7.6 - there will be a separate UDP socket for
auth/acct! Wow, thanks Cisco!!