Question about how to find unfinished requests
Phil Mayers
p.mayers at imperial.ac.uk
Tue Nov 12 11:26:39 CET 2013
On 11/11/13 23:09, John Douglass wrote:
> I am seeing a number of these messages:
>
> Nov 11 17:57:31 dvlanc radiusd[17781]: Received conflicting packet from
> client resnet4-WiSM-A port 32770 - ID: 18 due to unfinished request
> 928878. Giving up on old request.
Yeah, lots of UK universities have run into this. In brief, it seems
that many academic sites running Cisco lightweight wireless, but also a
couple of non-Cisco sites, have started to experience this and related
issues since the start of the academic year 2013. I suspect changes in
client behaviour, client numbers/mix and also on the Cisco firmware
side, are all contributing to it having suddenly appeared.
It seems to be related to "peak movement" times - start/end of lectures
and so forth - and to a massive spike in auth load as clients roam
between APs. The problem is basically that above a certain load, auths
suddenly aren't answered quickly enough, and the whole system goes into
this spiral of doom where retransmits and reauths dominate.
In our case, it was fork/exec of ntlm_auth being slow, and moving to a
faster box helped a *lot*.
You should also:
1. Ensure you are running Samba 3.6
2. Look into the "winbind max domain connections" parameter
3. Consider pinning yourself to the fastest / a dedicated DC
> How do I view "into the beast" to find the start of these requests? For
> example, I see no way to locate the stream of events for request 928878
> so that I can narrow down where the problem is. My gut says it's located
> down the chain at the AD server (accessed via the radius protocol)
> taking too long but I have to prove it before I can get some action.
I have spent a lot of time on this, and here's what I found helpful:
1. Start a long-running tcpdump of both your radius and MSRPC traffic
(i.e. the connection from Winbind->DC):
tcpdump -s 0 -w long.cap -U port 1812 or microsoft-ds
2. At the same time, start logging how long ntlm_auth takes - see below.
3. At the same time, start running "vmstat 1" piped to a timestamped
file, so you monitor system load with high granularity. Alternatively
any other high-resolution - 10 seconds at most - monitoring of
user/sys/wait/idle CPU, IO and context switch load.
4. When you get a "spike" find out where in the chain - radius,
ntlm_auth, MSRPC - the spike actually got to, and proceed from there.
In our case, we were seeing the spike on radius and ntlm_auth times, but
not MSRPC, which told us it was local. A closer examination of vmstat
output strongly suggested kernel load of fork/exec. A faster box with a
newer Linux kernel does not have the same issues, even under
significantly higher load.
A few tricks needed here to process this data.
To get the ntlm_auth timings, at the moment you'll need a wrapper. I
wrote a quick one in C which basically does clock_getime,
fork/exec/wait, clock_gettime again the logs the result and returns the
child exit status. I'll try to post the code later today.
To get the radius timings I wrote a python script (again will try to
post) which logs them in realtime, but it should be relatively easy to
use an approach similar the the MSRPC one below.
To get the MSRPC timings, you can use wireshark to dump the RPC PDU
header (body is encrypted) and match request/responses for on-the-wire
speeds. Simple pipeline for it (the awk script assumes one TCP flow,
modify as appropriate):
tshark -n -r $CAP -R dcerpc -T fields \
-e ip.src -e ip.dst -e frame.time_epoch \
-e dcerpc.pkt_type -e dcerpc.cn_call_id >msrpc.log
awk '
$4==0 {
rq[$5]=$3
next
}
{
d=$3-rq[$5]
del rq[$5]
msec=$3 % 1 * 1000000
print strftime("%Y-%m-%dT%H:%M:%S", $3) "." msec, $5, d*1000
}' msrpc.log >msrpc.dat
I'll try to write some of this up somewhere more useful than an email...
In the meantime, if you'd like me to expand more on this, let me know.
Cheers,
Phil
More information about the Freeradius-Users
mailing list