Question about how to find unfinished requests

Tue Nov 12 11:26:39 CET 2013

On 11/11/13 23:09, John Douglass wrote:

> I am seeing a number of these messages:
>
> Nov 11 17:57:31 dvlanc radiusd[17781]: Received conflicting packet from
> client resnet4-WiSM-A port 32770 - ID: 18 due to unfinished request
> 928878.  Giving up on old request.

Yeah, lots of UK universities have run into this. In brief, it seems 
that many academic sites running Cisco lightweight wireless, but also a 
couple of non-Cisco sites, have started to experience this and related 
issues since the start of the academic year 2013. I suspect changes in 
client behaviour, client numbers/mix and also on the Cisco firmware 
side, are all contributing to it having suddenly appeared.

It seems to be related to "peak movement" times - start/end of lectures 
and so forth - and to a massive spike in auth load as clients roam 
between APs. The problem is basically that above a certain load, auths 
suddenly aren't answered quickly enough, and the whole system goes into 
this spiral of doom where retransmits and reauths dominate.

In our case, it was fork/exec of ntlm_auth being slow, and moving to a 
faster box helped a *lot*.

You should also:

  1. Ensure you are running Samba 3.6
  2. Look into the "winbind max domain connections" parameter
  3. Consider pinning yourself to the fastest / a dedicated DC

> How do I view "into the beast" to find the start of these requests? For
> example, I see no way to locate the stream of events for request 928878
> so that I can narrow down where the problem is. My gut says it's located
> down the chain at the AD server (accessed via the radius protocol)
> taking too long but I have to prove it before I can get some action.

I have spent a lot of time on this, and here's what I found helpful:

  1. Start a long-running tcpdump of both your radius and MSRPC traffic 
(i.e. the connection from Winbind->DC):

   tcpdump -s 0 -w long.cap -U port 1812 or microsoft-ds

  2. At the same time, start logging how long ntlm_auth takes - see below.

  3. At the same time, start running "vmstat 1" piped to a timestamped 
file, so you monitor system load with high granularity. Alternatively 
any other high-resolution - 10 seconds at most - monitoring of 
user/sys/wait/idle CPU, IO and context switch load.

  4. When you get a "spike" find out where in the chain - radius, 
ntlm_auth, MSRPC - the spike actually got to, and proceed from there.

In our case, we were seeing the spike on radius and ntlm_auth times, but 
not MSRPC, which told us it was local. A closer examination of vmstat 
output strongly suggested kernel load of fork/exec. A faster box with a 
newer Linux kernel does not have the same issues, even under 
significantly higher load.

A few tricks needed here to process this data.

To get the ntlm_auth timings, at the moment you'll need a wrapper. I 
wrote a quick one in C which basically does clock_getime, 
fork/exec/wait, clock_gettime again the logs the result and returns the 
child exit status. I'll try to post the code later today.

To get the radius timings I wrote a python script (again will try to 
post) which logs them in realtime, but it should be relatively easy to 
use an approach similar the the MSRPC one below.

To get the MSRPC timings, you can use wireshark to dump the RPC PDU 
header (body is encrypted) and match request/responses for on-the-wire 
speeds. Simple pipeline for it (the awk script assumes one TCP flow, 
modify as appropriate):

   tshark -n -r $CAP -R dcerpc -T fields \
     -e ip.src -e ip.dst -e frame.time_epoch \
     -e dcerpc.pkt_type -e dcerpc.cn_call_id >msrpc.log

   awk '
   $4==0 {
     rq[$5]=$3
     next
   }
   {
     d=$3-rq[$5]
     del rq[$5]
     msec=$3 % 1 * 1000000
     print strftime("%Y-%m-%dT%H:%M:%S", $3) "." msec, $5, d*1000
   }' msrpc.log >msrpc.dat

I'll try to write some of this up somewhere more useful than an email... 
In the meantime, if you'd like me to expand more on this, let me know.

Cheers,
Phil