System frequently stops responding...

Wed Jul 29 02:04:39 CEST 2015

My radius server receives traffic from cisco wireless controllers.. and
does one of two things: Authenticate requests for our local users, via our
LDAP servers, and forward non local users to eduroam servers.

What I notice is that the cisco controllers log, every now and then, that
requests (for user X, with request-id Y) timed out.

The radius server, now 3.0.8, is running in full debug mode, under strace.
along with continuous tcpdump. (strace radiusd and tcpdump restarting once
per hour to minimize the log size per file).

Now, when the cisco logs time outs... I can find the specific request-ids
in the pcap. It has been resent many times in the course of 30 seconds or
so.  In the debug output, I notice the requests being handled (received)
late, way after cisco had re-transmitted many times and timedout.

I noticed that these timeout event seem to coicide pefectly with 30 second
gaps in the radiusd log, like so:

Mon Jul 27 21:54:39 2015 : Debug: (31451)   } # Post-Auth-Type REJECT =
updated
Mon Jul 27 21:54:39 2015 : Debug: (31451) Sent Access-Reject Id 250 from
10.212.11.14:1812 to 10.128.254.52:32770 length 0
*Mon Jul 27 21:54:39 2015 :* Debug: (31451) Finished request
*Mon Jul 27 21:55:09 2015 :* Debug: PING: Zombie period is over for home
server eduroam1
Mon Jul 27 21:55:09 2015 : Proxy: Marking home server 64.57.22.74 port 1812
as dead.

The time between the two bold lines is when cisco logged many timeouts.

I also notice frequent errors about eduroam servers having been marked
down, failed to reply... not sure if this is actually the root cause...

So, am wondering... what could cause radiusd to freeze for 30 seconds? if
that is what's happening here... how do I dig more if that is not the case?

Strace seems to show the same missing 30 seconds (though this log is from
different occurrence):

18916 08:38:42.909134 write(3, "Tue Jul 28 08:38:42 2015 : Debug"..., 59) =
59 <0.000011>
18916 *08:38:42.909176* wait4(-1, NULL, WNOHANG, NULL) = -1 ECHILD (No
child processes) <*0.000019*>
18916 *08:39:12.500099 *write(3, "Tue Jul 28 08:39:12 2015 : Debug"..., 87)
= 87 <0.000017>
18916 08:39:12.500222 write(3, "Tue Jul 28 08:39:12 2015 : Proxy"..., 85) =
85 <0.000011>

Note that the cisco logs are not always about local users, or always about
non local users... but both. radius server just seems to go completely
unresponsive for a while....

Thanks a lot,
Mohamed.

On Tue, Jul 28, 2015 at 3:55 PM, <A.L.M.Buxey at lboro.ac.uk> wrote:

> Hi,
> > Can slowness of a remote server (eduroam in this case)  cause this
> behavior
> > am observing here? where the whole radiusd seems to freeze up for 30
> > seconds?
> > I've been looking at my LDAP server logs, and do not find signs of any
> > issues, no requests seem to have taken more than a second (no sub second
> > timestamps in the logs)....
>
> for remote servers that you may proxy to, use 'Status-Server' - that way
> the server knows
> if its alive/responding rather than unable to deal with a request...
>
> if you think its remote proxy...are you running in full debug mode..and
> what do you see happenign on the wire (eg with tcpdump)
> when this happens?
>
> alan
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html
>