FreeRadius hanging

Phil Mayers p.mayers at
Mon Oct 29 13:17:02 CET 2007


We've had sporadic problems with our Wireless radius service hanging.

The occurrences tended to be spaced weeks apart, and consist of clusters
of hangs 4-12 hours apart over a few days. I had formed the hypothesis
that a particular client or type of client was triggering it - when they
realised they could never authenticate (because unknown to them, they
were hanging the server) they went away.

These problems generally manifest in one of two ways:

 * the process hangs (and is unkillable except by -9) - most common
 * the process CPU runs to 100% - less common

The service basically does PEAP/MS-CHAP with the ntlm_auth helper and
rlm_sql_postgresql; the database does not seem to be the issue. We are
not HUPing the server.

We were running on 1.1.6 and I was holding off reporting the issue until
we had time to move to 1.1.7; however the issue has started to occur
again and seems to be longer-lived this time.

I've upgraded to 1.1.7 and it's still happening.

I have managed to get a capture, and observe what the last packets were
before the server stopped responding; sure enough, it was in the middle
of a PEAP conversation. The last packet(s) the server manages to get out
contain the (PEAP-fragmented) SSL server hello/certificate/hello done.
The client then responds with the EAP/PEAP/TLS client key exchange/change
cipher spec/encrypted handshake, to which it never gets a response.

The underlying OS is RHEL4 with OpenSSL 0.9.7a plus RHEL backports.

I haven't yet managed to capture a backtrace from a hung 1.1.7 but I got
a few from 1.1.6; nothing particular stands out. I'll wait until it
happens again and get a definitive one from the 1.1.7 binary.
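
For reference, a minimal sketch of one way such a backtrace might be
grabbed from a hung process, assuming gdb is installed and the PID of
the hung server is known. The binary path /usr/sbin/radiusd and the
helper names are assumptions for illustration only, and older gdb
builds may lack -ex and need a command file passed with -x instead.

```python
import subprocess


def gdb_backtrace_command(pid, binary="/usr/sbin/radiusd"):
    """Build a gdb command line that attaches to `pid` and dumps
    a backtrace of every thread, then exits (detaching).

    `binary` is an assumed install path; adjust to the local build.
    """
    return [
        "gdb",
        "-batch",                      # run the commands, then quit
        "-ex", "thread apply all bt",  # backtrace for all threads
        binary,
        str(pid),                      # attach to the running process
    ]


def capture_backtrace(pid, binary="/usr/sbin/radiusd", timeout=60):
    """Run gdb against the hung process and return its text output."""
    result = subprocess.run(
        gdb_backtrace_command(pid, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling capture_backtrace(pid) while the process is hung, and before
sending it -9, should show where each thread is blocked, provided the
binary was built with debugging symbols.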

Does anyone have any thoughts on what I might do to narrow the problem
down?

