Possible issue with copy-acct-to-home-server relaying

Chris Moules chris at gms.lu
Sun Mar 15 15:38:19 CET 2009


Hi,

I have been seeing a strange phenomenon over the past week with my
RADIUS system.

Alan, the server is no longer crashing with the patches that you
supplied but it is also not 100% stable.

I am trying to figure out exactly what is going on. I have my
assumptions, but no evidence. The issue is with a
'copy-acct-to-home-server' style setup.

The issue manifests itself as the radiusd (or freeradius) process
eventually no longer responding to incoming packets. The process also
slowly, over hours, uses more and more CPU time. If I try to analyse
the process, it is running and polling the logfile for requests to proxy.

I have noted that when the process is first started in debug mode, it
will poll and sleep, poll and sleep. After proxying for a while the
polling seems to become faster and I don't see the "Waking up in 0.8
seconds." messages any more. What I do see is "Polling for detail file"
messages going up the screen very fast.
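
One rough way to put a number on this change is to count the two kinds
of debug messages in a captured chunk of output. The following is only a
minimal sketch: the function names are hypothetical, and it assumes the
debug output contains the literal strings quoted above ("Polling for
detail file" and "Waking up in"). A healthy poll-and-sleep cycle should
give a ratio close to 1, while the fast-polling state should give a much
larger one.

```python
from collections import Counter

# Assumption: these substrings match the debug messages quoted above.
POLL_MARKER = "Polling for detail file"
SLEEP_MARKER = "Waking up in"


def summarize(lines):
    """Count poll and sleep messages in an iterable of debug-output lines."""
    counts = Counter()
    for line in lines:
        if POLL_MARKER in line:
            counts["poll"] += 1
        elif SLEEP_MARKER in line:
            counts["sleep"] += 1
    return counts


def polls_per_sleep(lines):
    """Return the number of poll messages seen per sleep message.

    Returns infinity when polls occur with no sleeps at all, and 0.0
    when neither message appears.
    """
    counts = summarize(lines)
    if counts["sleep"] == 0:
        return float("inf") if counts["poll"] else 0.0
    return counts["poll"] / counts["sleep"]
```

Feeding it successive slices of a saved debug log (for example, a few
thousand lines at a time) would show whether the ratio climbs as the
process runs.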

In the system's 'normal' state I have one virtual server processing and
writing the log file and another reading it. A couple of times this past
week I have had warnings from the NAS units that the RADIUS server is
not responding any more. Looking at the server shows the system running
but no new requests arriving. All I see the system doing is "Polling for
detail file".

I have now split the system so that I have one instance of the server
doing authentication and local accounting. This is also writing the
proxy log. I have another instance which only 'listens' for the log file
and does the proxying. After a few hours I see this proxy process using
up to 10% of the CPU for up to a minute at a time. It will go back down
to 0% but often hovers at about 6% at the moment. The main freeradius
process that is doing all the work (db lookups / some auth proxying /
logging / etc) is sitting at 0% CPU but using more memory:

From 'top':
--
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11380 freerad   20   0 45700 4124 1732 S    7  0.1   6:52.59 freeradius
11398 freerad   20   0  177m 4416 1628 S    0  0.1   0:02.75 freeradius
--

The last time I was able to observe a lock-up of the server process (NAS
reports no response), the CPU usage for freeradius was 99%.

I am hoping that the splitting of the jobs will keep my system up and
running and also help locate the issue.

Does anyone have any ideas?

Regards

Chris



More information about the Freeradius-Devel mailing list