FATAL! Server is too busy to process requests

Thu Feb 16 18:07:28 CET 2006

"Mitchell, Michael J" <Michael.Mitchell at team.telstra.com> wrote:
> I'm at a bit of a loss. I'm currently trying to load test the
> authentication proxy performance of FreeRADIUS 1.0.1 in preparation for
> a deployment this weekend.
> 
> Unfortunately, I'm running into this error "Error: FATAL!  Server is too
> busy to process requests".

  Either the server is overloaded, or the back-end databases are too
slow.

> Interestingly, this error doesn't seem to occur when the OpenLDAP server
> is running on a different server. However, the rate of requests that I
> can push through the server is also a lot lower in this circumstance
> (about 25%).

  It's not about total rate, it's about per-request time.
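
  To make the point concrete: the number of requests in flight at any
moment is roughly the arrival rate multiplied by the time each request
takes (Little's law).  The sketch below is only an illustration; the
function names and the worker count are hypothetical and are not taken
from the FreeRADIUS source.

```c
#include <assert.h>

/* Average number of requests in flight, by Little's law:
 * in_flight = arrival_rate * time_per_request. */
double requests_in_flight(double requests_per_sec, double seconds_per_request)
{
    assert(requests_per_sec >= 0.0 && seconds_per_request >= 0.0);
    return requests_per_sec * seconds_per_request;
}

/* Returns 1 if a (hypothetical) pool of `workers` threads can hold the
 * average number of in-flight requests, 0 otherwise. */
int pool_keeps_up(double requests_per_sec, double seconds_per_request, int workers)
{
    return requests_in_flight(requests_per_sec, seconds_per_request)
           <= (double)workers;
}
```

  For example, 200 requests/s at 10 ms each keeps about 2 requests in
flight.  The same rate at 500 ms each keeps about 100 in flight, which
is more than a pool of 32 workers could hold.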

> The CPU (according to prstat) doesn't need to be at 100% for this to
> occur either. However, when it does occur, radiusd is typically using
> all or close to all of one of the CPUs.

  If the back-end database is slow, CPU usage doesn't matter.  The
requests are waiting on the database, not on the CPU.

> It also doesn't happen when I run the server with -xx. Presumably this
> is because the extra output slows the server down enough that it's
> not hitting whatever barrier is causing this.

  It's because the incoming packets aren't put into a queue to
process.  Instead, the server reads them one at a time.  If there are
too many in the kernel's UDP queue, the kernel drops the packets.

  FreeRADIUS has an internal queue used when threading.  The purpose
is to be able to catch temporary spikes in traffic without dropping
packets.
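
  As a rough illustration of that idea, here is a minimal fixed-size
queue that absorbs a burst until it is full and then rejects further
entries.  This is not the FreeRADIUS implementation; the names, the
capacity, and the use of plain integers in place of request structures
are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4  /* hypothetical; tiny so the behaviour is easy to see */

typedef struct {
    int    items[QUEUE_CAPACITY]; /* stand-in for queued requests */
    size_t head;                  /* index of the oldest entry */
    size_t count;                 /* number of entries currently queued */
} request_queue;

void queue_init(request_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Add a request. Returns 1 on success, 0 if the queue is full,
 * in which case the caller has no choice but to drop the packet. */
int queue_push(request_queue *q, int request_id)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY) return 0;
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = request_id;
    q->count++;
    return 1;
}

/* Remove the oldest request. Returns 1 on success, 0 if the queue is empty. */
int queue_pop(request_queue *q, int *request_id)
{
    assert(q != NULL && request_id != NULL);
    if (q->count == 0) return 0;
    *request_id = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 1;
}
```

  A burst of up to QUEUE_CAPACITY requests is held until workers pop
them.  Only when the workers fall behind for longer than the queue can
cover does queue_push() start returning 0.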

> Any advice or help is really appreciated at this stage. What might be
> the cause of (*request)->child_pid != NO_SUCH_CHILD_PID in
> request_dequeue? Anything I should look at, or tune to reduce the
> likelihood of this occurring?

  That check is there to work around a race condition in the server.

  A different work-around is to move the "send proxy packet"
(i.e. rad_send) from src/main/proxy.c to src/main/threads.c, in the
function request_handler_thread(), inside of the mutex lock, where it
says "update the active threads", just before it does active_threads--.

  You'll have to put a few additional checks there, but it may work.

> It seems that I can also resolve the issue (at least for the same
> request rate) by looping at the "select" in request_dequeue 20 times
> instead of 10.
> 
> What risk does this present?

  Not much.  The work-around described above would be better.

> I then get errors like:
> 
> Fri Feb 17 03:10:54 2006 : Error: Dropping conflicting packet from
> client dbst1:63628 - ID: 198 due to unfinished request 44357
> 
> Which is better (to me) than the server stopping. ;-)

  Not really.  It means that the NAS has given up on that request, and
the server should therefore stop processing it immediately, as the NAS
won't care about the response.
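
  For illustration, the difference between a retransmission and a
conflicting packet might be sketched as follows.  The sketch assumes
that a packet matches a pending request when the source address, source
port and ID are the same, and that the 16-byte authenticator then tells
a retransmission (identical) from a new request re-using the ID
(different).  The struct and function names are hypothetical, not the
ones in the server source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { PKT_NEW, PKT_DUPLICATE, PKT_CONFLICT } pkt_verdict;

/* Hypothetical record of a request the server is still working on. */
typedef struct {
    uint32_t src_ip;
    uint16_t src_port;
    uint8_t  id;                 /* RADIUS packet identifier */
    uint8_t  authenticator[16];  /* request authenticator */
    int      finished;           /* non-zero once a reply has been sent */
} pending_request;

/* Classify an incoming packet against one pending request. */
pkt_verdict classify_packet(const pending_request *pending,
                            uint32_t src_ip, uint16_t src_port, uint8_t id,
                            const uint8_t authenticator[16])
{
    assert(pending != NULL && authenticator != NULL);

    /* Different source or ID, or the old request is done: no clash. */
    if (pending->finished || pending->src_ip != src_ip ||
        pending->src_port != src_port || pending->id != id) {
        return PKT_NEW;
    }

    /* Same authenticator: the NAS is retransmitting the same request. */
    if (memcmp(pending->authenticator, authenticator, 16) == 0) {
        return PKT_DUPLICATE;
    }

    /* Same ID, different authenticator: the NAS has re-used the ID,
     * so it is no longer waiting for the old request. */
    return PKT_CONFLICT;
}
```

  A PKT_CONFLICT result corresponds to the situation in the log message
above: the old request is still unfinished, but nobody is waiting for
its answer any more.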

  Alan DeKok.