FreeRADIUS can't make progress under certain load

Sat Sep 10 19:36:32 CEST 2011

On 09/10/2011 10:20 PM, Arran Cudbard-Bell wrote:
> Sat Sep 10 13:23:16 2011 : Auth: Login OK: [5702018] (from client 
> 10.10.70.100 port 0)
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.98 port 1645 - ID: 131 due to unfinished request 2814. Giving up on old request.
>>>> Sat Sep 10 13:23:16 2011 : Auth: Login OK: [5026706] (from client 10.10.70.38 port 0)
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.93 port 1646 - ID: 117 due to unfinished request 1036. Giving up on old request.
>>>> Sat Sep 10 13:23:16 2011 : Auth: Login OK: [4925140] (from client 10.10.70.93 port 0)
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.60 port 1645 - ID: 109 due to unfinished request 3370. Giving up on old request.
>>>> Sat Sep 10 13:23:16 2011 : Auth: Login OK: [4977364] (from client 10.10.70.38 port 0)
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.93 port 1646 - ID: 118 due to unfinished request 1040. Giving up on old request.
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.60 port 1645 - ID: 110 due to unfinished request 3373. Giving up on old request.
>>>> Sat Sep 10 13:23:16 2011 : Auth: Login OK: [4529464] (from client 10.10.70.28 port 0)
>>>> Sat Sep 10 13:23:16 2011 : Error: Received conflicting packet from client 10.10.70.98 port 1645 - ID: 132 due to unfinished request 2829. Giving up on old request.
>>>>
>>>> and ad infinitum. The duplicate requests come from PPPoE clients after they fail to receive a response within 5 seconds or so.
>>> Yes its retransmitting the packets because it assumed they were lost in transit.
>>>
>>>> The problem can then be solved by restarting the daemon once or twice and watching the errors go away. The fact that dropping all current work helped gave me this idea: wouldn't it be nice if FreeRADIUS dropped both the old _and_ the new request at the time it logged that "giving up" message?
>>> No... Because then it'd never respond to either request. In a normal network (not under overload condition) if two packets were not lost but delayed, then both reach the server at around the same time, the server would end up dropping both requests ...and there's a bunch of other reasons depending on the deployment scenario.
>>>
>> Don't you think 5 second delay is pretty unusual for a LAN.
> Yes. But not everyone configures a 5 second retransmit period. Not everyone is trying to send RADIUS packets over a pure ethernet network with low latency, and even with ethernet, things like spanning-tree topology changes can have really weird effects on duplicate packets, even if the protocol tries to minimise it.
>
>> Yes, UDP acct packets sometimes get dropped and never reach the server, but the delay... It's much safer to sometimes drop 2 auth requests in a row for the reason you stated than allow the server to reach a point where it's unable to make further progress doing _current_ work.
> Why are you processing auth requests that you know cannot complete? The server should be configured such that the number of requests assigns to its threads is smaller than the number of requests that will cause your backend AAA systems to have a processing latency greater than your NAS' retransmit period.
>
> You're trying to fix the wrong thing. The server core is fine. Go fix the thread pool count.
>
Interesting. max_servers was 50 before today, I ramped it up to 100 
after today's accident.  A total of 24 CPU cores here.
CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz (2661.34-MHz 
K8-class CPU)
The disks are a fast SAS 15K RAID, w/ 12G of RAM.

start_servers = 5
min_spare_servers = 5
max_spare_servers = 10
max_requests_per_server = 0

max_request_time = 120 # just so that it never triggers
cleanup_delay = 10
max_requests = 500000 # just so that it never triggers

But increasing them further doesn't seem relevant. You mean I should 
lower max_requests etc. to the point where I start getting the "too low 
max_requests" type of errors, and then up it a bit, but not by too much?
>>>> That would at least allow it to make progress. BTW, I'm not sure why, but under comparable workloads openradius does not exhibit this problem.
>>> What are you doing with FreeRADIUS which is taking such a long time to complete? The log messages are a symptom not a cause.
>>>
>> That's hundreds if not thousands (if you count acct packets too) of requests all reaching the server nearly at the same time.
> Yes but the server doesn't *have* to try and process them all at the same time. That is one of the other reasons for retransmit period.
>
> -Arran
>
> Arran Cudbard-Bell
> a.cudbardb at freeradius.org
>
> RADIUS - Waging war on ignorance and apathy one Access-Challenge at a time.
>
>