recommendations for max_servers

Louis Munro lmunro at
Wed Sep 24 20:24:22 CEST 2014

Thank you all for your replies.
I detail some of the changes we have made below.

On 2014-09-23, at 13:31 , John Douglass <john.douglass at> wrote:

> When we are talking about AD, Phil Mayers had some great suggestions on improving ntlm_auth performance. Here were his recommendations:
> 1. Upgraded the radius servers. 
>  Old spec: 3Gb RAM, 2x P4-based Xeon 1 core @ 3.2GHz, RHEL5 
>  New spec: 16Gb RAM, 1x Xeon E5-2620 6 core @ 2GHz, RHEL6 

Our RADIUS server is running on a VMware host in an ESX cluster at the moment.
We moved all other VMs off the host.
It's provisioned with 24 GB of RAM and 12 cores.

>  2. Upgraded Samba - went from RHEL5 samba3x-3.5.4 to RHEL6 samba-3.6.9 

Upgraded yesterday to the latest available from Red Hat, which was 3.6.9-169 at the time.

>  3. Set "winbind max domain connections = 12" in smb.conf  (restart winbind) (we at GT actually have so many authentications, we set to 128 as we reached our limit during peak times)

We've had it running at 64 for a while. 
We had to tune the AD for it to accept this many connections. 
We based our DC settings on the advice of this article:

>  4. Forced our smb.conf to talk to specific AD controllers which are physical, not VMWare (most our DCs are VMWare)

Can you explain how you forced it to choose those DCs? 
I can't seem to get winbind to send requests to a specific DC.
It's got a mind of its own.

>  5. Spent a *lot* of time debugging and tracking the Samba->DC RPC round-trip times and hassling our AD people to keep these stable; not sure what they did, if anything. 

I actually wrote a wrapper (in C) around ntlm_auth to log the times between calling ntlm_auth and it returning a value.
This is where I found values that vary wildly, from 7 ms up to the 3000 ms ceiling (the ceiling exists because FR has ntlm_auth_timeout = 3).
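
For anyone who wants to take the same kind of measurement without writing C, here is a minimal sketch of the idea in Python. It is not the wrapper I actually run: the function name timed_run and the timeout_s parameter are made up for illustration, and the 3-second default only mirrors ntlm_auth_timeout = 3. The child inherits stdout and stderr, so whatever ntlm_auth prints still reaches the caller.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s=3.0):
    """Run cmd and return (exit_code, elapsed_ms, timed_out).

    exit_code is None when the child was killed for exceeding timeout_s.
    timed_run and timeout_s are illustrative names, not FreeRADIUS or Samba APIs.
    """
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock jumps
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        code, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired
        code, timed_out = None, True
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return code, elapsed_ms, timed_out
```

Called as timed_run(["/usr/bin/ntlm_auth", ...]) with the usual arguments, the elapsed_ms values can be written to a log and compared against the timeout.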

We later patched winbind to log the time between sending the requests to the DC and getting a reply.
Those timings are actually consistently fast now and yet the problem persists in FR. 

>  6. Increased radiusd.conf setting to "max_requests = 16384" 

I set it to 20000 long ago.

>  7. Worked really, really hard on getting the Cisco APs, AP radios and controllers to STOP CRASHING; their software quality has been abysmal, and this was a contributing factor - APs or controllers would crash under load, and this would trigger a burst of auths, which would trigger the problem. 

This part is out of my hands, but I will certainly pass your advice along...

> As Alan said before, there are lots of moving parts where issues can happen. If you improve server performance within the pieces (AD/database/winbind/etc), that's a start. 

It's pretty clear to me it's not the database. I log slow queries and check the processlist obsessively, and the database is mostly idle.

> If you are in a large scale Cisco deployment, depending on how many APs and users, you may find yourself having issues regardless. It's a hard problem to advise on, but adding additional radius servers and optimizing ours for performance has helped us immensely.

If anything, this will make me learn more about network programming.
I have taken some stack traces with gdb while the system is under load, and I have also straced the process.

I can provide those if anyone is interested.
I see most threads just doing a sem_wait while Thread 1 is doing all the work. 
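
To put a number on "most threads", a saved backtrace can be summarized by the function each thread is sitting in. The sketch below assumes the text came from gdb's "thread apply all bt" and that each thread starts with a "Thread N (" header followed by a "#0 ..." frame line; top_frames and summarize are names I made up for this example.

```python
import re
from collections import Counter

# Assumed gdb layout: "Thread 3 (Thread 0x... (LWP 1234)):" then "#0  0x... in func () ..."
THREAD_RE = re.compile(r"^Thread (\d+) \(")
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+)\s*\(")


def top_frames(bt_text):
    """Map each gdb thread number to the function name in its frame #0."""
    frames = {}
    current = None
    for line in bt_text.splitlines():
        m = THREAD_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = FRAME0_RE.match(line)
        if m and current is not None:
            frames[current] = m.group(1)
    return frames


def summarize(bt_text):
    """Count how many threads are parked in each frame-#0 function."""
    return Counter(top_frames(bt_text).values())
```

Running summarize() on several dumps taken a few seconds apart would show whether the workers stay in sem_wait for the whole burst or only at the instant of one snapshot.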

This would be easier of course if I had consistently bad performance.
As it is, things only fall apart when a significant load is reached.

There, I just got another flurry of these while replying:
Info: Child PID 26929 (/usr/bin/ntlm_auth) is taking too much time: forcing failure and killing child.
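
Since the failures come in flurries, counting these messages per minute makes it easier to line them up with load. This is only a sketch: it assumes each line in radius.log is prefixed with a ctime-style timestamp followed by " : " (for example "Wed Sep 24 20:24:22 2014 : Info: ..."), and kills_per_minute is a name invented for the example.

```python
import re
from collections import Counter

# Assumed line shape: "<Dow> <Mon> <day> HH:MM:SS YYYY : Info: Child PID <n> (<path>) is taking too much time..."
KILL_RE = re.compile(
    r"^\w{3} (\w{3} +\d+ \d{2}:\d{2}):\d{2} \d{4} : Info: Child PID \d+ "
    r"\((\S+)\) is taking too much time"
)


def kills_per_minute(lines, program="/usr/bin/ntlm_auth"):
    """Count 'taking too much time' messages for program, keyed by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        m = KILL_RE.match(line)
        if m and m.group(2) == program:
            counts[m.group(1)] += 1
    return counts
```

Passing an open log file to kills_per_minute() and sorting the result by count gives the worst minutes, which can then be compared with the request rate at those times.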

Louis Munro
lmunro at  :: 
+1.514.447.4918 x125  :: +1 (866) 353-6153 x125
Inverse inc. :: Leaders behind SOGo and PacketFence

More information about the Freeradius-Users mailing list