FreeRadius 2.1.12 with winbind - performance issues
diggins at mcmaster.ca
Thu Jan 22 18:59:59 CET 2015
I'm running two virtualized (ESXi) RedHat (v.5) instances of FreeRadius (2.1.12-4.el5_8) and winbind (3.5.10-0.110.el5_8), using ntlm_auth for authentication, and I've been running into performance issues. The systems mainly handle authentication traffic from a number of Cisco WLAN controllers (5508 series) serving a university with about 12000 active users during peak load. During class change, the CPU usage of my radiusd process can rise from 10% to over 100%. During those times I see a variety of symptoms:
1. radiusd: Discarding duplicate request from client xxx port 32769 - ID: 71 due to unfinished request 2860332 (very common)
2. WARNING: Please check the configuration file. The value for 'max_requests' is probably set too low.
3. Dropping request (1025 is too many): from client xxx port 32769 - ID: 155
4. At its worst, the WLAN controllers will fail over to the secondary radius server (back and forth too)
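To see how these symptoms line up with class change, the log messages can be tallied per minute. The following is a minimal sketch, not a tested tool: it assumes radius.log lines begin with a timestamp of the form "Thu Jan 22 10:50:01 2015 : ", and the function name `tally` and the symptom labels are made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout: "Thu Jan 22 10:50:01 2015 : Error: <message>".
# Group 1 keeps the timestamp truncated to the minute; group 2 is the rest.
LINE_RE = re.compile(r"^(\w{3} \w{3} +\d+ \d{2}:\d{2}):\d{2} \d{4} : (.*)$")

# Hypothetical labels mapped to substrings of the messages quoted above.
PATTERNS = {
    "duplicate": "Discarding duplicate request",
    "dropped": "Dropping request",
    "max_requests": "'max_requests' is probably set too low",
}

def tally(lines):
    """Count each symptom per minute; returns a Counter keyed by (minute, label)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        minute, message = match.groups()
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[(minute, label)] += 1
    return counts
```

Passing an open log file to `tally` and printing the sorted items would show which minutes carry the bulk of the duplicate and dropped requests.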
I read through many of the posts on this list on the same topic and followed some of those recommendations. So far:
1. Increased my system resources (now 2 cores, 2G ram, previously 1 core, 1G ram)
2. Increased 'max_requests' to 2048 (from 1048)
3. Added the line "winbind max domain connections = 10" to samba smb.conf.
4. Increased the radius server timeout on the WLAN controller from 2 seconds to 5 seconds (recommended by Cisco)
I've seen some improvement. I no longer see the complaints about increasing max_requests size (or dropping requests). WLAN controllers are no longer failing over to the other radius server, at least not as often. I still see many of the "Discarding duplicate request" messages, and radiusd CPU utilization still goes very high during class change (>100%). Overall, system load averages have improved.
My feeling is that increasing system resources again isn't going to make a significant improvement and I'm considering adding a third server.
From what I've read, radiusd should be able to handle a large number of simultaneous requests, and it's most likely the back end database that is slow (in my case winbind). Watching the process list, I see very few ntlm_auth child processes created and they don't last very long. The CPU usage of the winbind process never climbs above 3%.
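A rough way to check whether the back end is the slow part would be to time ntlm_auth directly during a busy period and during a quiet one, then compare. The sketch below only measures wall-clock time for repeated runs of a command; the helper names `time_command` and `summarize` are hypothetical, and the domain, username and password in the commented example are placeholders.

```python
import statistics
import subprocess
import time

def time_command(argv, runs=20):
    """Run argv `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Return run count, median and worst-case duration."""
    ordered = sorted(durations)
    return {
        "runs": len(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }

# Example (placeholder credentials, assumed test account):
# argv = ["ntlm_auth", "--request-nt-key", "--domain=EXAMPLE",
#         "--username=testuser", "--password=secret"]
# print(summarize(time_command(argv)))
```

If the median stays low at peak load while radiusd is still saturated, that would point away from winbind. Note that a password given on the command line is visible in the process list, so a throwaway test account is the safer choice.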
1. How could I tell if winbind is slowing the system down?
2. Would switching to Kerberos for authentication instead of winbind help?
3. Would upgrading to the latest versions of FreeRadius and Winbind likely help (i.e. are there known improvements that would make a difference)?
4. Can anyone suggest other improvements I could make?