FreeRadius 2.1.12 with winbind - performance issues

John Douglass john.douglass at oit.gatech.edu
Thu Jan 22 23:18:16 CET 2015


On 01/22/2015 04:42 PM, Matthew Newton wrote:
> On Thu, Jan 22, 2015 at 05:59:59PM +0000, Diggins Mike wrote:
>> I'm running two virtualized (ESXi) RedHat (v.5) version of
>> FreeRadius (2.1.12-4.el5_8) and winbind (3.5.10-0.110.el5_8)
>> using ntlm_auth for authentication and I've been running into
>> performance issues.
> Welcome to the club.
Hehe, I've been doing EXTENSIVE tweaking on our end and I still haven't
found any magic numbers. We are currently on Samba 3.6 but moving to 4.1.14
VERY soon to address some performance issues, namely that the number of
winbind processes only ever increases. The fix that kills off connections
idle for more than X seconds (default 60s) only exists from 4.1.12 onward.
When you hit the max number of DC connections, winbind stops being able to
authenticate and just crashes and burns. Restarting winbind when it gets
near its threshold seems to help.
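
For anyone wanting to automate that restart check, here is a minimal Python
sketch of the decision only. It assumes the daemon shows up in the process
list as "winbindd", and the limit and the 90% margin are hypothetical values
you would tune to your own setup; the restart itself is left out because it
depends on the init system.

```python
import subprocess


def count_winbindd(ps_output: str) -> int:
    """Count lines of `ps -e -o comm=` output that name a winbindd process."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == "winbindd")


def should_restart(process_count: int, limit: int, margin: float = 0.9) -> bool:
    """Return True once the count reaches `margin` of the (assumed) limit."""
    return process_count >= limit * margin


def current_winbindd_count() -> int:
    """Ask ps for all command names and count the winbindd entries."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_winbindd(result.stdout)
```

Something like should_restart(current_winbindd_count(), limit=200) could then
be run from cron, where 200 is only a placeholder for whatever threshold you
observe on your own servers.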
>> The systems mainly handle authentication traffic from a number
>> of Cisco WLAN controllers (5508 series) serving a University
>> with about 12000 active users during peak load. During class
>> change my radiusd process id CPU usage can rise from 10% to over
>> 100%.
> Run more RADIUS servers, split the load from the controllers
> across them. The WLCs will run out of RADIUS IDs with that number
> of auths. Cisco "issue".
We have been working very closely with Cisco and have a pre-alpha 8.x
controller release we are testing that directly addresses this issue,
but does not completely fix it. We have seen a definite decrease in the
issues between controllers and radius servers, but the back end (Radius ->
AD) seems to be the issue now.

We did have to add a number of temporary controllers and radius servers
in order to alleviate the radius-id issue on the Cisco side, and have
yet to solve the backend radius->AD issues (samba winbind) so that we
feel comfortable starting to load up our controllers with more and more
APs.

But we are in WAYYYY far better shape than last year at this time.
>> 1. radiusd[2042]: Discarding duplicate request from client xxx
>> port 32769 - ID: 71 due to unfinished request 2860332 (very
>> common)
> ...
>> 3. Dropping request (1025 is too many): from client xxx port
>> 32769 - ID: 155
>> 4. At its worst, the WLAN controllers will fail
>> over to the secondary radius server (back and forth too)
> Disable aggressive failover on the controllers to dissuade
> them from jumping between RADIUS servers.
>
>     config radius aggressive-failover disable
>
>
>> 1. Increased my system resources (now 2 cores, 2G ram,
>> previously 1 core, 1G ram)
> We went from 512m to 1g RAM. I don't believe it makes any
> difference - FR doesn't use the RAM and there is stacks free.
>
>
>> 2. Increased 'max_requests' to 2048 (from 1048)
> Might as well. Went to 4096. Higher is only going to use a bit
> more memory.
>
>
>> 3. Add the line "winbind max domain connections = 10" to samba
>> smb.conf.
> With Samba 3, I don't believe this will make any real difference.
As long as you're on 3.6 or later, it makes some difference. On some of my
servers, however, something causes spikes, sometimes to upwards of 200
connections.

Raising "winbind max domain connections" helped, but there are still some
issues somewhere in our network or on the AD servers that I am working
through.
>
>> 4. Increased the radius server timeout on the WLAN controller
>> from 2 seconds to 5 seconds (recommended by Cisco)
> Yes - we went to 10s.
>
>> I still see many of the "Discarding duplicate requests" and
>> radiusd CPU utilization still goes very high during class change
>> (>100%). Overall, system load averages have improved.
> I'm surprised you see CPU go up. We've only peaked at ~8000
> concurrent clients, but CPU usage is pretty low. Around 30% when
> busy.
>
>> My feeling is that increasing system resources again isn't going
>> to make a significant improvement and I'm considering adding a
>> third server. 
> My consideration is to have a RADIUS server per controller. Then
> configure each controller to failover between two of them. The
> worst is to end up with all your controllers on one RADIUS server,
> which they will do if you give them the chance.
>
>> 1. How could I tell if winbind is slowing the system down? 
> I think the bottleneck is in the calls to ntlm_auth, and winbind.
>
> I've been working on patches to FR and samba to get FR to call
> winbind directly rather than have to exec ntlm_auth. It shaves a
> lot of time off not doing an exec, but the patches aren't merged
> yet.
>
> They are stable in my testing, and if you're able to patch, compile
> and test it would be great to know if it helps your situation. I'm
> still waiting for our traffic to build up again after the
> Christmas break (give it another week or two after the exams).
Here at Georgia Tech we would absolutely be willing to patch, test, and
compile any possible performance fixes between FR and winbind/samba. I
have the knowledge, mandate, and testing infrastructure. We even have
performance graphs on packets, radius logs, etc so we can verify
performance and add/remove load if it breaks things.
>
> The patch for FR2 is simple. The patch for Samba (3 or 4) is
> required because the libwbclient library is not currently
> thread-safe. Putting a mutex around the auth call rather defeats
> the point...
>
> Alternatively, there's a second patch for FR3 that uses ntlm_auth
> in socket mode. This saves the exec time and doesn't need patching
> Samba, but won't backport to FR2.
>
>> 2. Would switching to Kerberos for authentication instead of winbind help? 
> How? Move all your users to EAP/TTLS-PAP?
>
>> 3. Would upgrading to the latest versions of FreeRadius and
>> Winbind likely help (i.e. are there known improvements that
>> would make a difference)?
> FreeRADIUS, no, but do it anyway.
>
> Winbind - I believe there are good things in 4.2. Just haven't
> tested here yet.
>
> You have to remember that for every authentication, winbind
> maintains a cache of windows SID to linux UID, and writes this to
> disk. Given this is totally unnecessary for FreeRADIUS, I'd really
> like to find a way to stop it...
>
>> 4. Can anyone suggest other improvements I could make?
> Feed your logs into elasticsearch. Then you can get really
> depressed as you can actually see how bad it really is :). It's
> obvious that auths top out at around 30 per second. Any more, you
> get trouble.
>
> Matthew
>
>
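
On the "auths per second" point: if you don't have elasticsearch handy, the
same number can be pulled out with a few lines of Python. This is only a
sketch, and it assumes you have already extracted one datetime per
authentication from your own logs (the log format isn't shown in this thread,
so that parsing step is left out).

```python
from collections import Counter
from datetime import datetime


def peak_auths_per_second(timestamps):
    """Return (second, count) for the busiest one-second bucket, or None if empty."""
    # Truncate each timestamp to whole seconds and count how many fall in each.
    buckets = Counter(ts.replace(microsecond=0) for ts in timestamps)
    if not buckets:
        return None
    # Highest count wins; ties go to the later second.
    return max(buckets.items(), key=lambda item: (item[1], item[0]))
```

Comparing the peak against the roughly 30 per second mentioned above would
show how close a server is to the point where trouble starts.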
I highly recommend moving from Samba 3.x to Samba 4.x. I'm testing the
enterprisesamba.com 4.1.12 build, which includes a fix for the winbind
request timeout:

https://www.samba.org/samba/history/samba-4.1.12.html

New parameter "winbind request timeout" has been added (bug #3204). Please
   see smb.conf man page for details.

Because we only use a small portion of samba (namely the domain join and
the winbind code) I find that we can survive (and in our case MUST
survive) by moving to a more bleeding edge version of samba.

- JohnD
