FreeRadius 2.1.12 with winbind - performance issues

Matthew Newton mcn4 at leicester.ac.uk
Thu Jan 22 22:42:25 CET 2015


On Thu, Jan 22, 2015 at 05:59:59PM +0000, Diggins Mike wrote:
> I'm running two virtualized (ESXi) RedHat (v.5) version of
> FreeRadius (2.1.12-4.el5_8) and winbind (3.5.10-0.110.el5_8)
> using ntlm_auth for authentication and I've been running into
> performance issues.

Welcome to the club.

> The systems mainly handle authentication traffic from a number
> of Cisco WLAN controllers (5508 series) serving a University
> with about 12000 active users during peak load. During class
> change my radiusd process id CPU usage can rise from 10% to over
> 100%.

Run more RADIUS servers, split the load from the controllers
across them. The WLCs will run out of RADIUS IDs with that number
of auths. Cisco "issue".
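
As a rough illustration of why the ID space bites: the RADIUS Identifier field is one octet, so there can be at most 256 requests outstanding per source port towards a given server. The constant port 32769 in the log lines quoted in this thread suggests a single source port per controller, though that is an inference. By Little's law, the ceiling is roughly the IDs available divided by the response time. The sketch below is only back-of-envelope arithmetic; the latencies and the `packets_per_auth` figure are made-up inputs, not measurements.

```python
RADIUS_ID_SPACE = 256  # the Identifier field is a single octet (RFC 2865)


def max_request_rate(response_time_s, source_ports=1):
    """Upper bound on Access-Requests per second (Little's law).

    At most 256 requests can be outstanding per source port, so the
    sustainable rate is the number of in-flight slots divided by the time
    each request occupies a slot.
    """
    if response_time_s <= 0:
        raise ValueError("response_time_s must be positive")
    return RADIUS_ID_SPACE * source_ports / response_time_s


def max_auth_rate(response_time_s, packets_per_auth, source_ports=1):
    """Same bound expressed in complete authentications per second.

    packets_per_auth is an assumed number of Access-Request round trips
    per EAP authentication; measure your own value.
    """
    if packets_per_auth <= 0:
        raise ValueError("packets_per_auth must be positive")
    return max_request_rate(response_time_s, source_ports) / packets_per_auth


if __name__ == "__main__":
    # Hypothetical latencies: healthy, slow, and stalled until a 5 s timeout.
    for latency in (0.1, 1.0, 5.0):
        print(latency,
              max_request_rate(latency),
              max_auth_rate(latency, packets_per_auth=8))
```

The point is that once responses slow down, the ceiling drops quickly, which is why spreading controllers over more servers helps even when each server looks idle.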

> 1. radiusd[2042]: Discarding duplicate request from client xxx
> port 32769 - ID: 71 due to unfinished request 2860332 (very
> common)
...
> 3. Dropping request (1025 is too many): from client xxx port
> 32769 - ID: 155
> 4. At its worst, the WLAN controllers will fail over to the
> secondary radius server (back and forth too)

Disable aggressive failover on the controllers to dissuade
them from jumping between RADIUS servers.

    config radius aggressive-failover disable


> 1. Increased my system resources (now 2 cores, 2G ram,
> previously 1 core, 1G ram)

We went from 512 MB to 1 GB of RAM. I don't believe it makes any
difference: FR doesn't use the RAM and there is plenty free.


> 2. Increased 'max_requests' to 2048 (from 1048)

Might as well. Went to 4096. Higher is only going to use a bit
more memory.


> 3. Add the line "winbind max domain connections = 10" to samba
> smb.conf.

With Samba 3, I don't believe this will make any real difference.


> 4. Increased the radius server timeout on the WLAN controller
> from 2 seconds to 5 seconds (recommended by Cisco)

Yes - we went to 10s.

> I still see many of the "Discarding duplicate requests" and
> radiusd CPU utilization still goes very high during class change
> (>100%). Overall, system load averages have improved.

I'm surprised you see CPU go up. We've only peaked at ~8000
concurrent clients, but CPU usage is pretty low. Around 30% when
busy.

> My feeling is that increasing system resources again isn't going
> to make a significant improvement and I'm considering adding a
> third server. 

I'm considering having a RADIUS server per controller, then
configuring each controller to fail over between two of them. The
worst case is ending up with all your controllers on one RADIUS
server, which they will do if you give them the chance.
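
One way to lay out such a plan is a simple rotation, sketched below. This is only an illustration of the idea, not something we run; the controller and server names are hypothetical. Each controller gets its own primary, and its secondary is the next server along, so a single failure moves one controller onto a neighbour rather than piling everything onto one box.

```python
def assign_servers(controllers, servers):
    """Give each controller a (primary, secondary) pair by rotation.

    With at least as many servers as controllers, every controller has a
    distinct primary, and the secondary is always a different server.
    """
    if len(servers) < 2:
        raise ValueError("need at least two RADIUS servers for failover")
    plan = {}
    for index, controller in enumerate(controllers):
        primary = servers[index % len(servers)]
        secondary = servers[(index + 1) % len(servers)]
        plan[controller] = (primary, secondary)
    return plan


if __name__ == "__main__":
    # Hypothetical names, purely for illustration.
    plan = assign_servers(["wlc1", "wlc2", "wlc3"],
                          ["radius1", "radius2", "radius3"])
    for controller, (primary, secondary) in plan.items():
        print(controller, primary, secondary)
```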

> 1. How could I tell if winbind is slowing the system down? 

I think the bottleneck is in the calls to ntlm_auth, and winbind.

I've been working on patches to FR and samba to get FR to call
winbind directly rather than have to exec ntlm_auth. It shaves a
lot of time off not doing an exec, but the patches aren't merged
yet.

They are stable in my testing, and if you're able to patch, compile
and test, it would be great to know if it helps your situation. I'm
still waiting for our traffic to build up again after the
Christmas break (give it another week or two after the exams).

The patch for FR2 is simple. The patch for Samba (3 or 4) is
required because the libwbclient library is not currently
thread-safe. Putting a mutex around the auth call rather defeats
the point...

Alternatively, there's a second patch for FR3 that uses ntlm_auth
in socket mode. This saves the exec time and doesn't need patching
Samba, but won't backport to FR2.
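
To get a feel for what the exec costs, here is a minimal Python sketch that compares spawning one process per request with keeping a single long-lived helper and talking to it over a pipe. It is not the FreeRADIUS or Samba patch, and the echo helper is only a stand-in for ntlm_auth; all function names are hypothetical.

```python
import subprocess
import sys
import time

# Stand-in helper: echoes every input line back. It is NOT ntlm_auth.
ECHO_LOOP = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)
HELPER_CMD = [sys.executable, "-c", ECHO_LOOP]


def exec_per_request(requests):
    """Start a fresh helper process for every request (the exec model)."""
    replies = []
    for request in requests:
        done = subprocess.run(HELPER_CMD, input=request + "\n",
                              capture_output=True, text=True, check=True)
        replies.append(done.stdout.rstrip("\n"))
    return replies


def persistent_helper(requests):
    """Start one helper and reuse it for all requests over its pipes."""
    replies = []
    with subprocess.Popen(HELPER_CMD, stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True) as helper:
        for request in requests:
            helper.stdin.write(request + "\n")
            helper.stdin.flush()
            replies.append(helper.stdout.readline().rstrip("\n"))
    return replies


def timed(func, requests):
    """Return (replies, elapsed_seconds) for one strategy."""
    start = time.perf_counter()
    replies = func(requests)
    return replies, time.perf_counter() - start


if __name__ == "__main__":
    sample = ["request-%d" % n for n in range(20)]
    for strategy in (exec_per_request, persistent_helper):
        _, elapsed = timed(strategy, sample)
        print(strategy.__name__, round(elapsed, 3), "seconds")
```

The absolute numbers mean nothing for a real server, but the gap between the two strategies shows the per-request process start-up cost that both patches are trying to avoid.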

> 2. Would switching to Kerberos for authentication instead of winbind help? 

How? Move all your users to EAP/TTLS-PAP?

> 3. Would upgrading to the latest versions of FreeRadius and
> Winbind likely help (i.e. are there known improvements that
> would make a difference)?

FreeRADIUS, no, but do it anyway.

Winbind - I believe there are good things in 4.2. Just haven't
tested here yet.

You have to remember that for every authentication, winbind
maintains a cache of Windows SID to Linux UID mappings, and writes
it to disk. Given this is totally unnecessary for FreeRADIUS, I'd
really like to find a way to stop it...

> 4. Can anyone suggest other improvements I could make?

Feed your logs into elasticsearch. Then you can get really
depressed as you can actually see how bad it really is :). It's
obvious that auths top out at around 30 per second. Any more, you
get trouble.
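
If elasticsearch isn't to hand, a rough per-second count can be pulled straight out of radius.log. The sketch below is a minimal example, assuming auth logging is enabled and that lines look like "Thu Jan 22 17:59:59 2015 : Auth: Login OK: ..."; that layout is an assumption, so adjust the parsing if your log differs. The function names are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout at the start of each radius.log line.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def auths_per_second(lines):
    """Count 'Auth:' log lines per one-second timestamp."""
    counts = Counter()
    for line in lines:
        stamp, sep, message = line.partition(" : ")
        if not sep or not message.startswith("Auth:"):
            continue  # not an auth line in the assumed layout
        try:
            when = datetime.strptime(stamp.strip(), TIMESTAMP_FORMAT)
        except ValueError:
            continue  # timestamp did not match the assumed format
        counts[when] += 1
    return counts


def peak_rate(counts):
    """Return (timestamp, count) for the busiest second, or None."""
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    import sys
    # Usage: python auth_rate.py /path/to/radius.log
    with open(sys.argv[1], errors="replace") as handle:
        print(peak_rate(auths_per_second(handle)))
```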

Matthew


-- 
Matthew Newton, Ph.D. <mcn4 at le.ac.uk>

Systems Specialist, Infrastructure Services,
I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom

For IT help contact helpdesk extn. 2253, <ithelp at le.ac.uk>

