<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Thank you all for your replies.<div>I detail some of the changes we have made below.<br><div apple-content-edited="true">

</div>

<br><div><div>On 2014-09-23, at 13:31, John Douglass &lt;<a href="mailto:john.douglass@oit.gatech.edu">john.douglass@oit.gatech.edu</a>&gt; wrote:</div><br><blockquote type="cite">


  <div bgcolor="#FFFFFF" text="#000000">

    When we are talking about AD, Phil Mayers had some great suggestions

    on improving ntlm_auth performance. Here were his recommendations:<br>

    <br>

    1. Upgraded the radius servers.

    <br>

     Old spec: 3Gb RAM, 2x P4-based Xeon 1 core @ 3.2GHz, RHEL5

    <br>

     New spec: 16Gb RAM, 1x Xeon E5-2620 6 core @ 2GHz, RHEL6

    <br></div></blockquote><div><br></div><div>It's running on a VMware host in an ESX cluster at the moment.</div><div>We moved all other VMs off the host.</div><div>It's provisioned with 24 GB of RAM and 12 cores.</div><div><br></div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

     2. Upgraded Samba - went from RHEL5 samba3x-3.5.4 to RHEL6

    samba-3.6.9

    <br></div></blockquote><div><br></div><div>Upgraded yesterday to the latest version available from Red Hat, which was 3.6.9-169 at the time. </div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

    <br>

     3. Set "winbind max domain connections = 12" in smb.conf 

    (restart winbind) (we at GT actually have so many authentications,

    we set to 128 as we reached our limit during peak times)<br></div></blockquote><div><br></div><div>We've had it running at 64 for a while. </div><div>We had to tune the AD domain controllers to accept this many connections. </div><div>We based our DC settings on the advice in this article: </div><div><a href="http://support.microsoft.com/kb/2688798">http://support.microsoft.com/kb/2688798</a></div><div><br></div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

     4. Forced our smb.conf to talk to specific AD controllers which are

    physical, not VMWare (most our DCs are VMWare)<br></div></blockquote><div><br></div><div>Can you explain how you forced it to choose those DCs? </div><div>I can't seem to get winbind to send requests to a specific DC.</div><div>It's got a mind of its own.</div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

    <br>

     5. Spent a <b class="moz-txt-star"><span class="moz-txt-tag">*</span>lot<span class="moz-txt-tag">*</span></b> of time debugging and tracking

    the Samba->DC RPC round-trip times and hassling our AD people to

    keep these stable; not sure what they did, if anything.

<br></div></blockquote><div><br></div><div>I actually wrote a wrapper (in C) around ntlm_auth to log the time between calling ntlm_auth and it returning a value.</div><div>This is where I found values that vary wildly, from 7 ms up to the 3000 ms ceiling (because FR has ntlm_auth_timeout = 3).</div><div><br></div><div>We later patched winbind to log the time between sending a request to the DC and getting a reply.</div><div>Those timings are now consistently fast, and yet the problem persists in FR. </div><div><br></div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">
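</div></blockquote><div>For anyone who wants to take similar measurements, here is a minimal sketch of that kind of timing wrapper. It is written in Python for brevity and is not the C wrapper mentioned above; the function name <font face="Courier">timed_run</font> and the log format are made up for illustration, and the 3-second default only mirrors ntlm_auth_timeout = 3.</div>
<pre>
```python
import subprocess
import sys
import time


def timed_run(argv, timeout_s=3.0):
    """Run argv and return (exit_code, elapsed_ms); exit_code is None on timeout."""
    start = time.monotonic()
    try:
        # stdin and stdout are inherited, so the caller talks to the child as usual.
        code = subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child when this is raised.
        code = None
    elapsed_ms = (time.monotonic() - start) * 1000.0
    # Hypothetical log format: one line per call, written to stderr.
    print("%s took %.1f ms (exit=%s)" % (argv[0], elapsed_ms, code), file=sys.stderr)
    return code, elapsed_ms
```
</pre>
<div>A small script that passes the real ntlm_auth command line to <font face="Courier">timed_run</font> would log one line per authentication. Python's own startup time adds overhead, so the absolute numbers would not be comparable with those from a C wrapper; only the spread between calls is meaningful.</div><div><br></div><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">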

     6. Increased radiusd.conf setting to "max_requests = 16384"

    <br></div></blockquote><div><br></div><div>I set it to 20000 long ago.</div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

    <br>

     7. Worked really, really hard on getting the Cisco APs, AP radios

    and controllers to STOP CRASHING; their software quality has been

    abysmal, and this was a contributing factor - APs or controllers

    would crash under load, and this would trigger a burst of auths,

    which would trigger the problem.

    <br></div></blockquote><div><br></div><div>This part is out of my hands, but I will certainly pass your advice along...</div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

    <br>

    As Alan said before, there are lots of moving parts where issues can

    happen. If you improve server performance within the pieces

    (AD/database/winbind/etc), that's a start. <br></div></blockquote><div><br></div><div>It's pretty clear to me it's not the database. I log slow queries and check the processlist obsessively, and the database is mostly idle.</div><div><br></div><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">

    <br>

    If you are in a large scale Cisco deployment, depending on how many

    APs and users, you may find yourself having issues regardless. It's

    a hard problem to advise on, but adding additional radius servers

    and optimizing ours for performance has helped us immensely.<br></div></blockquote><div><br></div><div>If anything, this will make me learn more about network programming.</div><div>I have taken some stack traces with gdb while the system was under load, and I have also straced the process.</div><div><br></div><div>I can provide those if anyone is interested.</div><div>I see most threads just sitting in sem_wait while Thread 1 does all the work. </div><div><br></div><div>This would of course be easier if the performance were consistently bad.</div><div>As it is, things only fall apart once a significant load is reached.</div><div><br></div>There, I just got another flurry of these while replying:</div><div><font face="Courier">Info: Child PID 26929 (/usr/bin/ntlm_auth) is taking too much time: forcing failure and killing child.</font></div><div><br></div><div><br></div><div>Regards,<br><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">--</div><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Louis Munro<br><a href="mailto:lmunro@inverse.ca">lmunro@inverse.ca</a>  ::  <a href="http://www.inverse.ca">www.inverse.ca</a> <br>+1.514.447.4918 x125  :: +1 (866) 353-6153 x125<br>Inverse inc. :: Leaders behind SOGo (<a href="http://www.sogo.nu">www.sogo.nu</a>) and PacketFence (<a href="http://www.packetfence.org">www.packetfence.org</a>)</div></div><br></div></body></html>