LDAP timeouts during failure conditions

Wed Jun 29 18:59:05 CEST 2011

On 06/23/2011 05:28 PM, Alan DeKok wrote:
> Phil Mayers wrote:
>> So, some discussion on the JANET-ROAMING list leads me to believe that,
>> during an "ldap server down" condition, rlm_ldap will incur
>> "net_timeout" on every (or many) passes through the module.
>
>    It's better for the module to track when connections are down, and
> return quickly if all are down.

So:

https://github.com/philmayers/freeradius-server/commit/58e545bd183029da9cdb1e591cd38ca208f55f87

...this is *not* a connection pool, but an example of one way to solve
the problem: spawn a child thread to create connections.
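
As a rough illustration of that idea (this is not the code in the commit above, and every name in it is made up for the example), a manager thread can own the slow connect while request threads only read a flag and return at once when the server is down. The `fake_connect` function is a stand-in for the real connect and bind:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical shared state; none of these names exist in rlm_ldap. */
typedef struct {
    pthread_mutex_t lock;
    bool up;                    /* manager currently has a live connection */
    bool stop;                  /* ask the manager thread to exit */
    unsigned retry_secs;        /* pause between reconnect attempts */
    bool (*try_connect)(void);  /* slow, blocking connect + bind */
} conn_mgr_t;

/* Stand-in for the real connect; always succeeds in this sketch. */
bool fake_connect(void) { return true; }

void mgr_init(conn_mgr_t *m, bool (*fn)(void), unsigned retry_secs)
{
    assert(m != NULL && fn != NULL);
    pthread_mutex_init(&m->lock, NULL);
    m->up = false;
    m->stop = false;
    m->retry_secs = retry_secs;
    m->try_connect = fn;
}

/* Request threads call this; it never blocks on the network. */
bool mgr_is_up(conn_mgr_t *m)
{
    pthread_mutex_lock(&m->lock);
    bool up = m->up;
    pthread_mutex_unlock(&m->lock);
    return up;
}

/* Request threads call this when an operation fails on the connection. */
void mgr_mark_down(conn_mgr_t *m)
{
    pthread_mutex_lock(&m->lock);
    m->up = false;
    pthread_mutex_unlock(&m->lock);
}

/* One reconnect pass; the blocking call runs without the lock held. */
void mgr_step(conn_mgr_t *m)
{
    if (mgr_is_up(m))
        return;
    bool ok = m->try_connect();
    pthread_mutex_lock(&m->lock);
    m->up = ok;
    pthread_mutex_unlock(&m->lock);
}

/* Thread body: only this thread ever waits out a connect timeout. */
void *mgr_thread(void *arg)
{
    conn_mgr_t *m = arg;
    for (;;) {
        pthread_mutex_lock(&m->lock);
        bool stop = m->stop;
        pthread_mutex_unlock(&m->lock);
        if (stop)
            break;
        mgr_step(m);
        sleep(m->retry_secs);
    }
    return NULL;
}
```

A request thread would check `mgr_is_up()` first and fail the request immediately if it returns false, rather than attempting its own connect.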

I'm aware the code as-is has big problems but it might inspire something 
more useful; off the top of my head:

  * the new "failure" flag to ldap_release_conn is used too 
aggressively, meaning rlm_ldap will drop a connection in some cases it 
doesn't need to

  * it doesn't touch the eDir code - I don't have a way to test it

  * there's no way to terminate and re-start the connection manager thread

  * the connection-manager thread does not obey the "-s" command line 
argument

  * it uses a dumb sleep() rather than a semaphore to wake and commence
re-connects

...and probably lots more.
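
For the sleep() and no-way-to-terminate items, one possible shape is sketched below. It swaps in a pthread condition variable with a timed wait in place of the semaphore mentioned above, so a request thread that sees a failure can wake the manager straight away and shutdown can ask it to exit. The type and function names are hypothetical and do not come from the patch:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool reconnect_wanted;  /* set by a request thread that saw a failure */
    bool stop;              /* set to shut the manager thread down */
} mgr_wake_t;

typedef enum { WAKE_TIMEOUT, WAKE_RECONNECT, WAKE_STOP } wake_reason_t;

void wake_init(mgr_wake_t *w)
{
    assert(w != NULL);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->reconnect_wanted = false;
    w->stop = false;
}

/* Called by a request thread right after it drops a failed connection. */
void wake_request_reconnect(mgr_wake_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->reconnect_wanted = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Called at shutdown so the manager has a clean way to exit. */
void wake_request_stop(mgr_wake_t *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Manager side: block until woken, or until max_wait_secs have elapsed. */
wake_reason_t wake_wait(mgr_wake_t *w, unsigned max_wait_secs)
{
    struct timespec deadline;
    wake_reason_t why = WAKE_TIMEOUT;

    /* A default condition variable measures its deadline on CLOCK_REALTIME. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += max_wait_secs;

    pthread_mutex_lock(&w->lock);
    while (!w->stop && !w->reconnect_wanted) {
        if (pthread_cond_timedwait(&w->wake, &w->lock, &deadline) == ETIMEDOUT)
            break;
    }
    if (w->stop) {
        why = WAKE_STOP;
    } else if (w->reconnect_wanted) {
        why = WAKE_RECONNECT;
        w->reconnect_wanted = false;  /* consume the request */
    }
    pthread_mutex_unlock(&w->lock);
    return why;
}
```

The manager loop would call `wake_wait()` where it currently sleeps, exit on `WAKE_STOP`, and attempt a reconnect on either of the other two results.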

Related to this, connection re-binding in the non-async case should 
probably live inside ldap_get_conn and move out of perform_search() and 
siblings. But the diff as-is is hopefully easier to read.

This patch also doesn't solve "LDAP-Group == X" pointing at one and only 
one module. One possible way to solve that is as per Alex's suggestion, to 
manage the TCP connections ourselves (which we could do inside the 
worker thread) and when people pass in >1 hostname to the module, do 
some kind of round-robin / fastest-wins connection algorithm.
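
To make the round-robin half of that concrete, here is a minimal sketch that hands out hosts in turn and skips any marked down. It does not attempt the fastest-wins variant, it assumes a single thread (such as the connection manager) owns the structure, and `host_ring_t` and `ring_pick` are names invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HOSTS 8

/* Hypothetical list of the hostnames passed to one module instance. */
typedef struct {
    const char *name[MAX_HOSTS];
    bool        down[MAX_HOSTS];  /* set when a connect to that host fails */
    size_t      count;            /* number of hosts actually configured */
    size_t      next;             /* where the next search starts */
} host_ring_t;

/* Return the index of the next host not marked down, or -1 if all are down. */
int ring_pick(host_ring_t *r)
{
    assert(r != NULL && r->count <= MAX_HOSTS);
    for (size_t tried = 0; tried < r->count; tried++) {
        size_t i = (r->next + tried) % r->count;
        if (!r->down[i]) {
            r->next = (i + 1) % r->count;  /* start after this host next time */
            return (int)i;
        }
    }
    return -1;
}
```

A return of -1 is the "all connections down" case, which is where the module would want to return quickly instead of waiting on another connect.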

Comments welcome.