DHCP code in 2.0.4+

Tue Jun 9 08:06:46 CEST 2009

Karl Auer wrote:
>>   The fail-over protocol does not work.  Full-stop.
> 
> Unless you come up with some very clever definition of "does not work",
> that's just plain wrong, Alan.  It clearly *does* work, most of the time
> for most of the people, and has been doing so in enterprises large and
> small for many years.

  It does something.  But it doesn't meet the goal of reliability.

  My issue is less with the protocol itself than with the belief that it
*does* work.  My experience with it has been less than positive.

> The fact that I haven't had a serious failure in the last eight years or
> so is a pretty good indicator to me that the protocol is robust
> *enough*.

  You've been lucky.  See the RELNOTES file included with ISC DHCP for a
series of bug fixes to the protocol.  Both the implementation and the
protocol design have been changed substantially to avoid issues seen by
real, live people in the field.

> That is true of any protocol you care to name. It's also an unanswerable
> non-argument. Does inspection of the DHCP failover protocol reveal a
> theoretical failure mode to you?

  Yes.  A few quick tests demonstrated that failure.  See earlier
messages in this thread.

> Or is it that the ISC DHCP implementation that has exhibited failures?

  They've *admitted* failure.  Publicly.

---
FAILOVER:  As of version 3.0.4, ISC has included a fix for an insidious
bug in the failover implementation which, if left unchecked, could
result in tying up all leases in transitional states (such as released,
reset, or expired).  The crux of the problem is the lack of
retransmission of leases that rest in these states.  The only way to
solve this problem is to carry additional state on the lease data
structures to indicate acknowledgement state.
---

  That doesn't inspire confidence.  It's not just a bug, which even
FreeRADIUS has had from time to time.  The entire design of the protocol
has mutated and changed based on discovery of something they missed...
YEARS after the protocol was implemented.  See also the massive changes
in the protocol between 3.0 and 3.1.
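
  To illustrate what "additional state on the lease data structures to
indicate acknowledgement state" could mean, here is a minimal sketch.
It is not ISC's code.  The names (Lease, FailoverSender, peer_acked,
send) are all made up for the example, and the real protocol carries
much more than this.  The point is only that a lease whose state change
has not been acknowledged by the peer stays marked as pending and is
re-sent, instead of being sent once and forgotten.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    address: str
    state: str = "free"
    # Hypothetical flag: has the peer acknowledged the current state?
    peer_acked: bool = True


class FailoverSender:
    """Tracks which lease updates the failover peer has acknowledged."""

    def __init__(self, send):
        # send is an assumed callable(address, state) that transmits
        # one binding update to the peer.
        self.send = send
        self.leases = {}

    def change_state(self, address, state):
        lease = self.leases.setdefault(address, Lease(address))
        lease.state = state
        lease.peer_acked = False      # pending until the peer acks it
        self.send(address, state)

    def on_ack(self, address, state):
        lease = self.leases.get(address)
        # Ignore stale acks for a state the lease has since left.
        if lease is not None and lease.state == state:
            lease.peer_acked = True

    def retransmit_pending(self):
        """Re-send every update the peer has not acknowledged yet."""
        pending = [l for l in self.leases.values() if not l.peer_acked]
        for lease in pending:
            self.send(lease.address, lease.state)
        return len(pending)
```

  Calling retransmit_pending() from a periodic timer is what keeps a
lease from sitting in "released" or "expired" forever when the first
update is lost.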

  It's just like the duplicate detection cache implemented in FreeRADIUS
nearly 10 years ago, and by myself in other servers before that.  Yet it
was only with the recent publication of RFC 5080 that some major
commercial servers went "Oh, that's a good idea...", and implemented it.

  Until they did, they were subject to a number of *known* problems.
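
  For readers who haven't seen one, a duplicate detection cache can be
sketched roughly as below.  This is an illustration only, not the
FreeRADIUS implementation.  The class and function names are invented,
the key fields (source IP, source port, packet ID, Request
Authenticator) follow my reading of RFC 5080, and the 30 second
lifetime is an assumed value.  A retransmitted request gets the cached
reply back instead of being processed a second time.

```python
import time


class DuplicateCache:
    """Cache of recent replies, keyed by what identifies a retransmit."""

    def __init__(self, lifetime=30.0, clock=time.monotonic):
        self.lifetime = lifetime      # seconds to remember a reply
        self.clock = clock            # injectable for testing
        self.entries = {}             # key -> (expiry time, reply)

    def lookup(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expiry, reply = entry
        if self.clock() >= expiry:
            # Entry is too old; treat the request as new.
            del self.entries[key]
            return None
        return reply

    def store(self, key, reply):
        self.entries[key] = (self.clock() + self.lifetime, reply)


def handle_request(cache, src_ip, src_port, packet_id, authenticator,
                   process):
    """Return the cached reply for a duplicate, else process and cache.

    process is an assumed zero-argument callable that does the real
    (expensive, possibly non-idempotent) work and returns the reply.
    """
    key = (src_ip, src_port, packet_id, authenticator)
    cached = cache.lookup(key)
    if cached is not None:
        return cached
    reply = process()
    cache.store(key, reply)
    return reply
```

  Without the cache, every client retransmit is processed again from
scratch, which is the kind of known problem referred to above.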

> That's possible. That never occurred to me, because it is allegedly
> interoperable with ISC DHCP. I will ask!

http://www.tolly.net/ts/2008/Nominum/DHCP2.2/Tolly208319NominumDHCP.pdf

  They might be inter-operable.  The major performance difference
between the two proves to me that the protocol between the Nominum
servers is *not* the same as the one used between ISC servers, or
between ISC and Nominum.

  i.e. ISC claims to implement the protocol.  If its performance is so
much worse than Nominum's, then either (a) ISC didn't implement the
protocol as spec'd, or (b) Nominum didn't.

  And much of the rest of the performance difference is due to ISC *not*
using simple algorithms like "dynamic hash tables".

  It's almost like ISC is *deliberately* bad, to make people go to
Nominum.  That's OK.  It leaves a window of opportunity for me, to
create a DHCP server that *isn't* deliberately bad.

> It almost always works. It works *by far* most of the time. Even with
> ISC DHCP. To the point where I have not ever seen it fail except due to
> bugs in an implementation. My experience is not all-encompassing -
> perhaps you have seen it fail when the protocol was properly
> implemented.

  Yes.

> Yes. Or rather, it's delays in the operation of the failover protocol as
> implemented in ISC DHCP. Or I believe it to be - feel free to educate me
> otherwise.

  I really don't know.  I'm happy to say that both the protocol and the
implementation are "less than optimal".

>>   And I'll bet money that Nominum is getting such high performance by
>> doing the kind of optimizations I'm talking about.
> 
> That could be. That is, their failover implementation may not follow the
> draft standard. However, if they were going to go non-standard, why not
> develop their own mechanism entirely? But I will ask them about this.

  I'm sure that they developed their own standard for communication
between Nominum servers.

  Alan DeKok.