Zombie Clarification

Mon Mar 12 01:37:24 CET 2012

>  The "zombie" state is there for a reason.  Ignore zombies at your peril.

Well, I understand how the alive/zombie/dead process SHOULD work, but
I'm having trouble lining it up with what we're seeing. We're proxying
to a Windows NPS box. Here's the proxy config:

home_server ias-1 {
	type = auth+acct
	ipaddr = 192.168.10.11
	port = 1812
	secret = "..."
	require_message_authenticator = yes
	response_window = 5
	zombie_period = 20
	revive_interval = 120
	status_check = request
	username = "ping_user"
	password = "bad_password"
	check_interval = 10
	num_answers_to_alive = 3

	coa {
		irt = 2
		mrt = 16
		mrc = 5
		mrd = 30
	}

	limit {
		max_connections = 16
		max_requests = 0
		lifetime = 0
		idle_timeout = 0
	}
}

Now, for whatever reason, the Windows box decides to discard some
requests. Unfortunately, its error reporting is pretty weak
("discarding invalid request"). Our Windows guys are digging into
this. It seems to be client-specific; we suspect it has something to
do with our recently changed certificate.

FreeRADIUS is dropping into zombie state, which is expected given that
the home server is discarding requests. However, our logs and packet
captures indicate that the home server never drops the "ping_user"
status checks that FreeRADIUS uses to determine the home server state.
Even so, our FreeRADIUS logs show the home server being flagged 'dead'
almost immediately after becoming zombie:

Sun Mar 11 20:32:26 2012 : Proxy: Marking home server 192.168.10.11
port 1812 as zombie (it looks like it is dead).
Sun Mar 11 20:32:26 2012 : Proxy: Received response to status check
133400 (1 in current sequence)
Sun Mar 11 20:32:27 2012 : Proxy: Marking home server 192.168.10.11
port 1812 as dead.
Sun Mar 11 20:32:27 2012 : Error: Discarding duplicate request from
client aerohive-aps port 49034 - ID: 113 due to unfinished request
133390
Sun Mar 11 20:32:39 2012 : Proxy: Received response to status check
133441 (1 in current sequence)
Sun Mar 11 20:32:39 2012 : Error: Discarding duplicate request from
client aerohive-aps port 49034 - ID: 113 due to unfinished request
133390
Sun Mar 11 20:32:46 2012 : Error: Discarding duplicate request from
client aerohive-aps port 46715 - ID: 29 due to unfinished request
133448
Sun Mar 11 20:32:47 2012 : Proxy: Received response to status check
133468 (2 in current sequence)
Sun Mar 11 20:32:58 2012 : Error: Discarding duplicate request from
client aerohive-aps port 46715 - ID: 29 due to unfinished request
133448
Sun Mar 11 20:32:59 2012 : Proxy: Received response to status check
133489 (3 in current sequence)
Sun Mar 11 20:32:59 2012 : Proxy: Marking home server 192.168.10.11
port 1812 alive
Sun Mar 11 20:33:23 2012 : Proxy: Marking home server 192.168.10.11
port 1812 as zombie (it looks like it is dead).
Sun Mar 11 20:33:23 2012 : Proxy: Received response to status check
133580 (1 in current sequence)
Sun Mar 11 20:33:27 2012 : Proxy: Marking home server 192.168.10.11
port 1812 as dead.
Sun Mar 11 20:33:35 2012 : Proxy: Received response to status check
133621 (1 in current sequence)
Sun Mar 11 20:33:51 2012 : Proxy: Received response to status check
133668 (2 in current sequence)
Sun Mar 11 20:33:56 2012 : Proxy: Marking home server 192.168.10.12
port 1812 as zombie (it looks like it is dead).
Sun Mar 11 20:33:56 2012 : Proxy: Received response to status check
133686 (1 in current sequence)
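
For anyone who wants to measure the zombie-to-dead gap from a log like
the one above, here is a rough Python sketch. It assumes the log format
shown in this excerpt (including lines wrapped by the mail client); the
function names are made up for illustration and are not part of
FreeRADIUS.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. "Sun Mar 11 20:32:26 2012 : "
TS = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) : (.*)$")
# State-change messages: "... as zombie", "... as dead", "... alive"
MARK = re.compile(r"Marking home server (\S+) port (\d+) (?:as )?(zombie|dead|alive)")


def unwrap(lines):
    """Rejoin log entries that were wrapped onto continuation lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TS.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def zombie_to_dead_gaps(lines):
    """Return (server, seconds) for every zombie -> dead transition."""
    zombie_since = {}
    gaps = []
    for entry in unwrap(lines):
        m = TS.match(entry)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        mark = MARK.search(m.group(2))
        if not mark:
            continue
        server = f"{mark.group(1)}:{mark.group(2)}"
        state = mark.group(3)
        if state == "zombie":
            zombie_since[server] = when
        elif state == "dead" and server in zombie_since:
            delta = when - zombie_since.pop(server)
            gaps.append((server, delta.total_seconds()))
        elif state == "alive":
            zombie_since.pop(server, None)
    return gaps
```

Feeding it the lines of the radius log (for example via
open(path).readlines()) gives one entry per zombie-to-dead transition,
which can then be compared against the configured zombie_period.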

Why is the server going into zombie state at 20:32:26 and then being
marked dead one second later, at 20:32:27? Shouldn't it wait for the
entire zombie_period (20 seconds here) before being marked dead?
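
To make that expectation concrete, here is a toy Python model of the
alive/zombie/dead cycle as this question assumes it works. It is only a
sketch of the assumed behaviour, not FreeRADIUS's actual implementation;
the class and method names are hypothetical, and the defaults are taken
from the config above.

```python
class HomeServerModel:
    """Assumed alive -> zombie -> dead -> alive cycle (not FreeRADIUS code)."""

    def __init__(self, response_window=5, zombie_period=20,
                 num_answers_to_alive=3):
        self.response_window = response_window
        self.zombie_period = zombie_period
        self.num_answers_to_alive = num_answers_to_alive
        self.state = "alive"
        self.zombie_since = None
        self.answers = 0

    def request_timed_out(self, now, waited):
        # Assumption: a proxied request unanswered for response_window
        # seconds moves an alive server to zombie.
        if self.state == "alive" and waited >= self.response_window:
            self.state = "zombie"
            self.zombie_since = now

    def tick(self, now):
        # Assumption: a zombie is only declared dead once the whole
        # zombie_period has elapsed.
        if (self.state == "zombie"
                and now - self.zombie_since >= self.zombie_period):
            self.state = "dead"
            self.answers = 0

    def status_check_answered(self):
        # Assumption: num_answers_to_alive status-check responses
        # bring a dead server back to alive.
        if self.state == "dead":
            self.answers += 1
            if self.answers >= self.num_answers_to_alive:
                self.state = "alive"
```

Under this model, a server that turns zombie at t=0 would still be
zombie at t=1 and would only be marked dead at t=20, which is not what
the log above shows.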

Thanks,

Norman Elton