Zombie Infestation of Log file

21 Apr 2010

Good day,
I'm trying to figure out why my servers continue to be marked zombie, even
though they continue to handle traffic. There appears to be no impact, just
seemingly erroneous - or at least unexplained - log entries.

I have three 2.1.8 servers that feed accounting to a 4th server (via
copy_acct_to_homeserver), which runs 1.1.7.  The primary servers also send
some auth (via proxy) and lots of acct (some via proxy, but also via
copy_acct_to_homeserver) to a pair of Cisco ACS servers.  The radius.log
file on each primary server shows it marking the 4th and Cisco (upstream)
servers as zombie quite regularly (but not simultaneously); they
thankfully never get marked dead.  All of these servers are attached to the
same Ethernet switch, with a slight detour through a router that does VLAN
routing between them.  The Cisco servers also proxy to various other servers
outside my network.

I have a debug output from one of the servers that I have studied at length;
I recorded the debug for 6 minutes or so before one of the servers was
marked zombie. This is from a production machine with a fair amount of
traffic, so the debug file is about 9 MB.  I'm not sure it would be
appropriate to post something that large to the mailing list, but I'd be
happy to. Do you want specific excerpts, or the whole thing?

I've set the response_window to as high as 60 seconds in the clients.conf
file and I keep the zombie_period at 20 seconds. I've also turned off the
status_check feature as 1.1.7 and Cisco ACS do not appear to support it.

The clients.conf file says that if "a" response is not received by the time
the response_window is up (so 60 seconds), the server is marked zombie.
Based on what I see, I'm interpreting this as meaning that if a single
response is not seen within 60 seconds, even if hundreds of other responses
were successfully sent and received during those 60 seconds, then the server
is marked zombie.  At this point the zombie_period kicks in, and the moment
"any" successful response is received the server is marked as completely
alive. In my case the zombie_period is canceled immediately (though sadly
the log does not seem to show when the zombie state ends).
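
To make that interpretation concrete, here is a minimal Python sketch of the
behavior as I read it. It is only a toy model of my understanding, not code
taken from FreeRADIUS; the class name, method names, and the way time is
passed in are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HomeServerModel:
    """Toy model of my reading of the docs; NOT actual FreeRADIUS logic."""
    response_window: float = 60.0   # seconds to wait for any one response
    zombie_period: float = 20.0     # seconds a zombie may stay silent
    state: str = "alive"
    zombie_since: float = 0.0
    pending: dict = field(default_factory=dict)  # request id -> send time

    def send(self, req_id, now):
        # Remember when each request was sent upstream.
        self.pending[req_id] = now

    def receive(self, req_id, now):
        # Any successful response revives a zombie server immediately.
        self.pending.pop(req_id, None)
        if self.state == "zombie":
            self.state = "alive"

    def tick(self, now):
        # One request unanswered past response_window is enough for zombie,
        # no matter how many other requests were answered in the meantime.
        expired = [r for r, sent in self.pending.items()
                   if now - sent >= self.response_window]
        for r in expired:
            del self.pending[r]
        if expired and self.state == "alive":
            self.state = "zombie"
            self.zombie_since = now
        elif (self.state == "zombie"
              and now - self.zombie_since >= self.zombie_period):
            self.state = "dead"
        return self.state
```

In this model, a server that answers hundreds of requests but leaves one
unanswered for 60 seconds still goes zombie, and the very next response
brings it back to alive. That matches what I think I am seeing in the log.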

Odd to me is that occasionally a primary server will mark the same upstream
server as zombie multiple times over a handful of seconds, but the other two
primary servers rarely mark upstream servers zombie near the same time.

I cannot find in the debug, or in packet captures, where a response went
missing for a full minute, so I'm trying to find out what is happening. My
upstream servers do not appear taxed or unresponsive. Perhaps there is some
sort of malformed response I should be looking for?  I also get radutmp
errors about a wrong NAS ID, though on brief analysis they don't appear to
be related. Any suggestions to help track this down and eliminate the error
messages would be greatly appreciated.
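
For reference, this is roughly the kind of check I have been doing by hand
against the captures, written as a small Python sketch. The input format is
an assumption on my part: a time-sorted list of (timestamp, kind, key)
tuples exported from a capture, where kind is "request" or "response" and
key is something like (source port, RADIUS ID). The function name is
hypothetical.

```python
def find_unanswered(records, window=60.0):
    """Return exchanges whose response took >= window seconds or never came.

    records: iterable of (timestamp, kind, key), sorted by timestamp.
    Result:  list of (key, sent_time, response_time or None).
    """
    outstanding = {}   # key -> time the request was first seen
    late = []
    last_ts = None
    for ts, kind, key in records:
        last_ts = ts
        if kind == "request":
            # A repeated key is treated as a retransmission, so the first
            # send time is kept. (ID reuse for a new request would be
            # misread as a retransmission; that is a known simplification.)
            outstanding.setdefault(key, ts)
        elif kind == "response":
            sent = outstanding.pop(key, None)
            if sent is not None and ts - sent >= window:
                late.append((key, sent, ts))
    # Anything still outstanding at the end of the capture, and old enough,
    # counts as a response that never arrived.
    if last_ts is not None:
        for key, sent in outstanding.items():
            if last_ts - sent >= window:
                late.append((key, sent, None))
    return late
```

Run over my captures, a check like this comes back empty, which is why I
suspect the trigger is something other than a plainly missing response.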

-Benjamin

Benjamin Marvin