Good day,<br>
I'm trying to figure out why my servers continue to be marked zombie,
even though they continue to handle traffic. There appears to be no
impact, just seemingly erroneous - or at least unexplained - log
entries.<br>
<br>
I have three 2.1.8 servers that feed accounting to a 4th server (via
copy_acct_to_homeserver), running 1.1.7. The primary servers also send
some auth (via proxy) and lots of acct (some via proxy, but also via
copy_acct_to_homeserver) to a pair of Cisco ACS servers. The radius.log
files on the primary servers show they are marking the 4th and Cisco
(upstream) servers as zombie quite regularly (but not simultaneously);
they thankfully never get marked dead. All of these servers are
attached to the same Ethernet Switch with a slight detour through a
Router that does VLAN routing between them. The Cisco servers also proxy
to various other servers outside my network.<br>
<br>
I have debug output from one of the servers that I have studied at
length; it covers the 6 minutes or so before one of the servers was
marked zombie. This is from a production machine with a fair amount of
traffic, so the debug file is 9 MB. I'm not sure whether it would be
appropriate to post something that size to the mailing list, but I'd be
happy to. Do you want specific excerpts, or the whole thing?<br>
<br>
I've set the response_window to as high as 60 seconds in the
clients.conf file and I keep the zombie_period at 20 seconds. I've also
turned off the status_check feature as 1.1.7 and Cisco ACS do not appear
to support it.<br>
<br>
The clients.conf file says that if "a" response is not received by the
time the response_window is up (so 60 seconds), the server is marked
zombie. Based on what I see, I'm interpreting this as meaning that if
one response is not seen in 60 seconds, even if hundreds of other
responses were successfully sent and received during those 60 seconds,
then the server is marked zombie. At this point the zombie_period kicks
in, and the moment "any" successful response is received the server is
marked as completely alive. In my case the zombie_period is canceled
immediately (though sadly the log does not seem to show when the zombie
state ends).<br>
<br>
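To make that reading concrete, here is a rough Python sketch of the
behaviour as I interpret it. It is only a model of my understanding, not
code taken from the server; the class and method names are made up for
illustration, and the step from zombie to dead once the zombie_period
expires without any reply is my assumption.<br>

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional


@dataclass
class HomeServerModel:
    """Toy model of my reading of response_window / zombie_period."""
    response_window: float = 60.0   # seconds, as set on my servers
    zombie_period: float = 20.0     # seconds, as set on my servers
    state: str = "alive"            # "alive", "zombie" or "dead"
    zombie_since: Optional[float] = None
    pending: Dict[Hashable, float] = field(default_factory=dict)

    def send(self, req_id: Hashable, now: float) -> None:
        # Remember when each proxied request was sent.
        self.pending[req_id] = now

    def receive(self, req_id: Hashable, now: float) -> None:
        # Any reply at all revives a zombie server immediately.
        self.pending.pop(req_id, None)
        if self.state == "zombie":
            self.state = "alive"
            self.zombie_since = None

    def tick(self, now: float) -> str:
        # A single request unanswered for response_window is enough
        # to mark the server zombie, whatever else was answered.
        expired = [r for r, sent in self.pending.items()
                   if now - sent >= self.response_window]
        for r in expired:
            del self.pending[r]
        if expired and self.state == "alive":
            self.state = "zombie"
            self.zombie_since = now
        elif (self.state == "zombie"
              and now - self.zombie_since >= self.zombie_period):
            # Assumption: no reply during zombie_period means dead.
            self.state = "dead"
        return self.state
```

In this model, one lost request among hundreds of answered ones still
produces a zombie marking at the 60 second point, and the very next
reply clears it, which matches what I think my logs are showing.<br>
<br>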
Odd to me is that occasionally a primary server will mark the same
upstream server as Zombie multiple times over a handful of seconds, but
the other two primary servers rarely mark upstream servers as zombie at
around the same time.<br>
<br>
I cannot find in the debug, or in packet captures, where a response went
missing for a full minute, so I'm trying to find out what is happening. My upstream servers do not appear taxed or unresponsive. Perhaps there is some sort of malformed response I should be looking
for? I also get radutmp errors about a wrong NAS ID, though on brief
analysis they don't appear related. Any suggestions to help track this
down and eliminate the error messages would be greatly appreciated.<br>
<br>
-Benjamin<br>