Good day,<br>

I'm trying to figure out why my servers continue to be marked zombie, 

even though they continue to handle traffic. There appears to be no 

impact, just seemingly erroneous - or at least unexplained - log 

entries.<br>

<br>

I have three 2.1.8 servers that feed accounting to a 4th server (via 

copy_acct_to_homeserver), running 1.1.7.  The primary servers also send

 some auth (via proxy) and lots of acct (some via proxy, but also via 

copy_acct_to_homeserver) to a pair of Cisco ACS servers.  The radius.log

 file for the primary servers shows they are marking the 4th and Cisco 

(upstream) servers as zombie quite regularly (but not simultaneously); 

they thankfully never get marked dead.  All of these servers are 

attached to the same Ethernet switch with a slight detour through a 

router that does VLAN routing between them. The Cisco servers also proxy

 to various other servers outside my network.<br>

<br>

I have a debug output from one of the servers that I have studied at 

length; I recorded the debug for 6 

minutes or so before one of the servers was marked zombie. This is from a

 production machine with a fair amount of traffic, so the debug file is

 9 MB.  I'm not sure if it would be appropriate to post it to the 

mailing list. I'd be happy to post it. Do you want specific excerpts, or

 the whole thing?<br>

<br>

I've set the response_window to as high as 60 seconds in the 

clients.conf file and I keep the zombie_period at 20 seconds. I've also 

turned off the status_check feature as 1.1.7 and Cisco ACS do not appear

 to support it.<br>

<br>

The clients.conf file says that if "a" response is not received by the 

time the response_window is up (so 60 seconds), the server is marked 

zombie.  Based on what I see, I'm interpreting this as meaning that if 

one response is not seen in 60 seconds, even if hundreds of other 

responses were successfully sent and received during those 60 seconds, 

then the server is marked zombie.  At this point the zombie_period kicks

 in, and the moment "any" successful response is received the server is 

marked as completely alive. In my case the zombie_period is canceled 

immediately (though sadly the log does not seem to show when the zombie state 

ends.)<br>
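
To make that reading concrete, here is a toy Python model of the behaviour as I understand it. It is only a sketch of my interpretation, not how the server is actually implemented; the class name, method names, and the "dead" transition after zombie_period are my own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HomeServerModel:
    """Toy model of my reading of the docs; NOT the server's real logic."""
    response_window: float = 60.0   # seconds a request may go unanswered
    zombie_period: float = 20.0     # seconds a zombie may stay silent
    state: str = "alive"
    zombie_since: Optional[float] = None
    pending: Dict[int, float] = field(default_factory=dict)  # id -> send time

    def send(self, req_id: int, now: float) -> None:
        """Record a proxied request sent at time `now` (seconds)."""
        self.pending[req_id] = now

    def receive(self, req_id: int, now: float) -> None:
        """Record a reply; any reply at all revives a zombie immediately."""
        self.pending.pop(req_id, None)
        if self.state == "zombie":
            self.state = "alive"
            self.zombie_since = None

    def tick(self, now: float) -> str:
        """Re-evaluate the state at time `now` and return it."""
        expired = [r for r, sent in self.pending.items()
                   if now - sent >= self.response_window]
        if self.state == "alive" and expired:
            # One overdue request is enough, whatever else was answered.
            self.state = "zombie"
            self.zombie_since = now
        elif (self.state == "zombie"
              and now - self.zombie_since >= self.zombie_period):
            self.state = "dead"
        # Forget overdue requests so each one is only counted once.
        for r in expired:
            del self.pending[r]
        return self.state
```

In this model a single request sent at t=0 and never answered flips the server to zombie at t=60 even if a hundred other requests were answered in between, and the very next reply flips it back to alive. That matches the pattern I think I am seeing in the logs.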

<br>

What seems odd to me is that occasionally a primary server will mark the same 

upstream server as zombie multiple times over a handful of seconds, but 

the other two primary servers rarely mark upstream servers zombie near the

 same time.<br>
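
To compare the three primary servers, the zombie entries could be pulled out of each radius.log and grouped into bursts with something like the sketch below. The log line wording and timestamp layout used here are assumptions on my part (adjust the regex and format string to whatever the log really contains), and the function names are made up.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMED line shape, e.g.:
# "Thu Mar 12 10:12:01 2009 : Proxy: Marking home server 10.0.0.4 port 1813 as zombie ..."
ZOMBIE_RE = re.compile(
    r"^(?P<ts>.+?) : .*home server (?P<host>\S+) port (?P<port>\d+) as zombie"
)
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumed timestamp layout


def zombie_events(lines):
    """Return (timestamp, 'host:port') for every line that marks a zombie."""
    events = []
    for line in lines:
        m = ZOMBIE_RE.search(line)
        if m:
            when = datetime.strptime(m.group("ts").strip(), TS_FORMAT)
            events.append((when, f"{m.group('host')}:{m.group('port')}"))
    return events


def bursts(events, gap_seconds=10):
    """Group each server's events into runs separated by more than gap_seconds."""
    by_server = defaultdict(list)
    for when, server in sorted(events):
        groups = by_server[server]
        if groups and (when - groups[-1][-1]).total_seconds() <= gap_seconds:
            groups[-1].append(when)   # close to the previous event: same burst
        else:
            groups.append([when])     # start a new burst
    return dict(by_server)
```

Running `bursts(zombie_events(open("radius.log")))` on each primary server and lining up the burst start times should show whether the three servers ever flag the same upstream server within the same few seconds.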

<br>

I cannot find in the debug, or in packet captures, where a response went

 missing for a full minute, so I'm trying to find out what is happening. My upstream servers do not appear taxed or unresponsive. Perhaps there is some sort of malformed response I should be looking 

for?  I also get radutmp errors about a wrong NAS ID, though on brief 

analysis it doesn't appear related. Any suggestions to help track this 

down and eliminate the error messages are greatly appreciated.<br>

<br>

-Benjamin<br>