Marking proxy servers as zombie - odd behaviour

John Horne john.horne at plymouth.ac.uk
Thu Jun 17 12:26:37 CEST 2010


Hello,

We have 3 backend servers which are used in client-balance mode from
our local proxy server. We are running FR 2.1.10 (from git), but also
saw the following behaviour when we were running 2.1.7 and 2.1.9 for a
short time. Our logs are showing that FR marks the backend servers as
zombie every few minutes. Not all of them at once, just one or maybe
two, and then one will be marked as alive and another marked as zombie,
and so on. As far as we can see there is no general network problem, and
neither the radius nor backend servers are under any particular load.

Running tcpdump for a short time seemed to show that something odd was
going on (I have included the actual date/time of the packets, but
removed the contents. I think the packet header lines should show what I
mean):

=============================================================
2010-06-17 10:47:35 IP 141.163.66.101.1812 > 141.163.195.250.1814:
RADIUS, Access Accept (2), id: 0xdd length: 278

2010-06-17 10:47:46 IP 141.163.66.101.1812 > 141.163.195.250.1814:
RADIUS, Access Reject (3), id: 0xb2 length: 20
=============================================================


At this time, our radius log showed:

=============================================================
Thu Jun 17 10:47:35 2010 : Auth: Login OK: [xxxxxx] (from client
GRN-DVB-WISM-1 port 29 cli 7C-6D-62-63-03-8D)
Thu Jun 17 10:47:46 2010 : Proxy: Marking home server 141.163.66.101
port 1812 as zombie (it looks like it is dead).
Thu Jun 17 10:47:46 2010 : Proxy: Received response to status check
10959 (1 in current sequence)
=============================================================


So what is being seen is that backend server 141.163.66.101 has sent an
access accept packet (to the local proxy server 141.163.195.250) and the
log shows a user as having authenticated. About 10 seconds later, the
server is marked as zombie, but tcpdump shows that a packet (access
reject - we have status-server set up with an invalid userid, so the
reject is correct) is received from that server.
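
For reference, the gap between the two events can be checked directly
from the timestamps quoted above. This is only a rough sketch: the
function name is made up for illustration, and the format strings assume
the tcpdump and radius log timestamp layouts shown in this message (and
an English locale for the day and month names).

```python
from datetime import datetime

# Assumed layouts, taken from the excerpts in this message.
TCPDUMP_FORMAT = "%Y-%m-%d %H:%M:%S"        # e.g. 2010-06-17 10:47:35
RADIUS_LOG_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. Thu Jun 17 10:47:46 2010


def seconds_between(tcpdump_ts, radius_log_ts):
    """Return seconds from a tcpdump timestamp to a radius log timestamp."""
    start = datetime.strptime(tcpdump_ts, TCPDUMP_FORMAT)
    end = datetime.strptime(radius_log_ts, RADIUS_LOG_FORMAT)
    return (end - start).total_seconds()


# Access Accept seen by tcpdump vs. the "Marking home server ... as zombie" line.
gap = seconds_between("2010-06-17 10:47:35", "Thu Jun 17 10:47:46 2010")
print(gap)
```

The same helper can be pointed at other accept/zombie pairs from the
logs to see whether the gap is always roughly the same.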

So I think the question is why did FR think the server was zombie when
it had received an access-accept just a few seconds before? Why does it
think it looks like it is dead? The tcpdump shows no other packets being
sent to/from the server for FR to think that. And why is it that when FR
thinks the backend server is dead it receives a status check reply at
the same time? Did it send the status check query (and if so why didn't
tcpdump show it?) but for whatever reason immediately decide that the
server had not replied?

Maybe I am wrong but I would not have expected FR to even consider the
backend server as zombie/dead given that it had received a packet from
it 10 seconds before.



Thanks,

John.

-- 
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287    Fax: +44 (0)1752 587001



