Pre-release of Version 2.1.8

Bjørn Mork bjorn at mork.no
Tue Dec 8 14:50:14 CET 2009


Alan DeKok <aland at deployingradius.com> writes:
> Bjørn Mork wrote:
>> Yes, now it continues to answer both authentication and accounting
>> requests, but it still stops proxying after a while (where "a while"
>> might be something like 20+ hours and 1+ million auth requests - I have
>> no indication that these values are fixed).  
>
>   Look for the message:
>
> Failed creating new proxy socket: server is too busy and home servers
> appear to be down

Yes, I got one of those:

Tue Dec  8 08:49:22 2009 : Proxy: Marking home server 192.168.8.216 port 1812 as dead.
Tue Dec  8 08:49:22 2009 : Error: Failed creating new proxy socket: server is too busy and home servers appear to be down

>   It *should* continue to proxy after that, but it *won't* create new
> outgoing sockets.  And it *won't* proxy any more requests until the
> current requests have timed out.

So it's possible that a few failing home servers end up tying up all the
sockets available for proxying?  Or am I misunderstanding this?  And for
how long?  The server was not responding to any requests that needed
proxying between 08:43:27 and 09:35:46.  That's a very long time to sit
waiting for resources to become available...

What's the actual limit here?  The number of open file descriptors left
after the sql module and others have taken their share?  Then increasing
that limit and trying to tune the timeouts might help, I guess.
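
As a first check I'll look at the descriptor limit the running daemon
actually got, and bump it at startup.  Untested sketch, assuming the
process is called radiusd on this box:

    # descriptor limit the running radiusd was started with
    grep 'open files' /proc/$(pidof radiusd)/limits

    # raise it before starting the daemon, e.g. in the init script
    ulimit -n 8192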

But I still believe that sharing these resources in a way which lets a
few bad home servers steal them all is wrong.  It should be possible to
isolate two independent realms from each other, even if they use the
same RADIUS proxy.

>   If the upstream servers really are that bad, I suggest configuring
> local detail files, as in raddb/sites-available/robust-proxy-accounting.
>    This should make the server log packets locally, and *not* track them
> in memory (which appears to be the problem).

Thanks, I'll look into that.  I still have to live with failing proxied
authentication as well, but I guess moving accounting out of the
in-memory tracking might take some load off.
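
If I read raddb/sites-available/robust-proxy-accounting right, enabling
it is the usual symlink into sites-enabled plus whatever the comments in
that file ask for.  Roughly, and untested:

    cd /etc/freeradius        # or wherever raddb lives on this system
    ln -s ../sites-available/robust-proxy-accounting sites-enabled/
    # restart radiusd and keep an eye on the detail files

I'll read the comments properly before touching production.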

>> There are a number of servers marked "alive", but these are all servers
>> which have been revived after the fixed period.  When used, they will be
>> marked dead/zombie again.
>
>   Configure "status_check".  Really.  Get the upstream servers to permit
> status-checks for a "test" user.  It will make your network *much* more
> robust.

On my TODO list.  Thanks.
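
For the TODO entry: I assume the result looks roughly like the below in
proxy.conf.  The directive names are from the stock 2.x comments; the
address, secret and test account are placeholders I would have to agree
on with each upstream:

    home_server example_upstream {
            type = auth
            ipaddr = 192.0.2.10            # placeholder
            port = 1812
            secret = testing123            # placeholder
            status_check = request         # probe with Access-Requests
            username = "test"              # account the upstream answers for
            password = "not-the-real-one"
            check_interval = 30
            num_answers_to_alive = 3
    }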

>   And why are the upstream servers dying so consistently?  You're really
> in a corner case where you're testing FreeRADIUS in situations where the
> network Just Doesn't Work.  The suggestions above should help work
> around most of those issues.

Some of the upstream servers are just not very well managed, if managed
at all.  I wish I could ignore them, but I can't.

>> But I will test that now, starting with the stable branch from
>> git.freeradius.org, commit d7b4f003477644978f3fefa694305dce9b5dc8bf,
>> which was the last point where things seemed to work
>
>   If that works, we could do a "git bisect" to find the issue.  There
> are only 26 commits since then, and many of those don't have code
> changes.  It shouldn't be too hard to track down the offending commit.

That would require being able to trigger the problem in a test
environment, though, and I don't think I can.
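
Not that the bisect mechanics are the problem.  That part would just be
something like

    git bisect start
    git bisect bad                  # the current tip misbehaves
    git bisect good d7b4f003477644978f3fefa694305dce9b5dc8bf
    # rebuild, put it under production-like load, then mark the result:
    git bisect good                 # or: git bisect bad

repeated until git points at the offending commit.  It's reproducing the
load outside production that I can't manage.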


Bjørn



