RADIUS Monitoring tool

John Douglass john.douglass at oit.gatech.edu
Mon Mar 2 17:09:13 CET 2015


On 02/25/2015 08:28 AM, Clement Ogedengbe wrote:
> On two occasions in the last 2 weeks, our RADIUS server suddenly started to reject ALL users, even though we have set up a failover system. Unfortunately, the failover system did not kick in because the RADIUS service was still running; it was just rejecting all users for some strange reason.
>
> Does anyone know of any monitoring script/tool that can be used to test that the RADIUS server is authenticating properly, and which can send an alert by email or text in the event that the server rejects a valid user's credentials a number of times?
>
> Best Regards
>
> Clement Ogedengbe
>
Clement,

At Georgia Tech we are currently refining our Radius monitoring
services. We are finding that using eapol_test alone is not enough when
debugging the variety of failure scenarios that can occur. I will be
writing some updated PHP monitoring objects; if enough people are
interested in how we will be monitoring this service for its various
failure scenarios, I'm glad to share them. We've had a lot of people
looking into these issues and working through our pain points.

Just to give you some background, we currently have deployed 4 hardware
radius servers and 1 VM radius server, and we are using 1 "shared" AD
server and 2 "dedicated" VM AD servers (which only our radius servers
communicate with, via samba).

I'm not here to argue over the use of EAP-PEAP-MSChapV2 vs EAP-TLS; we
all live within our environments and make them the best they can be.

If you are using the configuration of Controller -> Radius ->
Samba/ntlm_auth -> AD, here are a number of issues we have come across
that you need to consider:

1) The samba joins to AD are somewhat brittle.

There have been instances when our samba service (winbind) has
completely lost its privileges to AD. This has happened under numerous
versions and at random times. I'm sure it has something to do with the
renegotiation of keys between the joined samba machine and the AD
servers.

When this "permission" issue occurs, radius is running peachy keen but
the ntlm_auth calls return failures. I do not believe that this
particular error message ends up in the logs, but it does show up on the
command line when running something like:

     ntlm_auth --username=someuser --request-nt-key

which will end with an error message like:

     NT_STATUS_ACCESS_DENIED: Access denied (0xc0000022)
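Since that status line only appears when ntlm_auth is run by hand, a monitoring script can run it itself and pull the status out. Below is a minimal Python sketch, assuming the output format shown above (`NT_STATUS_NAME: Description (0xHEX)`) and that your ntlm_auth accepts `--password`; `parse_nt_status` and `run_ntlm_auth` are hypothetical helper names, not part of Samba.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches lines like "NT_STATUS_ACCESS_DENIED: Access denied (0xc0000022)".
# Assumption: the format matches the example above; adjust if yours differs.
NT_STATUS_RE = re.compile(r"(NT_STATUS_[A-Z0-9_]+)[^(\n]*\((0x[0-9a-fA-F]+)\)")


def parse_nt_status(output: str) -> Optional[Tuple[str, int]]:
    """Return (status name, numeric code) of the last NT_STATUS line, or None."""
    matches = NT_STATUS_RE.findall(output)
    if not matches:
        return None
    name, code = matches[-1]  # the final status is the one that matters
    return name, int(code, 16)


def run_ntlm_auth(username: str, password: str, timeout: float = 15.0) -> str:
    """Run ntlm_auth once for a test user and return stdout + stderr.

    Note: --password on the command line is visible in the process list,
    so use a dedicated test account.
    """
    proc = subprocess.run(
        ["ntlm_auth", f"--username={username}", f"--password={password}",
         "--request-nt-key"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout + proc.stderr
```

Calling `parse_nt_status(run_ntlm_auth("someuser", "..."))` lets the monitor report `NT_STATUS_ACCESS_DENIED` (0xc0000022) explicitly instead of just "failed".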

An eapol_test will simply tell you that the authentication failed (it's
a yes or no answer) but not why without going into either debug mode or
running the ntlm_auth command by hand alongside it and capturing that
output.
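To keep that yes/no answer scriptable, eapol_test can be driven from the same monitor. A rough Python sketch, assuming EAP-PEAP with MSCHAPv2 as the inner method and a known-good test account; it relies on eapol_test exiting with status 0 on success, and `build_peap_config` and `eapol_test_ok` are hypothetical names.

```python
import os
import subprocess
import tempfile

# wpa_supplicant-style network block for EAP-PEAP/MSCHAPv2.
# Assumption: PEAP + MSCHAPv2 is what your clients use. Quotes or
# backslashes inside the credentials are not escaped here.
PEAP_TEMPLATE = """network={{
    key_mgmt=WPA-EAP
    eap=PEAP
    identity="{identity}"
    password="{password}"
    phase2="auth=MSCHAPV2"
}}
"""


def build_peap_config(identity: str, password: str) -> str:
    return PEAP_TEMPLATE.format(identity=identity, password=password)


def eapol_test_ok(server: str, secret: str, identity: str, password: str,
                  port: int = 1812, timeout: int = 10) -> bool:
    """Return True if eapol_test authenticates this user (exit status 0)."""
    fd, path = tempfile.mkstemp(suffix=".conf")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(build_peap_config(identity, password))
        try:
            proc = subprocess.run(
                ["eapol_test", "-c", path, "-a", server, "-p", str(port),
                 "-s", secret, "-t", str(timeout)],
                capture_output=True, text=True, timeout=timeout + 5,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hung probe as a failure
        return proc.returncode == 0
    finally:
        os.remove(path)  # the file holds a password, so always clean up
```

When this returns False, running ntlm_auth for the same user at the same moment captures the "why" that eapol_test alone does not give.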

If anyone has some experience in the above failures of Samba joins to
AD, I'd love to hear it.

2) AD servers get overloaded when we are talking about large numbers of
users. We are still learning what our load limits are. Our device base
is anywhere between 20k and 25k+ devices, with two dedicated AD VMs
(scaled large) handling radius requests from a controller with about 250
APs (we had to scale down due to a radius flaw in the controller
software, for which we have been testing a fix for Cisco).

You absolutely need to use the MaxConcurrentApi (Netlogon) setting if
you are handling any sort of large volume of user authentications.

    http://support.microsoft.com/kb/2688798

3) The number of (default) connections from winbind to AD is limited.
You need to use modern versions of samba (look at EnterpriseSamba.org
for modern packages) and a properly configured smb.conf. I have found
that these settings work for us. Are they optimal/ideal? I have no idea
:) But they seem to work well-ish. If you are serious about using
samba/winbind you need to be using the latest 4.1.17, which fixes the
following issues:

a) A security flaw in smbd (even though we don't run smbd, just nmbd
and winbind, it's better to have the fix!)
b) Connections to AD dynamically grow and shrink as the need arises
(versions prior to 4.1.16 did not do this well, and, I think, not at all
before 4.1.12; earlier versions just increased the number of connections
during a spike and left them hanging around, eventually causing
problems).
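A monitor can also flag servers still running an older Samba. A small Python sketch, assuming `smbd -V` prints something like `Version 4.1.17` (distribution suffixes are ignored); the 4.1.17 floor is simply the release named above at the time of writing, not a permanent minimum, and the helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple

MINIMUM = (4, 1, 17)  # release recommended in the text above


def parse_samba_version(text: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from output such as 'Version 4.1.17'."""
    m = re.search(r"Version\s+(\d+)\.(\d+)\.(\d+)", text)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), int(m.group(3))


def samba_is_current(text: str,
                     minimum: Tuple[int, int, int] = MINIMUM) -> bool:
    """True if the parsed version is at least `minimum` (tuple comparison)."""
    version = parse_samba_version(text)
    return version is not None and version >= minimum


def installed_version_text() -> str:
    # Assumption: smbd is on PATH and -V prints "Version X.Y.Z[-suffix]".
    return subprocess.run(["smbd", "-V"], capture_output=True, text=True).stdout
```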

My settings (probably allll sorts of wrong as I am far from a samba
expert) are:

     winbind max clients = 16192
     winbind max domain connections = 128
     winbind request timeout = 30
     winbind reconnect delay = 5
     log level = 4
     syslog = 6

(I log my samba stuff so I can look deeper into authentication issues).
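Settings like these can drift between servers, so it may be worth checking them alongside the authentication probes. A minimal Python sketch using the standard library's configparser, assuming an INI-style smb.conf with these parameters in [global]. Samba ignores case and spaces in parameter names while this comparison does not, so feeding it the output of `testparm -s` (canonical names) is safer than the raw file, keeping in mind that testparm omits parameters left at their defaults. `EXPECTED` just mirrors the values above.

```python
import configparser
from typing import Dict

# The winbind values listed above; these are not tuned recommendations.
EXPECTED: Dict[str, str] = {
    "winbind max clients": "16192",
    "winbind max domain connections": "128",
    "winbind request timeout": "30",
    "winbind reconnect delay": "5",
}


def global_section(smb_conf_text: str) -> Dict[str, str]:
    # interpolation=None: smb.conf uses %m, %U etc., which configparser
    # would otherwise try to expand. strict=False tolerates repeated keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(smb_conf_text)
    return dict(parser["global"]) if parser.has_section("global") else {}


def missing_or_different(smb_conf_text: str) -> Dict[str, str]:
    """Map each mismatched parameter to what was found ('<missing>' if absent)."""
    found = global_section(smb_conf_text)
    return {
        key: found.get(key, "<missing>")
        for key, want in EXPECTED.items()
        if found.get(key) != want
    }
```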

Very few of the issues we have found have been with FreeRADIUS. Our only
complaint about FreeRADIUS is that it is sometimes hard to correlate the
radius error messages with the authentication requests, so that we can see
"this error X caused this authentication on line Y to fail".

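One partial workaround for that correlation problem is to bucket log lines by request number. A rough Python sketch, assuming each per-request line carries a `(N)` request number the way FreeRADIUS 3.x debug output does; request numbers are reused over time, so this is only reliable over a short window, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: per-request lines start with a "(N) " request number,
# e.g. "(7) mschap: ERROR: ...". Adjust the pattern to your log format.
REQUEST_ID_RE = re.compile(r"\((\d+)\)\s")


def group_by_request(lines: List[str]) -> Dict[int, List[str]]:
    """Bucket log lines by request number; lines without one are skipped."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        m = REQUEST_ID_RE.search(line)
        if m:
            grouped[int(m.group(1))].append(line.rstrip("\n"))
    return dict(grouped)


def requests_with_errors(lines: List[str]) -> Dict[int, List[str]]:
    """Keep only requests that logged an ERROR or were rejected."""
    return {
        rid: msgs
        for rid, msgs in group_by_request(lines).items()
        if any("ERROR" in m or "Access-Reject" in m for m in msgs)
    }
```

Each entry then pairs a failing request with every message logged for it, which is roughly the "error X caused authentication Y to fail" view described above.
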
So long story short:

1) Use eapol_test as a base, with a known "good" username/password pair.
2) If eapol_test fails, use the results of the ntlm_auth command with
that same username/password pair as an additional test and for further
information on what might be going wrong.
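The two steps above, plus the "alert after a number of failures" part of the original question, can be sketched as a small Python class. This is an illustration under assumptions, not the Georgia Tech monitor: the two probes and the alert function are passed in (for example, something that runs eapol_test for the test user, something that runs ntlm_auth for the same user and returns its output, and something that sends email or a text), and the threshold of 3 consecutive failures is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    detail: str = ""


class RadiusMonitor:
    """Two-stage check that alerts after N consecutive failures.

    Stage 1 is the end-to-end probe (e.g. eapol_test with a known-good user);
    stage 2 (e.g. ntlm_auth for the same user) only runs when stage 1 fails,
    to capture why it failed.
    """

    def __init__(self, eap_probe: Callable[[], bool],
                 ntlm_probe: Callable[[], str],
                 alert: Callable[[str], None],
                 threshold: int = 3) -> None:
        self.eap_probe = eap_probe
        self.ntlm_probe = ntlm_probe
        self.alert = alert
        self.threshold = threshold
        self.failures = 0
        self.alerted = False

    def check_once(self) -> ProbeResult:
        if self.eap_probe():
            self.failures = 0
            self.alerted = False  # re-arm alerting after recovery
            return ProbeResult(ok=True)
        self.failures += 1
        detail = self.ntlm_probe()  # extra diagnosis only on failure
        if self.failures >= self.threshold and not self.alerted:
            self.alert(f"RADIUS rejected the test user {self.failures} times "
                       f"in a row; ntlm_auth says: {detail}")
            self.alerted = True  # one alert per outage
        return ProbeResult(ok=False, detail=detail)
```

Because the failure count lives in memory, `check_once()` is meant to be called from a long-running loop; a cron-driven version would need to persist the count between runs.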

- JohnD



