Post-crash investigations

Tue May 15 12:07:06 CEST 2012

Hi,

I set up a two-node Pacemaker cluster of freeradius servers on
CentOS 5.8 (freeradius 1.1.3).
We use it for 802.1X: our switches (HP ProCurve) forward EAP
authentication requests to it and it authenticates our users/hosts
against an Active Directory domain.
It ran smoothly for a month, then stopped working this morning. We
switched back to our old server, but I want to know what happened
before putting the cluster back in production.

I'm currently trying to find out what happened but information is sparse:
- From the cluster perspective, everything is fine. The daemon is
running and no failover event occurred.
- From samba, same thing.
- In radius.log: nothing. There are client SSL certificate errors, but
they have been there the whole time.
- In the auth-detail logfile: I can see all received requests, before
_and_ after the problem. However, it doesn't contain the replies :-/
- In the detail logfile: nothing after 7:56, the time of the outage.
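
For reference, here is roughly how the last record time can be pulled
out of a detail file. This is only a minimal sketch: it assumes each
record starts with an unindented timestamp line (e.g. "Tue May 15
07:56:01 2012") followed by tab-indented attribute lines, and the
sample data and function names are made up for illustration.

```python
from datetime import datetime

# Assumed header format of a detail record, e.g. "Tue May 15 07:56:01 2012".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def record_times(lines):
    """Yield the timestamp of every record found in detail-style lines."""
    for line in lines:
        # Skip blank lines and indented attribute lines; only the
        # unindented header line of a record carries the timestamp.
        if not line.strip() or line[0] in " \t":
            continue
        try:
            yield datetime.strptime(line.strip(), TIMESTAMP_FORMAT)
        except ValueError:
            continue  # unindented but not a timestamp; ignore it


def last_record_time(lines):
    """Return the most recent record timestamp, or None if there is none."""
    times = list(record_times(lines))
    return max(times) if times else None


# Made-up sample in the assumed format (not real log data).
sample = """Tue May 15 07:55:58 2012
\tUser-Name = "host/example"
\tNAS-IP-Address = 192.0.2.10

Tue May 15 07:56:01 2012
\tUser-Name = "alice"
"""

print(last_record_time(sample.splitlines()))
```

Running the same function over the auth-detail file and comparing the
two results shows how far apart the two logs stop.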

So, I need pointers on two different issues:
- I don't have enough information in the logs. I realize that the
recommended solution is to run the server in debug mode, but I'm not
sure I can store a month (or more) of such verbose logging. Is there
any way to get more information without running in debug mode?
- I have no check that covers the whole authentication process, either
at the cluster level or at the monitoring level (nagios). Tools like
NTradPing and check_radius_adv are of little help here because they
can only check static accounts declared in the radius configuration.
They do not exercise the whole chain (Windows client > switch > radius
server > ntlm_auth > AD). I tried replaying a successful
authentication but it failed (Access-Denied). Is there any way to
perform a 'real' end-to-end check?

Sorry for the long post, and thanks for any ideas or pointers you can
give me.

Regards,