Crash problem with FR 2.x.x when background databases delay

Fri Nov 29 07:33:34 CET 2013

This note is both a request for assistance in debugging a problem where I’m way out of my depth and a warning that there may be a crash problem lurking in the current 2.x.x code that won’t be visible until a server running it encounters an unusual delay from a back end system.  I am not including much detailed diagnostic output because the problem I’m describing doesn’t happen under single threaded debugging and the debugging traces and core dumps that accumulate in multi-threaded mode are gigantic.

We have been struggling with an attempted conversion of a Redhat compile-from-source Freeradius 2.2.1 deployment to a compile-from-source version of the latest 2.x.x code.  Our very heavily loaded Radius servers provide MAC address and EAP/PEAP/TTLS authentication through a set of very creaky multi-use back-end Mysql servers which can occasionally go non-responsive for many seconds at a time.  Ultimately I’d guess that we’ll switch to the FR 3 code but that change requires significant modification to a configuration that has grown up piecemeal over several years, so, in the meantime, we try to keep up with the FR 2 improvements.  We’ll also fix our underlying load issues with hardware changes over the next few months, but for now, we have to live with them, so we see the problem described here if we try to advance beyond 2.2.1 code.

Our 2.2.1 systems sometimes encounter database response delays that cause them to generate error messages like this
	Tue Nov 26 15:21:47 2013 : Error: Received conflicting packet from client Campus_WISM port 32770 - ID: 247 due to unfinished request 8533348.  Giving up on old request.
	Tue Nov 26 15:21:50 2013 : Info: WARNING: Module rlm_eap became unblocked for request 8533348

but the servers continue to run.

Under 2.2.3, the situation that produces these messages seems likely to generate a follow-on crash after a short period of time.  The resulting crash dump shows that the failure is an illegal memory reference in a thread that has a normal seeming stack of modcall routines and code that is referencing a linked structure  through a ‘request’ structure that has been damaged;   the particular code that fails is not consistent but seems always to be a reference through a ‘request’;  inspection of the broken request in the dump shows varying kinds of garbage;  to get to the failed code, lots of earlier routines in the call stack would have used the same ‘request’ pointer value to extract data and apparently succeeded in doing so.  It is as if some clean-up mechanism in another thread has destroyed the request while it is in use.  Other threads in the server seem perfectly happy, waiting for their turn to get a request or to receive data from a database or ntlm_auth process.  There seem always to be an error message like the one listed above in the log or log buffers.

The crash does not happen under light load on the server and backend systems so that there are no conflicting packets, nor does it happen in unthreaded debug mode apparently because request processing happens sequentially.  It does happen with radmin debugging turned on although those are so voluminous that it’s hard to extract useful information about what has happened.

It is interesting that, even with backend delays, the crashes do not happen if I change the code in main/event.c that looks like
	int received_request(rad_listen_t *listener,
	...
		/*
		 *	The new request is different from the old one,
		 *	but maybe the old is finished.  If so, delete
		 *	the old one.
		 */
		switch (request->child_state) {
			struct timeval when;

		default:
			/*
			 *	Special hacks for race conditions.
			 *	The reply is encoded, and therefore
			 *	likely sent.  We received a *new*
			 *	packet from the client, likely before
			 *	the next line or two of code which
			 *	updated the child state.  In this
			 *	case, just accept the new request.
			 */
			if ((request->reply->code != 0) &&
			    request->reply->data) {
				radlog(L_INFO, "WARNING: Allowing fast client %s port %d - ID: %d for recent request %u.",
				       client->shortname,
				       packet->src_port, packet->id,
				       request->number);
				remove_from_request_hash(request);
				request = NULL;
				break;
			}

			gettimeofday(&when, NULL);
			when.tv_sec -= 1;

			/*
			 *	If the cached request was received
			 *	within the last second, then we
			 *	discard the NEW request instead of the
			 *	old one.  This will happen ONLY when
			 *	the client is severely broken, and is
			 *	sending conflicting packets very
			 *	quickly.
			 */
			if (timercmp(&when, &request->received, <)) {
				radlog(L_ERR, "Discarding conflicting packet from "
				       "client %s port %d - ID: %d due to recent request %u.",
				       client->shortname,
				       packet->src_port, packet->id,
				       request->number);
				return 0;
			}

			received_conflicting_request(request, client);
			request = NULL;
			break;

to allow a 300 second leeway instead of the 1 second one so that in effect, received_conflicting_request is never called.  

I don’t think this particular code or most of its supporting routines have changed much between 2.2.1 and the current 2.x.x.  My limited understanding of what’s happening here is that the server has matched a new incoming request with one that’s already in process and calls received_conflicting_request to try to terminate the on-going processing;  this can leave an old request that’s waiting for some external event like a database request in process in an active thread but marked to stop when it comes to a sync point.  After doing this marking, the code above then goes on to queue the newly arrived request to be worked on.

I’m way out of my depth about how these two variants of the same request will be handled beyond this point in the server, but it appears that something has changed about it between the 2.2.1 code that works and the current 2.x.x code that crashes.  I wonder if some table look-up or queuing mechanism has changed so that two threads processing these requests get confused about each others data?

I will gladly provide more information under direction from the experts.  I don’t want to flood the list with endless streams of useless debugging.

Jim Madden
UCSD Networking
jmadden at ucsd.edu