Crash problem with FR 2.x.x when background databases delay

Alan DeKok aland at deployingradius.com
Sat Nov 30 15:57:41 CET 2013


Jim Madden wrote:
> This note is both a request for assistance in debugging a problem where I’m way out of my depth and a warning that there may be a crash problem lurking in the current 2.x.x code that won’t be visible until a server running it encounters an unusual delay from a back end system.  I am not including much detailed diagnostic output because the problem I’m describing doesn’t happen under single threaded debugging and the debugging traces and core dumps that accumulate in multi-threaded mode are gigantic.

  Yes, that's an issue.  I've just pushed a change to the v2.x.x branch.
It adds an assertion when the request is freed.  The assertion checks
that there's no child thread processing the request.  If there is, the
assertion fails and the server aborts.

  It's not a solution, but it may help to debug the problem.
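
  As a rough illustration of that kind of check, here is a minimal
sketch in C.  It is not the code that was pushed to the v2.x.x branch.
The names demo_request_t, in_child_thread, demo_request_can_free and
demo_request_free are hypothetical, made up for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the server's request structure. */
typedef struct demo_request {
	bool in_child_thread;	/* true while a worker thread owns the request */
} demo_request_t;

/* Allocate a zeroed request; no worker owns it yet. */
demo_request_t *demo_request_alloc(void)
{
	return calloc(1, sizeof(demo_request_t));
}

/* A request may be freed only when no worker is processing it. */
bool demo_request_can_free(const demo_request_t *request)
{
	return (request != NULL) && !request->in_child_thread;
}

/*
 * Free the request.  If a worker still owns it, the assertion fails
 * and the process aborts, which turns silent memory corruption into
 * a crash at the exact point of the bad free.
 */
void demo_request_free(demo_request_t **request_p)
{
	assert(demo_request_can_free(*request_p));

	free(*request_p);
	*request_p = NULL;
}
```

  The point of aborting at free time is that the resulting core dump
shows who freed the request, rather than showing a later crash in
whichever thread touched the freed memory.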

> The crash does not happen under light load on the server and backend systems so that there are no conflicting packets, nor does it happen in unthreaded debug mode, apparently because request processing happens sequentially.  It does happen with radmin debugging turned on, although the resulting traces are so voluminous that it’s hard to extract useful information about what has happened.

  And it's hard to debug.

> It is interesting that, even with backend delays, the crashes do not happen if I change the code in main/event.c that looks like
> 	int received_request(rad_listen_t *listener,
...
> to allow a 300 second leeway instead of the 1 second one so that in effect, received_conflicting_request is never called.  

  That's just papering over the problem, unfortunately.  The issue is
elsewhere.

  What seems to be happening is that the main thread thinks the request
is done, when it's not.  So the request gets cleaned up, even if a child
thread is still processing it.

> I don’t think this particular code or most of its supporting routines have changed much between 2.2.1 and the current 2.x.x.

  The code changed a little.  The change was to fix a bug where the
server would proxy a request, but still mark it as being managed by a
child thread.  The change marked the request as *not* being handled by a
child thread in more situations.

  And... it looks like it's doing that marking too often.  <sigh>
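
  The failure mode described here can be modeled with two flags: what
a worker is really doing, and what the bookkeeping says.  The sketch
below is a single-threaded toy model, not the server's code.  Every
name in it (demo_req_t, worker_running, marked_for_child, and the
demo_* functions) is hypothetical, chosen only for illustration.

```c
#include <stdbool.h>

typedef struct demo_req {
	bool worker_running;	/* ground truth: a child thread is inside the request */
	bool marked_for_child;	/* bookkeeping flag that the main thread consults */
} demo_req_t;

/* A worker picks up the request. */
static void demo_worker_start(demo_req_t *r)
{
	r->worker_running = true;
	r->marked_for_child = true;
}

/* The worker returns; the request really is free of child threads. */
static void demo_worker_finish(demo_req_t *r)
{
	r->worker_running = false;
	r->marked_for_child = false;
}

/* Over-eager marking: clears the flag without checking the worker. */
static void demo_mark_proxied_eager(demo_req_t *r)
{
	r->marked_for_child = false;
}

/* Careful marking: clears the flag only once the worker has returned. */
static void demo_mark_proxied_careful(demo_req_t *r)
{
	if (!r->worker_running) r->marked_for_child = false;
}

/* The main thread's view: the request looks done if it is unmarked. */
static bool demo_looks_done(const demo_req_t *r)
{
	return !r->marked_for_child;
}

/* True when the main thread would clean up a request still in use. */
static bool demo_would_free_live_request(const demo_req_t *r)
{
	return demo_looks_done(r) && r->worker_running;
}
```

  In this model, calling demo_mark_proxied_eager() while a worker is
still running makes demo_would_free_live_request() return true, which
is the "main thread thinks the request is done, when it's not" case.
The careful variant leaves the mark in place until the worker is done.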

  I've pushed some more assertions && code tweaks.  Please try the git
v2.x.x branch to see if it helps.

  Alan DeKok.

