Crash problem with FR 2.x.x when background databases delay
jmadden at ucsd.edu
Wed Dec 4 16:42:18 CET 2013
Since I got back from holiday I have been running the patched code without a crash despite several occurrences of the duplicate request that I believe would have crashed the previous version:
Tue Dec 3 07:38:43 2013 : Error: Discarding duplicate request from client Campus_WISM port 32770 - ID: 243 due to unfinished request 1381675
Tue Dec 3 07:38:53 2013 : Error: Discarding duplicate request from client Campus_WISM port 32770 - ID: 243 due to unfinished request 1381675
Tue Dec 3 07:39:01 2013 : Error: Received conflicting packet from client Campus_WISM port 32770 - ID: 243 due to unfinished request 1381675. Giving up on old request.
Tue Dec 3 07:39:22 2013 : Info: WARNING: Child is hung for request 1381675 in component authenticate module peap.
Tue Dec 3 07:39:39 2013 : Info: WARNING: Module rlm_eap became unblocked for request 1381675
I’m going to turn on some debugging and try to collect some more information during the day today.
On Nov 30, 2013, at 6:57, Alan DeKok <aland at deployingradius.com> wrote:
> Jim Madden wrote:
>> This note is both a request for assistance in debugging a problem where I’m way out of my depth and a warning that there may be a crash problem lurking in the current 2.x.x code that won’t be visible until a server running it encounters an unusual delay from a back end system. I am not including much detailed diagnostic output because the problem I’m describing doesn’t happen under single threaded debugging and the debugging traces and core dumps that accumulate in multi-threaded mode are gigantic.
> Yes, that's an issue. I've just pushed a change to the v2.x.x branch.
> It adds an assertion when the request is freed. The assertion checks
> that there's no child thread processing the request. If there is, it dies.
> It's not a solution, but it may help to debug the problem.
>> The crash does not happen under light load on the server and backend systems so that there are no conflicting packets, nor does it happen in unthreaded debug mode apparently because request processing happens sequentially. It does happen with radmin debugging turned on although those are so voluminous that it’s hard to extract useful information about what has happened.
> And it's hard to debug.
>> It is interesting that, even with backend delays, the crashes do not happen if I change the code in main/event.c that looks like
>> int received_request(rad_listen_t *listener,
>> to allow a 300 second leeway instead of the 1 second one so that in effect, received_conflicting_request is never called.
> That's just papering over the problem, unfortunately. The issue is
> What seems to be happening is that the main thread thinks the request
> is done, when it's not. So the request gets cleaned up, even if a child
> thread is still processing it.
>> I don’t think this particular code or most of its supporting routines have changed much between 2.2.1 and the current 2.x.x.
> The code changed a little. The change was to fix a bug where the
> server would proxy a request, but still mark it as being managed by a
> child thread. The change marked the request as *not* being handled by a
> child thread in more situations.
> And... it looks like it's doing that marking too often. <sigh>
> I've pushed some more assertions && code tweaks. Please try the git
> v2.x.x branch to see if it helps.
> Alan DeKok.
> List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
More information about the Freeradius-Users