server cycles and memory utilization blamed on large radutmp radwtmp

Wed Jan 25 12:31:28 CET 2006

Alan DeKok wrote:

> Joe Maimon <jmaimon at ttec.com> wrote:
> 
>>Apparently something about those files being a couple hundred megabytes 
>>  triggered the server to eat memory and cpu and generate <TIMEOUT> 
>>errors, no available thread, and core dumps.
> 
> 
>   Weird.  Very, very, weird.  All I can think of is that it took more
> than 30s to root through radutmp.  That would cause problems.  One
> solution would be to just stop using radutmp, and use a real database
> when the size of the file gets large.

This would appear to be the case. I went and had a quick look at 
rlm_radutmp.c

Increasing max_servers to 128, max_request to 512 and max_request_time 
to 120 just made the server use up a gig of RAM and peg the CPU at 99% 
for about 5-30 minutes before it crashed.

(kicking myself for deleting the file instead of saving it elsewhere)

rlm_radutmp.c contains loops that read the entire file sizeof(struct 
radutmp) bytes at a time.

Short of looping backwards from the end of the file and/or performing 
larger read()s into a buffer, I don't see any other way to improve on that.
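
A rough sketch of the larger-read() idea, not the actual rlm_radutmp.c 
code. struct demo_rec is a hypothetical stand-in for struct radutmp, and 
BATCH and count_logins() are names made up for illustration. It pulls in 
many records per read() and walks them in memory, carrying any partial 
trailing record over to the next read:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for struct radutmp (the real layout is in radutmp.h). */
struct demo_rec {
	char login[32];
	unsigned int nas_port;
	int type;		/* assumed: 0 = free slot, 1 = live session */
};

#define BATCH 256		/* records fetched per read() call (arbitrary) */

/*
 * Count live records whose login matches, reading BATCH records at a
 * time instead of one. Returns the count, or -1 on a read error.
 */
long count_logins(int fd, const char *login)
{
	struct demo_rec buf[BATCH];
	size_t have = 0;	/* bytes currently held in buf */
	long hits = 0;
	ssize_t n;

	assert(login != NULL);

	while ((n = read(fd, (char *)buf + have, sizeof(buf) - have)) > 0) {
		size_t whole, used, i;

		have += (size_t)n;
		whole = have / sizeof(struct demo_rec);

		for (i = 0; i < whole; i++) {
			if (buf[i].type == 1 &&
			    strncmp(buf[i].login, login, sizeof(buf[i].login)) == 0)
				hits++;
		}

		/* Move any partial trailing record to the front of buf. */
		used = whole * sizeof(struct demo_rec);
		memmove(buf, (char *)buf + used, have - used);
		have -= used;
	}

	return (n < 0) ? -1 : hits;
}
```

This only cuts the number of system calls. The scan is still linear in 
the size of the file, so it would not help much once the file itself is 
hundreds of megabytes.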

ISTR some distributions logrotate the radutmp files, based on size or on 
a monthly basis.

For this server it's certainly not critical, so it may just get rotated 
or eliminated.

What do you think about threads that undertake potentially open-ended 
operations? Should they timestamp them and check that they are not 
exceeding max_request_time?

Maybe this should be done with signals, with modules registering 
allocations that would be cleaned up from the signal handler whenever 
time ran out on the request?
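
For the simpler timestamp-and-check variant (no signals), something along 
these lines is what I have in mind. This is only a sketch: struct 
demo_deadline, deadline_expired() and scan_with_deadline() are 
hypothetical names, not anything in the server, and the per-record work 
is left as a placeholder:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-request deadline, not a FreeRADIUS structure. */
struct demo_deadline {
	time_t started;		/* when the request was picked up */
	double max_seconds;	/* would come from max_request_time */
};

/* Nonzero once the request has used up its time budget. */
int deadline_expired(const struct demo_deadline *d, time_t now)
{
	return difftime(now, d->started) >= d->max_seconds;
}

/*
 * Walk nrecs records, looking at the clock every check_every records so
 * that time() is not called on each iteration. Returns nrecs when the
 * walk completes, or -1 if the deadline passed and the walk was abandoned.
 */
long scan_with_deadline(const struct demo_deadline *d, long nrecs,
			long check_every)
{
	long i;

	assert(check_every > 0);

	for (i = 0; i < nrecs; i++) {
		if (i % check_every == 0 && deadline_expired(d, time(NULL)))
			return -1;	/* caller releases its own resources */

		/* ... per-record work would go here ... */
	}

	return nrecs;
}
```

Because the loop returns normally, the module frees whatever it 
allocated on the way out, so nothing has to be registered for cleanup 
elsewhere. The catch is that it only works for code that gets back to 
the check; a thread stuck inside a blocking call never reaches it.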

> 
>   For radwtmp, records are just appended, so there shouldn't be a
> problem.
> 
> 
>>+++ freeradius-2.0.0/src/modules/rlm_exec/rlm_exec.c	2006-01-24 18:35:58.000000000 -0500
>>@@ -281,6 +281,13 @@
>> 	VALUE_PAIR *answer;
>> 	rlm_exec_t *inst = (rlm_exec_t *) instance;
>> 
>>+	if (!inst || !request) {
>>+		radlog(L_ERR, "%s: %s() line: %d , a very bad thing happened",
> 
> 
>   That should really be caught in the server core, before the module
> is executed.  src/main/modcall.c tries to do this, I think.
> 

This one was not actually a cause; the core showed it as non-null. The 
portion below that in the patch was where gdb showed execution, so the 
crash was probably in the compound if statement.

> 
>>+++ freeradius-2.0.0~pre0~cvs20051222-0-JM/src/main/request_process.c	2006-01-24 17:41:13.000000000 -0500
>>@@ -559,6 +559,7 @@
>> 	 *	suppress packets which aren't supposed to be sent over
>> 	 *	the wire, or to be delayed.
>> 	 */
>>+	if (request && request->listener && request->listener->send)
> 
> 
>   Hmm... that may be better done by just bailing if the request is
> deleted.
> 
>   The server doesn't handle deleting "live" requests that well.  It's
> a problem.
> 

Doesn't the server kill threads when deleting live requests?

The way I read it, the server processing a request->finished = TRUE 
would trip on rad_assert(request->child_pid == NO_SUCH_CHILD_PID) if 
there was still a thread on it. Still, that might be the cause: I have 
changes that attempt to work around other problems by setting it.

Possibly, the server shouldn't delete live requests and instead semaphore 
the thread to delete it when it's done.

Of course, threads that hang would now cause a memory leak.

So perhaps a second cleanup sweep?
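
To make the hand-off idea concrete, here is a minimal sketch using a 
C11 atomic state flag in place of a semaphore. Every name in it 
(struct demo_request, the demo_* functions, the three states) is 
hypothetical and none of it is server code. It also assumes the server 
is the only thing that hands idle requests to workers, so an idle 
request cannot be picked up while it is being deleted:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

enum demo_state { DEMO_IDLE, DEMO_RUNNING, DEMO_DOOMED };

/* Hypothetical request object, not the server's REQUEST. */
struct demo_request {
	atomic_int state;
};

struct demo_request *demo_request_new(void)
{
	struct demo_request *r = malloc(sizeof(*r));

	if (r)
		atomic_init(&r->state, DEMO_IDLE);
	return r;
}

/* A worker thread takes the request. Returns 1 on success. */
int demo_worker_start(struct demo_request *r)
{
	int expected = DEMO_IDLE;

	assert(r != NULL);
	return atomic_compare_exchange_strong(&r->state, &expected,
					      DEMO_RUNNING);
}

/*
 * The server wants the request gone. If a worker still holds it, mark
 * it doomed and let the worker free it. Returns 1 if freed immediately,
 * 0 if the free was deferred to the worker.
 */
int demo_server_delete(struct demo_request *r)
{
	int expected = DEMO_RUNNING;

	if (atomic_compare_exchange_strong(&r->state, &expected, DEMO_DOOMED))
		return 0;		/* worker frees it when done */
	if (expected == DEMO_DOOMED)
		return 0;		/* already marked, still deferred */

	free(r);			/* idle: nobody else holds it */
	return 1;
}

/* The worker is finished. Returns 1 if it had to free the request. */
int demo_worker_done(struct demo_request *r)
{
	int expected = DEMO_RUNNING;

	if (atomic_compare_exchange_strong(&r->state, &expected, DEMO_IDLE))
		return 0;		/* server still owns the request */

	free(r);			/* state was DEMO_DOOMED */
	return 1;
}
```

A worker that hangs never calls demo_worker_done(), so a doomed request 
stays allocated. That is the leak mentioned above, and it is what a 
second cleanup sweep would have to look for.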

> 
>>+++ freeradius-2.0.0/src/main/acct.c	2006-01-24 22:35:24.000000000 -0500
>>@@ -152,7 +152,8 @@
>> 		 */
>> 		case RLM_MODULE_OK:
>> 		case RLM_MODULE_UPDATED:
>>-			request->reply->code = PW_ACCOUNTING_RESPONSE;
>>+			if (request->reply)
> 
> 
>   That should only be necessary if the request is free'd.  In that
> case, the *only* thing to do is to bail out of handling the request.
> 

So if this is a concern, the check should be made on entry to the function?

>   This is really what exceptions are for.
> 
>   Alan DeKok.
> 

Well, it looks like the problems are solved for now, so this is probably 
not something to lose sleep over.

Thanks,

Joe