rlm_perl - unresponsive child threads

Mon Oct 19 01:05:08 CEST 2020

>  Why?  The main SQL module can call stored procedures.  Just edit the queirs.

there's some other intermediate logic involved in handling the requests, so I can't simply use the standard rlm_sql module (I initially tried, but it didn't work out).

>  Well, the threads hang for 30s.  That's what happens when the queries take 30s to return.
 > Exactly.  You can't expect that the NAS wait around forever for replies to packets it sent.  RFC 5080 defines NAS behaviour.  Most NASes wait 20-30 seconds, and then give up.
 > You can change the NAS behaviour (if they're your NASes).  But this is the wrong approach.  It's putting bandages on a gushing wound.  It won't fix it.

I don't expect either the NAS or the radius server to wait for 30s, past the resend timeout. I just wanted to gain a better understanding of how it works within the radius server.
And, yes, I do control the NAS, so I * am * able to customize the timeouts & retries on it, but I'm also not going to claim this as the right approach in general (FYI, it's a very small custom LAN setup w/ only one NAS & one small radiusd server on a raspberry pi)

> Then fix them so that *all* queries run quickly.
>  It's really quite simple.  There simply is no reason for _any_ authorization query to take more than 1/10s.  If it does, your database is severely broken.
> I have to ask, tho.  *Why* are these queries taking 30s?  And how come it's impossible to change that?  Just what the heck is the database *doing* for 30s?

The queries normally run extremely fast, under ~0.05 sec, but the rPi hardware that it's running on sometimes experiences random hiccups that cause it to slow to a halt. It's hard to predict or prevent, and I don't have the option to upgrade the hardware in this particular case, so I'm stuck dealing with it & trying to find work-arounds.

> Accounting queries *may* take longer than that, for a number of reasons.  But it's easy to address that.  Dump the accounting packet to a log file.  See mods-availab/sql, "logfile", and scripts/radsqlrelay
> i.e. you need the accounting data stored to SQL.  But there's no reason to delay the reply to the NAS.

True, this is a good point. OK, I'll look into doing this.

>  b) conflicting packet from the NAS
>  The "conflicting packet" is because the NAS has given up on the old packet, and sent a brand new one.  Congratulations... you just wasted the last 30s of compute time, because it's meaningless.  The NAS won't get (or will ignore) any reply you send.

yes, this was already my understanding.
I wasn't expecting an explanation on the packet retries issue, just on how the internal radius timeout worked.

>   a) max_request_time timeout
> There's no module-specific timeouts, because it won't help.

OK, so I guess the setting I'd need to mess with is the 'max_request_time'.
In theory, I could extend the timeouts on both the NAS and the radiusd server and have it wait for a long query to end, correct?
NOTE: I'm * NOT * saying this is a good idea or a globally acceptable solution. I'm just trying to understand if this could theoretically work, and if it might help give me some margin to work with.

> Your question is essentially "I've nailed my feet to the floor.  I know it hurts, but I've done it.  How do I now run a marathon?"
>  Answer:  You can't.
> Really.  You can't.  Don't nail your feet to the floor.  There is *no* other good answer.

Aside from the slight harshness in your reply, I do appreciate your assistance.
I understand how frustrating this must be for you.

________________________________
From: Freeradius-Users <freeradius-users-bounces+romero619=hotmail.com at lists.freeradius.org> on behalf of Alan DeKok <aland at deployingradius.com>
Sent: Sunday, October 18, 2020 3:29 PM
To: FreeRadius users mailing list <freeradius-users at lists.freeradius.org>
Subject: Re: rlm_perl - unresponsive child threads

On Oct 18, 2020, at 5:21 PM, P. R.M. <romero619 at hotmail.com> wrote:
>
> I'm trying to debug some issues when using rlm_perl with a mysql backend.
>
> My module script uses the Perl DBI module in order to access the back-end, and I'm calling stored procedures for each of the radius sections (authorize, authenticate, accounting, etc.).

  Why?  The main SQL module can call stored procedures.  Just edit the queirs.

> Some of the queries will occasionally take too long (~20-30sec+), and this will apparently cause the freeradius child threads to hang & die (correct?).

  Well, the threads hang for 30s.  That's what happens when the queries take 30s to return.

> When this happens, I'll find multiple sequences of entries in the radiusd log file such as the following:
>
> Sat Oct 17 21:30:07 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 55526 - ID: 43 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:12 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 55526 - ID: 43 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:42 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 40463 - ID: 44 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:47 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 40463 - ID: 44 due to unfinished request in module perl.  Giving up on old request.

  Exactly.  You can't expect that the NAS wait around forever for replies to packets it sent.  RFC 5080 defines NAS behaviour.  Most NASes wait 20-30 seconds, and then give up.

  You can change the NAS behaviour (if they're your NASes).  But this is the wrong approach.  It's putting bandages on a gushing wound.  It won't fix it.

> I'm already working on trying to improve the query speed, but there still might be inevitable occasional slow queries.

  Then fix them so that *all* queries run quickly.

  It's really quite simple.  There simply is no reason for _any_ authorization query to take more than 1/10s.  If it does, your database is severely broken.

  Accounting queries *may* take longer than that, for a number of reasons.  But it's easy to address that.  Dump the accounting packet to a log file.  See mods-availab/sql, "logfile", and scripts/radsqlrelay

  i.e. you need the accounting data stored to SQL.  But there's no reason to delay the reply to the NAS.

> So, I'm trying to figure out how to gain more insight on this problem and hopefully prevent it and/or handle it better.
>
> So, my question(s):
> 1) what specifically causes the child thread to be detached/abandoned? is it a timeout issue within freeradius?

  a) max_request_time timeout
  b) conflicting packet from the NAS

  The "conflicting packet" is because the NAS has given up on the old packet, and sent a brand new one.  Congratulations... you just wasted the last 30s of compute time, because it's meaningless.  The NAS won't get (or will ignore) any reply you send.

> - In a normal Perl script that accesses a DB using DBI, a slow DB query will simply cause the script to wait indefinitely until the query finishes (unless a connection timeout occurs), but it wouldn't necessarily die/crash completely. This leads me to assume that there's some kind of hard-coded timeout within freeradius that expires & causes it to give up on the child thread if it's still stuck waiting for a query to finish. Is this correct or incorrect?

  See above.

> 2) if there *is* some kind of built-in timeout within freeradius for processing sections via rlm_perl, what is its value? can it be changed/extended via some configuration setting? (I looked around a bit in the source code, but didn't see anything)

  There's no module-specific timeouts, because it won't help.

> 3) Aside from trying to guarantee that a query will never run slowly (not realistic in my case),

  Then fix your design.  Or, make sure that FreeRADIUS isn't waiting for the reply to the query.

> is there anything else that can be done to handle a slow response from rlm_perl, instead of just giving up on the child thread? Some kind of debug logging for cases where the child thread(s) hang or don't respond in a timely fashion?

  See above.  If the *NAS* has given up on the packet, then you've wasted all of the time spent on the query.  No one will pay attention to the reply, so the work spent is useless.

  Your question is essentially "I've nailed my feet to the floor.  I know it hurts, but I've done it.  How do I now run a marathon?"

  Answer:  You can't.

  Really.  You can't.  Don't nail your feet to the floor.  There is *no* other good answer.

  I have to ask, tho.  *Why* are these queries taking 30s?  And how come it's impossible to change that?  Just what the heck is the database *doing* for 30s?

  Alan DeKok

-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html