rlm_perl - unresponsive child threads

Mon Oct 19 00:29:39 CEST 2020

On Oct 18, 2020, at 5:21 PM, P. R.M. <romero619 at hotmail.com> wrote:
> 
> I'm trying to debug some issues when using rlm_perl with a mysql backend.
> 
> My module script uses the Perl DBI module in order to access the back-end, and I'm calling stored procedures for each of the radius sections (authorize, authenticate, accounting, etc.).

  Why?  The main SQL module can call stored procedures.  Just edit the queirs.

> Some of the queries will occasionally take too long (~20-30sec+), and this will apparently cause the freeradius child threads to hang & die (correct?).

  Well, the threads hang for 30s.  That's what happens when the queries take 30s to return.

> When this happens, I'll find multiple sequences of entries in the radiusd log file such as the following:
> 
> Sat Oct 17 21:30:07 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 55526 - ID: 43 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:12 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 55526 - ID: 43 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:42 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 40463 - ID: 44 due to unfinished request in module perl.  Giving up on old request.
> Sat Oct 17 21:30:47 2020 : Error: Received conflicting packet from client 0.0.0.0/0 port 40463 - ID: 44 due to unfinished request in module perl.  Giving up on old request.

  Exactly.  You can't expect that the NAS wait around forever for replies to packets it sent.  RFC 5080 defines NAS behaviour.  Most NASes wait 20-30 seconds, and then give up.

  You can change the NAS behaviour (if they're your NASes).  But this is the wrong approach.  It's putting bandages on a gushing wound.  It won't fix it.

> I'm already working on trying to improve the query speed, but there still might be inevitable occasional slow queries.

  Then fix them so that *all* queries run quickly.

  It's really quite simple.  There simply is no reason for _any_ authorization query to take more than 1/10s.  If it does, your database is severely broken.

  Accounting queries *may* take longer than that, for a number of reasons.  But it's easy to address that.  Dump the accounting packet to a log file.  See mods-availab/sql, "logfile", and scripts/radsqlrelay

  i.e. you need the accounting data stored to SQL.  But there's no reason to delay the reply to the NAS.

> So, I'm trying to figure out how to gain more insight on this problem and hopefully prevent it and/or handle it better.
> 
> So, my question(s):
> 1) what specifically causes the child thread to be detached/abandoned? is it a timeout issue within freeradius?

  a) max_request_time timeout
  b) conflicting packet from the NAS

  The "conflicting packet" is because the NAS has given up on the old packet, and sent a brand new one.  Congratulations... you just wasted the last 30s of compute time, because it's meaningless.  The NAS won't get (or will ignore) any reply you send.

> - In a normal Perl script that accesses a DB using DBI, a slow DB query will simply cause the script to wait indefinitely until the query finishes (unless a connection timeout occurs), but it wouldn't necessarily die/crash completely. This leads me to assume that there's some kind of hard-coded timeout within freeradius that expires & causes it to give up on the child thread if it's still stuck waiting for a query to finish. Is this correct or incorrect?

  See above.

> 2) if there *is* some kind of built-in timeout within freeradius for processing sections via rlm_perl, what is its value? can it be changed/extended via some configuration setting? (I looked around a bit in the source code, but didn't see anything)

  There's no module-specific timeouts, because it won't help.

> 3) Aside from trying to guarantee that a query will never run slowly (not realistic in my case),

  Then fix your design.  Or, make sure that FreeRADIUS isn't waiting for the reply to the query.

> is there anything else that can be done to handle a slow response from rlm_perl, instead of just giving up on the child thread? Some kind of debug logging for cases where the child thread(s) hang or don't respond in a timely fashion?

  See above.  If the *NAS* has given up on the packet, then you've wasted all of the time spent on the query.  No one will pay attention to the reply, so the work spent is useless.

  Your question is essentially "I've nailed my feet to the floor.  I know it hurts, but I've done it.  How do I now run a marathon?"

  Answer:  You can't.

  Really.  You can't.  Don't nail your feet to the floor.  There is *no* other good answer.

  I have to ask, tho.  *Why* are these queries taking 30s?  And how come it's impossible to change that?  Just what the heck is the database *doing* for 30s?

  Alan DeKok