RADIUS failing to start correctly when remote DB is unavailable.

Mon Oct 31 14:53:19 UTC 2022

Hi Alan,

Thank you for your explanation. Since then I have been trying to
solve this issue, and I have seen two patterns in which it can
occur.

The first scenario is when RADIUS receives an ICMP unreachable packet
when trying to connect to the remote PostgreSQL DB. This works as expected:
RADIUS marks the DB as down and continues starting, and users connect
instantly without any delay. Below are debug logs showing the time from
when RADIUS declares the remote DB unreachable until it is ready to
process new requests (under 5 seconds).

14:15:42.665 (1) sql_remote: EXPAND %{tolower:type.%{%{Acct-Status-Type}:-%{Request-Processing-Stage}}.query}
14:15:42.665 (1) sql_remote:    --> type.start.query
14:15:42.665 (1) sql_remote: Using query template 'query'
14:15:42.665 rlm_sql (sql_remote): 0 of 0 connections in use.  You  may need to increase "spare"
14:15:42.665 rlm_sql (sql_remote): Opening additional connection (0), 1 of 32 pending slots used
14:15:42.665 rlm_sql_postgresql: Connecting using parameters: dbname=tstradius host=xxx.xxx.xxx.xxx user=tstradiususer password=xxx application_name='FreeRADIUS 3.0.25 - radiusd (sql_remote)'
14:15:42.666 rlm_sql_postgresql: Connection failed: could not translate host name "xxx.xxx.xxx.xxx" to address: Name or service not known
14:15:42.666 rlm_sql_postgresql: Socket destructor called, closing socket
14:15:42.666 rlm_sql (sql_remote): Opening connection failed (0)
14:15:42.666 (1)     [sql_remote] = fail
14:15:42.666 (1)   } # accounting = fail
14:15:42.666 (1) Not sending reply to client.
14:15:42.666 (1) Finished request
14:15:42.666 (1) Cleaning up request packet ID 106 with timestamp +184
14:15:42.675 Waking up in 4.8 seconds.
14:15:47.378 (0) Cleaning up request packet ID 105 with timestamp +184
14:15:47.380 Ready to process requests

The second scenario is when access to the remote DB is lost due to a
network outage. This is the issue that I am experiencing: RADIUS takes
too long to notice that the DB is unreachable. It seems that the timeout
is either not working or is too long. Is there a way I can change it? I
have tried changing the parameters shown below (in the sql files);
however, this did not resolve the issue.

#  The number of seconds to wait after the server tries
#  to open a connection, and fails.  During this time,
#  no new connections will be opened.
retry_delay = 3

# The lifetime (in seconds) of the connection
lifetime = 1

#  idle timeout (in seconds).  A connection which is
#  unused for this length of time will be closed.
idle_timeout = 6

Also, the debug output below shows that in this state it takes much longer
until RADIUS is ready to receive requests (about 2 minutes in this case).

09:43:36.193 (3) sql_remote: EXPAND %{tolower:type.%{%{Acct-Status-Type}:-%{Request-Processing-Stage}}.query}
09:43:36.194 (3) sql_remote:    --> type.start.query
09:43:36.195 (3) sql_remote: Using query template 'query'
09:43:36.196 rlm_sql (sql_remote): 0 of 0 connections in use.  You  may need to increase "spare"
09:43:36.198 rlm_sql (sql_remote): Opening additional connection (2), 1 of 1 pending slots used
09:43:36.209 rlm_sql_postgresql: Connecting using parameters: dbname=tstradius host=xxx.xxx.xxx.xxx user=tstradiususer password=xxx application_name='FreeRADIUS 3.0.25 - radiusd (sql_remote)'
09:45:46.745 rlm_sql_postgresql: Connection failed: could not connect to server: Connection timed out
        Is the server running on host "xxx.xxx.xxx.xxx" (x.x.x.x) and accepting
        TCP/IP connections on port 5432?
09:45:46.766 rlm_sql_postgresql: Socket destructor called, closing socket
09:45:46.767 rlm_sql (sql_remote): Opening connection failed (2)
09:45:46.768 (3)     [sql_remote] = fail
09:45:46.769 (3)   } # accounting = fail
09:45:46.771 (3) Not sending reply to client.
09:45:46.771 (3) Finished request
09:45:46.773 (3) Cleaning up request packet ID 151 with timestamp +451
09:45:46.774 Ready to process requests
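
To make the difference between the two scenarios concrete, here is a
minimal Python sketch, independent of FreeRADIUS, that times a single TCP
connection attempt. The probe() helper and its 3-second default are
hypothetical and only for illustration. A refused connection or a
name-resolution failure comes back as an error almost immediately, whereas
silently dropped packets produce no answer at all, so the attempt only ends
when a timeout expires.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one TCP connect and return (outcome, seconds elapsed).

    Hypothetical helper for illustration only. The timeout bounds each
    connect attempt; it does not bound the name lookup that precedes it.
    """
    start = time.monotonic()
    try:
        # create_connection() resolves the host, then connects with the
        # given per-attempt timeout.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except (socket.timeout, TimeoutError):
        # No reply at all (e.g. packets silently dropped): we waited
        # until the timeout expired.
        outcome = "timed out"
    except OSError as exc:
        # An explicit error came back (refused, unreachable, DNS failure),
        # so the caller learns about the problem right away.
        outcome = f"failed fast: {exc.__class__.__name__}"
    return outcome, time.monotonic() - start
```

Against a closed port on a reachable host, probe() should report a
"failed fast" outcome almost at once; against a host whose packets are
dropped, it should report "timed out" after roughly the given timeout.
For PostgreSQL clients, libpq documents a connect_timeout connection
parameter that bounds this wait; whether it can be passed through the
FreeRADIUS sql module configuration is an assumption not verified here.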

What I am trying to achieve is to have the behaviour of the first scenario
whenever there is a network outage. Would it be possible to achieve this,
please?

Thanks again.

Kind Regards,
Clint

On Tue, Oct 11, 2022 at 3:59 PM Alan DeKok <aland at deployingradius.com>
wrote:

> On Oct 11, 2022, at 9:38 AM, Sea Gull <seagull0044 at gmail.com> wrote:
> > Good afternoon! I have a setup where RADIUS is set to write to multiple
> > DBs simultaneously. I have set this as follows:
> >
> > 1. Copied the SQL instance in /etc/raddb/mods-enabled/sql and had it
> > renamed and configured accordingly.
> > 2. Called them both
> > 3. In the pool, I have set start=0
> > 4. Set read_client to no
>
>   OK...
>
> > Although from debug these seem to be correctly set, I am still getting
> > the message that RADIUS is trying to connect to the DB when it is
> > unavailable.
>
>   That's how it works.
>
>   The only way that FreeRADIUS knows that the DB is unavailable is by
> trying to connect, and failing.
>
> > After quite a number of minutes and 3 retries, RADIUS fails to start. I
> > have attached the full debug and included some explanation too along the
> > way.
>
>   The debug log shows that it's starting fine.  And that it's trying to
> connect to the SQL server when FreeRADIUS receives accounting packets.
> Because that's what you configured FreeRADIUS to do.
>
>   If you want it to dynamically choose an SQL server based on which one is
> up, you can put them into a redundant section:
>
>         redundant {
>                 sql1
>                 sql2
>         }
>
>   That will choose sql1 until it's down, and then will choose sql2.  But
> it will still try to use sql1, because FreeRADIUS has no idea that it's
> down.  There's no magical connection between SQL and FreeRADIUS which says
> "SQL database is down".
>
>   So the server does start, and it works as documented, and it works the
> only way it *can* work.  I'm not sure what else you expect it to do here.
>
>   My recommendation is that if FreeRADIUS is using a DB, then you should
> ensure that the DB is up 100% of the time.  All of the fail-over,
> redundant, etc. checks in the server are there only to catch unusual /
> error cases.  If your SQL server is *normally* down, then just configure
> FreeRADIUS to not use it.
>
>   Alan DeKok.