Observing a common error in Freeradius with Postgres but not justification why it happens

Sat Sep 2 12:08:13 UTC 2023

On Sep 2, 2023, at 5:10 AM, Ibrahim Al Mahfooz <ibrahim.nezar at sevennet.net> wrote:
> We have implemented freeradius 3.2.3 with Postgresql 12.15 to store
> accounting packets i.e start, interim-update, stop for reporting purposes.
> We customized the queries files to store fewer columns than the original
> schema since this is what we need.

  That's fine, if it's done carefully.

> The following columns are customized in the accounting queries and same
> columns in postgres radacct table:
> 
> radacctid, acctsessionid, acctuniqueid, acctstarttime, acctudpatetime,
> acctstoptime, callingstationid, framedipaddress, imei, 3gpplocation.

  Except the acctuniqueid column is used to index the table.  So if you delete that column, you don't have indexes.  And every select / insert / update will require a full table scan.

  There is pretty much zero downside to leaving in columns which you don't need.  But if you do want to remove columns, don't remove ones which are indexed.  And don't remove columns which are used as part of the WHERE clauses.

  Or maybe "customized" here means something different than the first paragraph, where it means "deleted columns".

> Also the following indexes are in radacct table:
> 
> 1. radacct_active_session_idx (this is the original from FR schema of
> postgresql)
> 2. framedipaddress, starttime, stoptime, updatetime (This one we added to
> use upon need in our read queries on report generation)

  I would recommend not using the same database for FreeRADIUS and for report generation.  What typically happens is that everything is fine until you want to generate a bunch of reports.  And then the reporting queries use all of the database resources, and FreeRADIUS will get robbed of resources.

  Set up DB replication.  FreeRADIUS writes to the primary database.  That replicates to the standby database.  And the reports run against the standby database.

> Based on the above observed messages, we tuned two matters:

  It is rare to fix these problems by "tuning" the database.  You have to build the DB with the correct architecture.  Which means primary / secondary replication, correct indexes for your queries, etc.

  Tuning will usually change the performance slightly.  Fixing the design can add enormous amounts of performance.

> 1. the postgresql db by changing the default shared_buffer from 128MB to
> 8GB since we have plenty of RAM in the server, and this clearly reflected
> in the performance and removed around 98% of the above error messages.

  That's good.

> However we sometimes see the same error message showing up in peak hours
> and KPI of radius is impacted per the NAS report.

  That means the database is undersized.  i.e. it can't keep up with the load.

  Either you need a more powerful database, or you need to make the current database more efficient.

> The questions to summarize are :
> 1. Is ~900 TPS and 10ms acceptable DB performance for accounting?

  That is the average case, which isn't relevant.  You're looking for the performance under load.  And if you get errors during peak hours, the database is slow, and can't keep up with the load.

> 2. In the max_requests setting, what does the clients mean here, does it
> mean NAS client or SQL connections based on max requests? how to size it?

  It doesn't matter a lot.  It's mainly to catch extreme corner cases.  It limits how many packets the server will accept before it starts complaining that it's overloaded.

  It should be set to at least 256 * the number of NASes. 

  i.e. set it based on inputs: the number of NASes.  Don't set it based on outputs: SQL connections.

> 3. What could be the culprit given the fact all the servers are relaxed?

  The database is slow.  It can't handle the load at peak traffic.

  My $0.02 is that if you go back to the default schema + indexes, and then add a database replica for the report generation, everything will be fine.

  This kind of thing falls under the common practice of "I spent a bunch of effort trying to optimize things, and now things are slower".  Well, don't do that.  The default queries are fine.  They're already optimized.  The default schema is fine.  It's already optimized.

   Alan DeKok.