Cannot control attribute ordering via "rlm_perl"

Tue Jan 24 11:54:06 CET 2012

Claude Brown wrote:
> My original reply was confusingly brief. I've clarified below, and I've also put the module we wrote into github in case it helps:
> 
> https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

  OK.  It's... odd.

> We avoided both "fastfile" and reloading "files" on the fly because of the number of updates we have to our user setup.  The rate of change to our customers would require a reload every few seconds during most of the day.

  I'd normally just put users into SQL.

> We had concerns in two areas:
> - The time to re-write the config and then re-load so frequently. This may become a performance problem as our user base grows out to 250K.
> - The risk of using the reload mechanism in a way that didn't seem consistent with its design intent, or the likely usage pattern of reloads every day or every few hours.

  OK.  Reloads don't work for you.

> FreeRADIUS core is very stable. But MySQL adds instability we have been unable to identify or reproduce in our environment.

  That's odd.  While MySQL isn't perfect, I have successfully used it in
systems with hundreds of transactions/s.  There was a VoIP provider ~8
years ago using it with ~1K authentications/s.

> When large parts of our WiMAX network are restarted due to maintenance or failure, the customer devices re-join the network. Whilst this doesn't happen often, when it does happen as many as 50K devices will simultaneously ask to rejoin the network.  We need to service this sudden and dramatic backlog as quickly as possible.

  Yup.

> With the "files" module this is a breeze with a single server.  It just eats it up and everything comes back in a few minutes. Importantly, our testing shows the design goal of 250K users would also be met with one server.
> 
> But with "rlm_sql" and MySQL we could not do it. The radiusd would start slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP, this is about 30 devices per sec).  The radiusd log reported "Unresponsive child" in a MySQL module and gradually all the database concurrency would disappear as those threads were lost for further work.

  MySQL does have concurrency issues.  But if you split it into
auth/acct, most of those go away.  i.e. use one SQL module for
authentication queries.  Use a *different* one for accounting inserts.

  If you also use the decoupled-accounting method (see
raddb/sites-available), MySQL gets even faster.  Having only one process
doing inserts can speed up MySQL by 3-4x.
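
  To illustrate the single-writer idea (not the FreeRADIUS mechanism
itself), here is a minimal Python sketch.  It uses SQLite as a stand-in
for MySQL, and the table name, column names, and queue hand-off are all
hypothetical.  Producers only enqueue records; one thread owns the
database connection and inserts in batches.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the writer to finish


def writer_loop(db_path, work_queue, batch_size=100):
    """Single writer: the only code that ever inserts into the database."""
    db = sqlite3.connect(db_path)  # connection is created in the writer thread
    db.execute(
        "CREATE TABLE IF NOT EXISTS acct "
        "(username TEXT, session_id TEXT, status TEXT)"
    )
    done = False
    while not done:
        batch = []
        item = work_queue.get()  # block until at least one record arrives
        while True:
            if item is STOP:
                done = True
                break
            batch.append(item)
            if len(batch) >= batch_size:
                break
            try:
                item = work_queue.get_nowait()  # drain whatever is queued
            except queue.Empty:
                break
        if batch:
            with db:  # one transaction per batch
                db.executemany("INSERT INTO acct VALUES (?, ?, ?)", batch)
    db.close()


# Demo: the "live" side only enqueues; it never touches the database.
db_path = os.path.join(tempfile.mkdtemp(), "acct.db")
work = queue.Queue()
writer = threading.Thread(target=writer_loop, args=(db_path, work))
writer.start()
for n in range(250):
    work.put(("user%d" % n, "sess%d" % n, "Start"))
work.put(STOP)
writer.join()

check = sqlite3.connect(db_path)
row_count = check.execute("SELECT COUNT(*) FROM acct").fetchone()[0]
check.close()
```

  The point of the design is that the request-handling side never waits
on an insert, and the database sees one writer instead of many
competing ones.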

> With our new far simpler approach, all of this has gone away because we are now using the "files" module and "users" file directly. The speed of authentication is essentially as per that module.

  OK.

> The value of the extra attribute is in essence obtained like this:
> 1. Format a filename such as "/blah/%{Username}"
> 2. Read a line from this file

  Using a database WILL be faster than reading the file system.
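
  For reference, the two-step lookup described in the quote amounts to
something like the Python sketch below.  This is not the code from the
module; the directory, the function name, and the path check are
assumptions made for illustration.

```python
import os
import tempfile


def read_user_tag(base_dir, user_name):
    """Return the first line of base_dir/user_name, or None if unavailable."""
    # Hypothetical safety check: refuse names that could escape base_dir.
    if not user_name or "/" in user_name or os.sep in user_name:
        return None
    if user_name in (".", ".."):
        return None
    path = os.path.join(base_dir, user_name)  # step 1: format the filename
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.readline().strip() or None  # step 2: read one line
    except FileNotFoundError:
        return None


# Demo with a throwaway directory standing in for the real one.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "alice"), "w", encoding="utf-8") as handle:
    handle.write("gold-customer\n")

alice_tag = read_user_tag(demo_dir, "alice")
missing_tag = read_user_tag(demo_dir, "nobody")
```

  Every lookup costs an open, a read, and a close on the file system.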

> We only have about 10 different values in these files: things like "voip-customer", "payment-overdue", "gold-customer", "exceeded-download-limit", etc.  The value is used to select a DEFAULT entry in the "users" file that builds the reply attributes needed to configure the customers service.

  You can do the same kind of thing with SQL.  Simply create a table,
and do:

   update request {
      My-Magic-Attr = "%{sql: SELECT .. from ..}"
   }

  Have the table contain the mapping of User-Name --> "voip-customer".
You should be able to get very high performance.  Then, use that
attribute to do the mappings in the "users" file, just like you do today.
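
  A minimal sketch of such a mapping table, using Python with SQLite as
a stand-in for MySQL.  The table name "user_class" and its columns are
hypothetical; the query is the kind of SELECT that would go inside the
%{sql: ...} expansion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per user; the primary key makes the lookup a single index probe.
conn.execute(
    "CREATE TABLE user_class ("
    " username TEXT PRIMARY KEY,"
    " service_class TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO user_class (username, service_class) VALUES (?, ?)",
    [("alice", "voip-customer"), ("bob", "payment-overdue")],
)
conn.commit()


def lookup_class(db, user_name):
    """Return the service class for user_name, or None if there is no row."""
    row = db.execute(
        "SELECT service_class FROM user_class WHERE username = ?",
        (user_name,),
    ).fetchone()
    return row[0] if row else None


alice_class = lookup_class(conn, "alice")
unknown_class = lookup_class(conn, "carol")
```

  Changing a customer's class is then a single-row UPDATE, with no
reload and no file rewrite.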

> This happens when we have a major network event that causes lots of devices to simultaneously request authentication. Due to the unpredictable loss of threads, we have to manually manage the rate of the incoming authentications by slowly starting small sections of the network at a time.
> 
> This process takes us hours of careful (manual) rate management.

  That's just weird.  SQL should be fine, *if* you design the system
carefully.  That's the key.

> Possibly, but we couldn't find a way. We would be keen to understand the fix for this.

  See above.

> We had no problem during normal operation.  It was only when large numbers of devices (typically 10K or more) simultaneously needed to re-join the network for some reason. 
> 
> Do you know if these other sites have those kinds of events?

  *Everyone* has this happen.  There's really no need for a new module.

> However, the stability issue would never go away. To me it smells of a race condition somewhere in the MySQL library. As we could only ever reproduce it by cycling 10K or more users, it was proving very difficult to debug.

  It's not a race condition, it's lock contention.

> But we spent far less time coding & testing a few hundred lines of "C" code than all the effort over the previous 18 months trying to reproduce, isolate or work around the MySQL problem.  We gave up.
> 
> A nice bonus is that we can now head towards a single server configuration with a file-system database. This will allow us to retire a raft of servers doing proxying, multiple radiusd, and multiple MySQL instances.

  If it works for you...

  But it's really just a re-implementation of a simple SQL table.  It's
a solution which is specific to your environment.

  The more generic solution is:

- custom tables
- split auth/acct
- decouple acct from the "live" server

  You should be able to get very high performance with that.  The
benefit is you'll be using real databases, which is usually a good idea.

  Alan DeKok.