Cannot control attribute ordering via "rlm_perl"

Tue Jan 24 01:54:48 CET 2012

Alan,

My original reply was confusingly brief. I've clarified below, and I've also put the module we wrote on GitHub in case it helps:

https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

(about 60 lines of C beyond usual module plumbing; 250 lines in total)

Alan DeKok wrote:
> 
> > - Allow high rate of user-by-user updates; i.e. avoid config re-write as
> > per "rlm_fastfile"
> 
>   ?  The "fastusers" module is deprecated, because the "files" module is
> just as fast.  The "files" module also can be HUP'd, so it can be
> reloaded on the fly.

We avoided both "fastusers" and reloading "files" on the fly because of how often our user setup changes.  Our customer data changes fast enough that we would need a reload every few seconds during most of the day.

We had concerns in two areas:
- The time to re-write the config and then re-load it so frequently. This may become a performance problem as our user base grows to 250K.
- The risk of using the reload mechanism in a way that didn't seem consistent with its design intent, or the likely usage pattern of reloads every day or every few hours.

> > - Simple for stability: no shared in-memory state (avoid locking and
> > races)
> 
>   The server core takes care of that when the "files" module is reloaded.
> 

These "Simple for stability" points were goals for our own code. They were not concerns we had about the existing code base.

FreeRADIUS core is very stable. But MySQL adds instability we have been unable to identify or reproduce in our environment.

A crucial success factor for us was to ensure our module code was so simple it was very easy to be confident that stability was maintained. The strategy was to minimise the amount of software outside FreeRADIUS core.

> 
>   Daily config reloads are easy.
> 

Agreed. If we only needed daily, the "files" module would be perfect.

>   Say you have a format similar to the "users" file, with one user per
> file.  Loading 100K users will mean 100K file reads, and that can take a
> long time.

The module doesn't re-implement the "users" format or have a "users" file for every user.  It does not read 100K (or even 10) files at start-up.

The "files" module is used directly with a single normal "users" file just as per any normal FreeRADIUS deployment.

> > We achieved all these goals and can now bring all our customers
> > back onto our service in about five minutes.
> 
>   5 minutes for what, exactly?
> 

When large parts of our WiMAX network are restarted due to maintenance or failure, the customer devices re-join the network. This doesn't happen often, but when it does, as many as 50K devices will simultaneously ask to rejoin the network.  We need to service this sudden and dramatic backlog as quickly as possible.

With the "files" module this is a breeze with a single server.  It just eats it up and everything comes back in a few minutes. Importantly, our testing shows the design goal of 250K users would also be met with one server.

But with "rlm_sql" and MySQL we could not do it. The radiusd would start slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP, this is about 30 devices per sec).  The radiusd log reported "Unresponsive child" in a MySQL module and gradually all the database concurrency would disappear as those threads were lost for further work.

After a lot of effort testing and experimenting with all sorts of things to isolate or avoid this problem, we did get a lot of improvement. But mostly what we achieved was a drop in the probability of losing threads. Inevitably the next larger network-outage event would re-trigger the issue.

With our new far simpler approach, all of this has gone away because we are now using the "files" module and "users" file directly. The speed of authentication is essentially as per that module.

Our new module adds an extra attribute to the Access-Request before it is processed by the "files" module.  The extra attribute can be any text attribute (we use "Reply-Message" to be perverse) and can have any value.  Normal "files" matching (typically using DEFAULT entries) is used to determine the attributes in the reply.

The value of the extra attribute is in essence obtained like this:
1. Format a filename such as "/blah/%{User-Name}"
2. Read a line from this file
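
For illustration, those two steps might look roughly like this in C. This is a minimal sketch, not the code in the branch linked above: the function name read_user_tag, the buffer sizes, and the rejection of names containing "/" are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the real module).
 * Builds "<dir>/<username>", reads the first line of that file,
 * strips the line ending, and stores the result in tag.
 * Returns 0 on success, -1 on any failure.
 */
int read_user_tag(const char *dir, const char *username,
                  char *tag, size_t taglen)
{
    char path[1024];
    FILE *fp;
    int n;

    assert(dir != NULL && username != NULL && tag != NULL && taglen > 0);

    /* Assumed policy: refuse names that could escape the directory. */
    if (username[0] == '\0' || strchr(username, '/') != NULL ||
        strcmp(username, ".") == 0 || strcmp(username, "..") == 0)
        return -1;

    /* Step 1: format the filename. */
    n = snprintf(path, sizeof(path), "%s/%s", dir, username);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* Step 2: read one line from the file. */
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(tag, (int)taglen, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Drop the trailing line ending, if any. */
    tag[strcspn(tag, "\r\n")] = '\0';
    return 0;
}
```

A caller would then attach the returned string to the request as the extra attribute. What to do when the file is missing (reject, or fall through to another DEFAULT entry) is a policy choice this sketch leaves open.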

We only have about 10 different values in these files: things like "voip-customer", "payment-overdue", "gold-customer", "exceeded-download-limit", etc.  The value is used to select a DEFAULT entry in the "users" file that builds the reply attributes needed to configure the customer's service.

This adds only marginal overhead, so performance is barely different from a vanilla "files" module.  The cost is one i-node per customer and a few hundred lines of C code. We are more than happy with that cost.

Outside calls to FreeRADIUS code, the module pretty much just calls "fopen", "fgets" and "fclose". So it's dreadfully simple and doesn't have any concerns with thread safety, locking, race conditions, etc.
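
The reader side stays this simple only if a per-user file is never seen half-written. One common way to get that on the provisioning side is to write a temporary file and rename() it over the old one. The sketch below assumes that approach; it is not a description of our provisioning code, and the name write_user_tag and the ".tmp" suffix are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical provisioning-side helper.
 * Writes the tag to "<dir>/<username>.tmp" and then renames it to
 * "<dir>/<username>". On POSIX systems rename() replaces the target
 * atomically, so a concurrent reader sees either the old file or the
 * new one. Returns 0 on success, -1 on failure.
 * Limitations: no fsync(), and a user literally named "x.tmp" would
 * collide with the temporary file for user "x".
 */
int write_user_tag(const char *dir, const char *username, const char *tag)
{
    char path[1024];
    char tmp[1040];
    FILE *fp;
    int n;

    assert(dir != NULL && username != NULL && tag != NULL);

    /* Assumed policy: refuse names that could escape the directory. */
    if (username[0] == '\0' || strchr(username, '/') != NULL)
        return -1;

    n = snprintf(path, sizeof(path), "%s/%s", dir, username);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    /* Write the complete new contents to the temporary file first. */
    fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%s\n", tag) < 0) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }

    /* Swap the new file into place in one step. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```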

> 
> > With "rlm_sql" it would take an hour or two, and only then with careful
> > (and human driven) rate management.
> 
>   I'm not sure what that means.  An hour or two to load SQL?  What is it
> doing?
> 

This happens when we have a major network event that causes lots of devices to simultaneously request authentication. Due to the unpredictable loss of threads, we have to manually manage the rate of the incoming authentications by slowly starting small sections of the network at a time.

This process takes us hours of careful (manual) rate management.

> > The main issues driving this delay were:
> > - "rlm_sql" calls during EAP negotiation instead of just at the end of
> > EAP
> 
>   That can be fixed without a new module.
> 

Possibly, but we couldn't find a way. We would be keen to understand the fix for this.

> > - Performance issues on our MySQL backend that we didn't have budget to
> > resolve
> > - Thread lock-ups inside MySQL library yet no MySQL server queries were
> > active
> 
>   I've seen lots of people running MySQL with 300K+ users, and no
> problems.  The system needs to be designed carefully, but it *does* work.
> 

We had no problems during normal operation.  They appeared only when large numbers of devices (typically 10K or more) needed to re-join the network simultaneously.

Do you know if these other sites have those kinds of events?

> 
> It really sounds like your *architecture* is wrong.  Find that and fix it.

I don't agree. We are not simply hitting a performance limit. That did happen, but it was resolved by using:
- proxy FreeRADIUS instances to do some hashing load-balancing
- separate auth and acct servers
- MySQL index, query & deployment tuning

The performance achieved was acceptable (but nowhere near "files").

However, the stability issue would never go away. To me it smells of a race condition somewhere in the MySQL library. As we could only ever reproduce it by cycling 10K or more users, it was proving very difficult to debug.

> Writing a new module should *not* be necessary.
> 

Possibly agree.  Finding and fixing the bug that caused threads to disappear would probably have been better.

But we spent far less time coding & testing a few hundred lines of C code than we did over the previous 18 months trying to reproduce, isolate or work around the MySQL problem.  We gave up.

A nice bonus is that we can now head towards a single server configuration with a file-system database. This will allow us to retire a raft of servers doing proxying, multiple radiusd, and multiple MySQL instances.

Cheers,

Claude.