Detail file handling

Sat May 5 11:08:06 CEST 2007

On Sat 05 May 2007, Alan DeKok wrote:
> Peter Nixon wrote:
> > When running detail2db.pl (the perl import script I posted to the list a
> > few weeks ago) which only uses a single thread and a single DB socket,
> > from a remote machine (same machine as freeradius, not same as
> > PostgreSQL) the DB server load spikes to 2+ and the DB sometimes
> > responds slow enough that FreeRADIUS Auth queries fail and errors are
> > logged (not always.. depends on traffic levels)
>
>   Ouch.  That's not nice.  I had thought that the priority queuing would
> ensure that authentication packets get handled before detail files.
>
>   On the other hand, if the CPU is pegged from handling the detail
> packets, it won't have time to notice that an authentication packet has
> arrived.
>
> > For that reason I have added a 3000 usec sleep in between processing of
> > each packet, which keeps the DB load at about 0.8 while processing a
> > months worth of detail files. Note that perl should be a bit slower than
> > C so a good default sleep time is probably around 5000usec..
>
>   Hmm... Hard-coded sleep times don't adapt well to changing
> circumstances.  Maybe changing the code to require *2* waiting threads
> would be better.  That way, the "max active threads" limit wouldn't be
> reached.  Since that's not reached, more threads won't ever be created
> to handle the flood of detail packets.  And, there will always be one
> waiting thread, which can handle any authentication packet that comes in.

Hi Alan

I think you may have misunderstood the problem and my solution slightly. The
CPU on the RADIUS servers NEVER gets pegged. It's the system load on the
backend DB server which goes high, slowing everything down. In fact it's not
even the CPU that is being pegged, but rather the hard disk(s).
Now, you can always add more, faster hard disks and try to trim down the
amount of data you keep in your DB (so you seek less, and have smaller
indexes), but the fact remains that even with a dedicated DB server like I
have, your hard disks will NEVER be able to keep up with a similarly fast
RADIUS server poking at it full speed with multiple threads, unless you have
at least one disk per radius thread in a RAID mirror so that each thread
gets its own disk head to seek with. This is an unlikely situation as it's
extremely expensive.

Having 2 waiting threads before the detail file reader kicks in is a good 
idea.. Maybe it should even be 3, or configurable... But it still doesn't 
solve the problem above..
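
A minimal sketch of the "N waiting threads" gate discussed above, written in
Python purely for illustration (the server itself is C). The function name
and the min_idle parameter are hypothetical; they are not existing FreeRADIUS
options.

```python
def may_inject_detail_packet(idle_threads: int, min_idle: int = 2) -> bool:
    """Return True only if enough worker threads are currently idle.

    min_idle is the hypothetical configurable threshold (2 or 3 in the
    discussion above). With min_idle=2, the detail reader can take one
    thread and still leave one waiting for an authentication packet.
    """
    if min_idle < 1:
        raise ValueError("min_idle must be at least 1")
    return idle_threads >= min_idle
```

This only limits how many threads the detail reader can occupy. It says
nothing about how hard a single thread can drive the database.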

I have the following queries hitting the database during normal operation:

Auth-Request:
   authorize_check_query - select from stored procedure with multiple joins
      and selects
   authorize_reply_query - select from stored procedure with multiple joins
      and selects
   authorize_group_check_query - default FR select which includes join
   authorize_group_reply_query - default FR select which includes join
   authenticate_query - never gets hit in my setup...
   duplicate_session_killer - single select from radacct
   sqlippool_dynamic - modified default FR select
   sqlippool_static - modified default FR select
   group_membership_query - default FR select
   postauth_query - default FR insert

Now, this chunk of queries happens for EVERY auth packet. At the same time,
on other threads you could have multiple Accounting start, update or stop
packets which are poking new data into both the radacct table and the
sqlippool tables, slowing things down by locking bits of those tables and
doing index updates...

One possible optimisation I could make is to have my duplicate_session_killer
use data from the sqlippool tables (which are fixed in length and therefore
should be faster to update indexes) rather than radacct. I need to benchmark
this though, and haven't seen the need yet to break a working system.

My current config uses 2 separate sql modules, one for auth queries and one
for acct. Both hit the same database. This ensures that auth can never
overwhelm acct and vice versa.. (Maybe some of your new changes will make
this redundant but it certainly improved things when I did it 6 months ago)

Now, as long as you have a relatively even mix of radius packets, spread
apart in time, everything works well. (As I said my DB has a 15min load
average of 0.12 and in normal operation never goes over 0.3)

However when you have a single thread poking accounting data into the system 
at full speed things start to break down. A single thread can push data into 
the radacct table at > 1000 records per second, which pegs the hard disks 
(and works the CPU pretty well also) making the DB spend all its time 
updating indexes (as well as writing to disk of course). 
This work slows down the multi-query auth process to the point where it can
start to time out internally, and it certainly replies late to the NAS.
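
As a rough back-of-the-envelope check, the effect of a per-record sleep on a
single thread's insert rate can be estimated as below. The 1000 usec
per-record cost is an assumption derived from the "> 1000 records per second"
figure above (so it is an upper bound on the real cost), and the sleep is
assumed to be exact. The function name is made up for this example.

```python
def max_records_per_second(per_record_usec: float, sleep_usec: float) -> float:
    """Upper bound on the rate of one thread that sleeps after each record.

    per_record_usec: assumed time to process and insert one record.
    sleep_usec: fixed delay added after each record.
    """
    total_usec = per_record_usec + sleep_usec
    if total_usec <= 0:
        raise ValueError("total time per record must be positive")
    return 1_000_000 / total_usec


# Unthrottled: 1000 records/s. With the 3000 usec sleep: 250 records/s.
unthrottled = max_records_per_second(1000, 0)
with_3000_usec_sleep = max_records_per_second(1000, 3000)
```

Under these assumptions a 3000 usec sleep cuts the single-thread insert rate
to about a quarter, and a 5000 usec sleep to about 167 records per second.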

So, again.. Having the detail reader not inject packets back into the system
unless there are more than X threads free is a good idea, but I still think
that the reader needs a configurable delay in between each packet that it
injects...
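
A minimal sketch of such a configurable per-packet delay, in Python for
illustration only. The names replay_with_delay, handle and delay_usec are
hypothetical and do not correspond to any existing FreeRADIUS setting; handle
stands in for whatever injects one detail record back into the server.

```python
import time
from typing import Callable, Iterable


def replay_with_delay(
    records: Iterable[str],
    handle: Callable[[str], None],
    delay_usec: int = 3000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Feed records to handle() one at a time, pausing after each one.

    delay_usec is the configurable delay in microseconds; 0 disables it.
    The sleep function is a parameter so the pause can be replaced in tests.
    Returns the number of records processed.
    """
    count = 0
    for record in records:
        handle(record)
        count += 1
        if delay_usec > 0:
            sleep(delay_usec / 1_000_000)  # time.sleep() takes seconds
    return count
```

A fixed delay like this does not adapt to load, which is the objection raised
in the quoted text, but it does cap what one dedicated thread can push at the
database.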

It's not the lack of threads that's the issue (although that IS an issue),
it's the amount of load one dedicated thread can create...

Cheers
-- 

Peter Nixon
http://www.peternixon.net/
PGP Key: http://www.peternixon.net/public.asc