fr_packet_cmp again

Wed Apr 27 21:17:07 CEST 2011

Hi,

I compiled the current 2.1.11 branch on RHEL 6.0 and RHEL 5.5 and hit it with 
about 7.5K req/s; it dies on both after about 1-3.5M requests.
At about 1000 req/s it still dies.
It's always a segfault in libfreeradius.
(There are more issues on RHEL 6, but they're not critical.)
It's got nothing to do with max_request_time; tried changing it just in 
case, but I get response time in milliseconds anyway.
It's not about threads; tried with -s, and it still dies.
I'm reasonably sure it's not about load either; I just can't wait something 
like 100000 secs for it to fail.

Throughout, I get decent response times (<20ms auth, <200ms acct) and 
no retries. At least, no 'ignoring dupes' is reported in radius.log; 
in fact there's nothing in radius.log except 'ready'.

In fact, it seems bug #35 is still present:

-----------------
Program received signal SIGSEGV, Segmentation fault.
fr_packet_cmp (a=0x11edc90, b=0xfb0b098dddf10f89) at packet.c:139
139             if (a->sockfd < b->sockfd) return -1;

(gdb) bt
#0  fr_packet_cmp (a=0x11edc90, b=0xfb0b098dddf10f89) at packet.c:139
#1  0x00007ffff7bc551b in list_find (ht=0x85aa80, data=0x7fffffffe1c8)
     at hash.c:191
#2  fr_hash_table_find (ht=0x85aa80, data=0x7fffffffe1c8) at hash.c:454
#3  0x00007ffff7bc5569 in fr_hash_table_finddata (ht=<value optimized out>,
     data=<value optimized out>) at hash.c:484
#4  0x00007ffff7bd30ea in fr_packet_list_find (pl=<value optimized out>,
     request=0x11edc90) at packet.c:583
#5  0x0000000000428569 in received_request (listener=0x8686a0,
     packet=0x11edc90, prequest=0x7fffffffe320, client=0x79d680) at event.c:2833
#6  0x0000000000414ee3 in auth_socket_recv (listener=0x8686a0,
     pfun=0x7fffffffe328, prequest=0x7fffffffe320) at listen.c:857
#7  0x00000000004291a0 in event_socket_handler (xel=<value optimized out>,
     fd=<value optimized out>, ctx=0x8686a0) at event.c:3423
#8  0x00007ffff7bd437b in fr_event_loop (el=0x854c50) at event.c:413
#9  0x000000000041beb4 in main (argc=<value optimized out>,
     argv=<value optimized out>) at radiusd.c:408

(gdb) print a=0x11edc90
$1 = (const RADIUS_PACKET *) 0x11edc90
(gdb) print $1->sockfd
$2 = 10
(gdb) print b=0xfb0b098dddf10f89
$3 = (const RADIUS_PACKET *) 0xfb0b098dddf10f89
(gdb) print $3->sockfd
Cannot access memory at address 0xfb0b098dddf10f89
-----------------

I did this a number of times; sometimes the crash comes via auth_socket_recv, 
but mostly via acct_socket_recv.

So it seems something still frees the data before yanking it from the list...?
Well, beats me.
Any hints?
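
To make the suspicion concrete, here is a minimal sketch of the ordering I have in mind. It is not the FreeRADIUS code: `pkt`, `node`, `pkt_cmp`, `chain_insert`, `chain_find`, `chain_yank` and `pkt_release` are all hypothetical names, the packet is reduced to two fields, and the hash table is reduced to a single chain. The point is only that the lookup dereferences every stored pointer it walks past, so a packet has to be yanked from the chain before it is freed, or a later lookup ends up in the comparison with a garbage `b`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for RADIUS_PACKET. */
typedef struct pkt {
    int sockfd;
    int id;
} pkt;

/* Hypothetical stand-in for one hash bucket: a singly linked chain. */
typedef struct node {
    struct node *next;
    pkt *data;
} node;

/* Compare by sockfd first (as on the faulting line), then by id. */
int pkt_cmp(const pkt *a, const pkt *b)
{
    if (a->sockfd < b->sockfd) return -1;
    if (a->sockfd > b->sockfd) return +1;
    return (a->id > b->id) - (a->id < b->id);
}

/* Push a packet pointer onto the chain; returns 1 on success, 0 on OOM. */
int chain_insert(node **head, pkt *p)
{
    node *n = malloc(sizeof(*n));
    if (!n) return 0;
    n->data = p;
    n->next = *head;
    *head = n;
    return 1;
}

/* Walk the chain. Every stored pointer is dereferenced by pkt_cmp,
 * so a stale entry would fault here, with the stale pointer as 'b'. */
pkt *chain_find(node *head, const pkt *key)
{
    for (node *n = head; n; n = n->next) {
        if (pkt_cmp(key, n->data) == 0) return n->data;
    }
    return NULL;
}

/* Unlink the node that holds exactly this pointer; returns 1 if found. */
int chain_yank(node **head, const pkt *p)
{
    for (node **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->data == p) {
            node *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 1;
        }
    }
    return 0;
}

/* Safe ordering: yank from the chain first, free the packet second. */
void pkt_release(node **head, pkt *p)
{
    assert(p != NULL);
    (void) chain_yank(head, p);
    free(p);
}
```

With this ordering, a lookup after `pkt_release` simply misses. Swapping the two calls in `pkt_release`, or freeing the packet on some other path that skips the yank, would leave a dangling entry of the kind the backtrace suggests.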

Now, how to reproduce:

- get spizd 0.5 from http://sourceforge.net/projects/spizd/files/
- unzip, have java in path
- edit etc/dictionary.txt, enter at least one username:password pair
- run bin/spizd-radius.sh <server>, verify you get Access-Accept
- edit etc/spizd.properties: change verboseThis and verboseThat to 
false, change spizd.circular to true
- run bin/spizd-radius.sh again and wait; the server dies about 10-15 mins later
- optionally, increase maxThreads to kill it faster

What this test does: Access-Request, Acct Start, (optionally a number of 
Interim-Update), Acct Stop, sending each request immediately after the 
response to the previous request is received, one thread per session.
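
For anyone who wants to build the same load without spizd, the per-session sequence described above can be sketched as follows. This is not taken from spizd: `step` and `session_steps` are hypothetical names, and the sketch only produces the order of requests, not the packets themselves. The numeric values are the standard RADIUS ones (packet codes 1 and 4, Acct-Status-Type 1, 2 and 3).

```c
#include <stddef.h>

/* RADIUS packet codes. */
enum { CODE_ACCESS_REQUEST = 1, CODE_ACCOUNTING_REQUEST = 4 };

/* Acct-Status-Type values; ACCT_NONE is a local marker for "not accounting". */
enum { ACCT_NONE = 0, ACCT_START = 1, ACCT_STOP = 2, ACCT_INTERIM_UPDATE = 3 };

/* One request in a session (hypothetical type). */
typedef struct step {
    int code;
    int acct_status;
} step;

/* Fill 'out' with one session: Access-Request, Start, 'interims'
 * Interim-Updates, Stop. Returns the number of steps written,
 * or 0 if 'cap' is too small to hold the whole session. */
size_t session_steps(step *out, size_t cap, size_t interims)
{
    size_t n = 0;

    if (cap < 3 || interims > cap - 3) return 0;

    out[n++] = (step){ CODE_ACCESS_REQUEST, ACCT_NONE };
    out[n++] = (step){ CODE_ACCOUNTING_REQUEST, ACCT_START };
    for (size_t i = 0; i < interims; i++) {
        out[n++] = (step){ CODE_ACCOUNTING_REQUEST, ACCT_INTERIM_UPDATE };
    }
    out[n++] = (step){ CODE_ACCOUNTING_REQUEST, ACCT_STOP };
    return n;
}
```

Each session thread would walk its array in order, sending the next step only once the response to the previous one has arrived.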

Regards...