System frequently stops responding...

Mohamed Lrhazi Mohamed.Lrhazi at georgetown.edu
Mon Jul 27 00:35:06 CEST 2015


Hello,

So I am observing this issue now, with 3.0.8 (cant use 3.0.9 yet until I
figure out another issue, discussed in other thread).

I can observe this while running in -Xx mode (via a bash script that kills
it evey 1 minute, and starts it anew, also starts annd stops tcpdump
captures)

When the issue occurs, I see for example in tcpdump packets arriving, then
being resent as follows:

   5804 16:35:29.443612 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     64         Access-Request(1) (id=64,
l=266)
   5806 16:35:29.494508 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     65         Access-Request(1) (id=65,
l=266)
   5808 16:35:29.622049 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     66         Access-Request(1) (id=66,
l=266)
   5810 16:35:29.717960 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     67         Access-Request(1) (id=67,
l=266)
   5812 16:35:29.813751 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   446    amc358     68         Access-Request(1) (id=68,
l=404)
   5814 16:35:29.837813 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     69         Access-Request(1) (id=69,
l=266)
   5816 16:35:29.856081 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   345    amc358     70         Access-Request(1) (id=70,
l=303)
   5818 16:35:29.882602 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   409    amc358     71         Access-Request(1) (id=71,
l=367)
   5824 16:35:29.938559 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   345    amc358     72         Access-Request(1) (id=72,
l=303)
   5832 16:35:29.962693 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   345    amc358     73         Access-Request(1) (id=73,
l=303)
*   6472 16:36:08.820646 32770    10.128.254.50         10.212.11.14
   1812     RADIUS   295    amc358     182        Access-Request(1)
(id=182, l=253)                               *
   6478 16:36:10.923224 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     182        Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
   6485 16:36:13.023196 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     182        Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
   6492 16:36:15.123466 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     182        Access-Request(1) (id=182,
l=253)
   6500 16:36:17.187964 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     182        Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
   6508 16:36:19.287626 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     182        Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
   6520 16:36:21.388736 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253)
   6550 16:36:23.492189 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
   6555 16:36:24.339371 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253)
   6565 16:36:25.592138 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
   6571 16:36:26.439978 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
   6582 16:36:27.696882 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253)
   6589 16:36:28.540430 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
   6602 16:36:29.796194 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
   6609 16:36:30.641936 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253)
   6620 16:36:31.896351 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     191        Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
   6628 16:36:32.745147 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
   6638 16:36:34.002058 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     207        Access-Request(1) (id=207,
l=253)
   6650 16:36:34.844618 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     196        Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
   6657 16:36:36.100786 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     207        Access-Request(1) (id=207,
l=253), Duplicate Request ID:207
   6667 16:36:36.949503 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     211        Access-Request(1) (id=211,
l=253)
   6673 16:36:38.204572 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     207        Access-Request(1) (id=207,
l=253), Duplicate Request ID:207
   6863 16:36:39.048776 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     211        Access-Request(1) (id=211,
l=253), Duplicate Request ID:211
   6865 16:36:39.065787 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   308    amc358     236        Access-Request(1) (id=236,
l=266)
   6962 16:36:40.309559 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     207        Access-Request(1) (id=207,
l=253)
   7072 16:36:45.225124 32770    10.128.254.50         10.212.11.14
 1812     RADIUS   295    amc358     25         Access-Request(1) (id=25,
l=253)


while in the debug output, I see that while the client is transmitting,
there is no log at all of any activity from radiusd :

Sun Jul 26 16:36:08 2015 : Debug: (1567)       } # else = noop
Sun Jul 26 16:36:08 2015 : Debug: (1567)     } # policy
remove_reply_message_if_eap = noop
Sun Jul 26 16:36:08 2015 : Debug: (1567)   } # Post-Auth-Type REJECT =
updated
Sun Jul 26 16:36:08 2015 : Debug: (1567) Sent Access-Reject Id 106 from
10.212.11.14:1812 to 10.128.254.50:32770 length 0
*Sun Jul 26 16:36:08 2015* : Debug: (1567) Finished request
*Sun Jul 26 16:36:38 2015 *: Debug: PING: Zombie period is over for home
server eduroam1
Sun Jul 26 16:36:38 2015 : Proxy: Marking home server 64.57.22.74 port 1812
as dead.
Sun Jul 26 16:36:38 2015 : Debug: PING: Reviving home server eduroam1 in
300 seconds
Sun Jul 26 16:36:38 2015 : Debug: (1702) <done>: Cleaning up request packet
ID 180 with timestamp +281
Sun Jul 26 16:36:38 2015 : Debug: (1703) <done>: Cleaning up request packet
ID 199 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1704) <done>: Cleaning up request packet
ID 200 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1705) <done>: Cleaning up request packet
ID 201 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1706) <done>: Cleaning up request packet
ID 202 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1707) <done>: Cleaning up request packet
ID 203 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1708) <done>: Cleaning up request packet
ID 204 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1709) <done>: Cleaning up request packet
ID 205 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1710) <done>: Cleaning up request packet
ID 206 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1711) <done>: Cleaning up request packet
ID 207 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1712) <done>: Cleaning up request packet
ID 208 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1713) <done>: Cleaning up request packet
ID 209 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1714) <done>: Cleaning up request packet
ID 210 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1715) <done>: Cleaning up request packet
ID 211 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1716) <delay>: Cleaning up request
packet ID 180 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1567) <done>: Cleaning up request packet
ID 106 with timestamp +255
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 299.9 seconds.
Sun Jul 26 16:36:38 2015 : Debug: (1717) Received Access-Request Id 181
from 10.128.254.50:32770 to 10.212.11.14:1812 length 285
...
Sun Jul 26 16:36:38 2015 : Debug: (1717)   Message-Authenticator =
0x00000000000000000000000000000000
Sun Jul 26 16:36:38 2015 : Debug: (1717)   State =
0x8a28e3f68a2ae7f253f84d3656e9a63d
Sun Jul 26 16:36:38 2015 : Debug: (1717) Finished request
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 4.9 seconds.
*Sun Jul 26 16:36:38 2015 : Debug: (1718) Received Access-Request Id 182
from 10.128.254.50:32770 <http://10.128.254.50:32770> to 10.212.11.14:1812
<http://10.212.11.14:1812> length 253*
*Sun Jul 26 16:36:38 2015 : Debug: (1718)   User-Name = 'amc358'*
Sun Jul 26 16:36:38 2015 : Debug: (1718)   Chargeable-User-Identity = 0x00
...
Sun Jul 26 16:36:38 2015 : Debug: (1718)   } # authenticate = handled
Sun Jul 26 16:36:38 2015 : Debug: (1718) Using Post-Auth-Type Challenge
Sun Jul 26 16:36:38 2015 : Debug: (1718) Post-Auth-Type sub-section not
found.  Ignoring.
Sun Jul 26 16:36:38 2015 : Debug: (1718) # Executing group from file
/etc/raddb/sites-enabled/default
Sun Jul 26 16:36:38 2015 : Debug: (1718) session-state: Nothing to cache
Sun Jul 26 16:36:38 2015 : Debug: (1718) Sent Access-Challenge Id 182 from
10.212.11.14:1812 to 10.128.254.50:32770 length 0
Sun Jul 26 16:36:38 2015 : Debug: (1718)   EAP-Message =
0x0102001604102666c3f0a4e6c47af22d4befc162cf78
Sun Jul 26 16:36:38 2015 : Debug: (1718)   Message-Authenticator =
0x00000000000000000000000000000000
Sun Jul 26 16:36:38 2015 : Debug: (1718)   State =
0x620ca86c620eace0c9c0fc432ad09a6a
Sun Jul 26 16:36:38 2015 : Debug: (1718) Finished request
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 4.9 seconds.


The system load on the VM during the entire time is never higher than 0.9.

What else can I collect for data at this point?

Thanks a lot,
Mohamed.



On Fri, Jul 24, 2015 at 6:38 AM, Alan DeKok <aland at deployingradius.com>
wrote:

> On Jul 24, 2015, at 1:34 AM, Mohamed Lrhazi <Mohamed.Lrhazi at georgetown.edu>
> wrote:
> > Still still trying to get to the bottom of this issue... to summarize:
> > - Wireless controllers log that RADIUS server (a load balanced VIP), did
> > not respond to a query. this is logged in clusters of dozen or so,
> several
> > times a day.
> > - Using docker containers.. so decided to try without them
> > - Built two VMs, RedHat Enterprise 7, running provided freeradius
> > RPMs. 3.0.4
>
>   Please use 3.0.9.  We're not going to debug issues which were tracked
> down and fixed six months ago.
>
> > - Sending the quarter of our traffic to this pool of two VMs.
> > - Issue still occurs on these VMs.
> > - I run radiusd in -Xx mode, on both of the RHEL7 VMs, also run a
> > continuous tcdpump, on each VM.
> >
> > - Problem occurrences seem to reliably coincide with:
> > -- tcpdump shows all the requests logged by the controllers having been
> > resent few times (duplicates in wireshark).
> > -- radiusd goes silent (no log at all) for 30 seconds. after which it
> > resumes logging and I presume, working.
>
>   And.... what does the debug log say during this time?  You should be
> able to correlate timestamps.
>
>   If there's *nothing* in the debug output, then most likely is that
> database is locking up, and preventing FreeRADIUS from doing anything
>
> > - radiusd logs a line for each missed query, I think, like so:
> > Error: (7719) Ignoring duplicate packet from client gu_net_10 port 3010 -
> > ID: 96 due to unfinished request in component <core> module
> >
> > -- Spikes in CPU usage (as seen in sar output).
> >
> > What can I do next? to further zoom in on the root cause? Or is this
> pretty
> > clearly CPU starvation? just add more VMs ?
>
>  Use 3.0.9.
>
>   Alan DeKok.
>
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html
>


More information about the Freeradius-Users mailing list