System frequently stops responding...
Mohamed Lrhazi
Mohamed.Lrhazi at georgetown.edu
Mon Jul 27 00:35:06 CEST 2015
Hello,
So I am observing this issue now, with 3.0.8 (cant use 3.0.9 yet until I
figure out another issue, discussed in other thread).
I can observe this while running in -Xx mode (via a bash script that kills
it evey 1 minute, and starts it anew, also starts annd stops tcpdump
captures)
When the issue occurs, I see for example in tcpdump packets arriving, then
being resent as follows:
5804 16:35:29.443612 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 64 Access-Request(1) (id=64,
l=266)
5806 16:35:29.494508 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 65 Access-Request(1) (id=65,
l=266)
5808 16:35:29.622049 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 66 Access-Request(1) (id=66,
l=266)
5810 16:35:29.717960 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 67 Access-Request(1) (id=67,
l=266)
5812 16:35:29.813751 32770 10.128.254.50 10.212.11.14
1812 RADIUS 446 amc358 68 Access-Request(1) (id=68,
l=404)
5814 16:35:29.837813 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 69 Access-Request(1) (id=69,
l=266)
5816 16:35:29.856081 32770 10.128.254.50 10.212.11.14
1812 RADIUS 345 amc358 70 Access-Request(1) (id=70,
l=303)
5818 16:35:29.882602 32770 10.128.254.50 10.212.11.14
1812 RADIUS 409 amc358 71 Access-Request(1) (id=71,
l=367)
5824 16:35:29.938559 32770 10.128.254.50 10.212.11.14
1812 RADIUS 345 amc358 72 Access-Request(1) (id=72,
l=303)
5832 16:35:29.962693 32770 10.128.254.50 10.212.11.14
1812 RADIUS 345 amc358 73 Access-Request(1) (id=73,
l=303)
* 6472 16:36:08.820646 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1)
(id=182, l=253) *
6478 16:36:10.923224 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
6485 16:36:13.023196 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
6492 16:36:15.123466 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1) (id=182,
l=253)
6500 16:36:17.187964 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
6508 16:36:19.287626 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 182 Access-Request(1) (id=182,
l=253), Duplicate Request ID:182
6520 16:36:21.388736 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253)
6550 16:36:23.492189 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
6555 16:36:24.339371 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253)
6565 16:36:25.592138 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
6571 16:36:26.439978 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
6582 16:36:27.696882 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253)
6589 16:36:28.540430 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
6602 16:36:29.796194 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
6609 16:36:30.641936 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253)
6620 16:36:31.896351 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 191 Access-Request(1) (id=191,
l=253), Duplicate Request ID:191
6628 16:36:32.745147 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
6638 16:36:34.002058 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 207 Access-Request(1) (id=207,
l=253)
6650 16:36:34.844618 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 196 Access-Request(1) (id=196,
l=253), Duplicate Request ID:196
6657 16:36:36.100786 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 207 Access-Request(1) (id=207,
l=253), Duplicate Request ID:207
6667 16:36:36.949503 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 211 Access-Request(1) (id=211,
l=253)
6673 16:36:38.204572 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 207 Access-Request(1) (id=207,
l=253), Duplicate Request ID:207
6863 16:36:39.048776 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 211 Access-Request(1) (id=211,
l=253), Duplicate Request ID:211
6865 16:36:39.065787 32770 10.128.254.50 10.212.11.14
1812 RADIUS 308 amc358 236 Access-Request(1) (id=236,
l=266)
6962 16:36:40.309559 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 207 Access-Request(1) (id=207,
l=253)
7072 16:36:45.225124 32770 10.128.254.50 10.212.11.14
1812 RADIUS 295 amc358 25 Access-Request(1) (id=25,
l=253)
while in the debug output, I see that while the client is transmitting,
there is no log at all of any activity from radiusd :
Sun Jul 26 16:36:08 2015 : Debug: (1567) } # else = noop
Sun Jul 26 16:36:08 2015 : Debug: (1567) } # policy
remove_reply_message_if_eap = noop
Sun Jul 26 16:36:08 2015 : Debug: (1567) } # Post-Auth-Type REJECT =
updated
Sun Jul 26 16:36:08 2015 : Debug: (1567) Sent Access-Reject Id 106 from
10.212.11.14:1812 to 10.128.254.50:32770 length 0
*Sun Jul 26 16:36:08 2015* : Debug: (1567) Finished request
*Sun Jul 26 16:36:38 2015 *: Debug: PING: Zombie period is over for home
server eduroam1
Sun Jul 26 16:36:38 2015 : Proxy: Marking home server 64.57.22.74 port 1812
as dead.
Sun Jul 26 16:36:38 2015 : Debug: PING: Reviving home server eduroam1 in
300 seconds
Sun Jul 26 16:36:38 2015 : Debug: (1702) <done>: Cleaning up request packet
ID 180 with timestamp +281
Sun Jul 26 16:36:38 2015 : Debug: (1703) <done>: Cleaning up request packet
ID 199 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1704) <done>: Cleaning up request packet
ID 200 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1705) <done>: Cleaning up request packet
ID 201 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1706) <done>: Cleaning up request packet
ID 202 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1707) <done>: Cleaning up request packet
ID 203 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1708) <done>: Cleaning up request packet
ID 204 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1709) <done>: Cleaning up request packet
ID 205 with timestamp +282
Sun Jul 26 16:36:38 2015 : Debug: (1710) <done>: Cleaning up request packet
ID 206 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1711) <done>: Cleaning up request packet
ID 207 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1712) <done>: Cleaning up request packet
ID 208 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1713) <done>: Cleaning up request packet
ID 209 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1714) <done>: Cleaning up request packet
ID 210 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1715) <done>: Cleaning up request packet
ID 211 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1716) <delay>: Cleaning up request
packet ID 180 with timestamp +283
Sun Jul 26 16:36:38 2015 : Debug: (1567) <done>: Cleaning up request packet
ID 106 with timestamp +255
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 299.9 seconds.
Sun Jul 26 16:36:38 2015 : Debug: (1717) Received Access-Request Id 181
from 10.128.254.50:32770 to 10.212.11.14:1812 length 285
...
Sun Jul 26 16:36:38 2015 : Debug: (1717) Message-Authenticator =
0x00000000000000000000000000000000
Sun Jul 26 16:36:38 2015 : Debug: (1717) State =
0x8a28e3f68a2ae7f253f84d3656e9a63d
Sun Jul 26 16:36:38 2015 : Debug: (1717) Finished request
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 4.9 seconds.
*Sun Jul 26 16:36:38 2015 : Debug: (1718) Received Access-Request Id 182
from 10.128.254.50:32770 <http://10.128.254.50:32770> to 10.212.11.14:1812
<http://10.212.11.14:1812> length 253*
*Sun Jul 26 16:36:38 2015 : Debug: (1718) User-Name = 'amc358'*
Sun Jul 26 16:36:38 2015 : Debug: (1718) Chargeable-User-Identity = 0x00
...
Sun Jul 26 16:36:38 2015 : Debug: (1718) } # authenticate = handled
Sun Jul 26 16:36:38 2015 : Debug: (1718) Using Post-Auth-Type Challenge
Sun Jul 26 16:36:38 2015 : Debug: (1718) Post-Auth-Type sub-section not
found. Ignoring.
Sun Jul 26 16:36:38 2015 : Debug: (1718) # Executing group from file
/etc/raddb/sites-enabled/default
Sun Jul 26 16:36:38 2015 : Debug: (1718) session-state: Nothing to cache
Sun Jul 26 16:36:38 2015 : Debug: (1718) Sent Access-Challenge Id 182 from
10.212.11.14:1812 to 10.128.254.50:32770 length 0
Sun Jul 26 16:36:38 2015 : Debug: (1718) EAP-Message =
0x0102001604102666c3f0a4e6c47af22d4befc162cf78
Sun Jul 26 16:36:38 2015 : Debug: (1718) Message-Authenticator =
0x00000000000000000000000000000000
Sun Jul 26 16:36:38 2015 : Debug: (1718) State =
0x620ca86c620eace0c9c0fc432ad09a6a
Sun Jul 26 16:36:38 2015 : Debug: (1718) Finished request
Sun Jul 26 16:36:38 2015 : Debug: Waking up in 4.9 seconds.
The system load on the VM during the entire time is never higher than 0.9.
What else can I collect for data at this point?
Thanks a lot,
Mohamed.
On Fri, Jul 24, 2015 at 6:38 AM, Alan DeKok <aland at deployingradius.com>
wrote:
> On Jul 24, 2015, at 1:34 AM, Mohamed Lrhazi <Mohamed.Lrhazi at georgetown.edu>
> wrote:
> > Still still trying to get to the bottom of this issue... to summarize:
> > - Wireless controllers log that RADIUS server (a load balanced VIP), did
> > not respond to a query. this is logged in clusters of dozen or so,
> several
> > times a day.
> > - Using docker containers.. so decided to try without them
> > - Built two VMs, RedHat Enterprise 7, running provided freeradius
> > RPMs. 3.0.4
>
> Please use 3.0.9. We're not going to debug issues which were tracked
> down and fixed six months ago.
>
> > - Sending the quarter of our traffic to this pool of two VMs.
> > - Issue still occurs on these VMs.
> > - I run radiusd in -Xx mode, on both of the RHEL7 VMs, also run a
> > continuous tcdpump, on each VM.
> >
> > - Problem occurrences seem to reliably coincide with:
> > -- tcpdump shows all the requests logged by the controllers having been
> > resent few times (duplicates in wireshark).
> > -- radiusd goes silent (no log at all) for 30 seconds. after which it
> > resumes logging and I presume, working.
>
> And.... what does the debug log say during this time? You should be
> able to correlate timestamps.
>
> If there's *nothing* in the debug output, then most likely is that
> database is locking up, and preventing FreeRADIUS from doing anything
>
> > - radiusd logs a line for each missed query, I think, like so:
> > Error: (7719) Ignoring duplicate packet from client gu_net_10 port 3010 -
> > ID: 96 due to unfinished request in component <core> module
> >
> > -- Spikes in CPU usage (as seen in sar output).
> >
> > What can I do next? to further zoom in on the root cause? Or is this
> pretty
> > clearly CPU starvation? just add more VMs ?
>
> Use 3.0.9.
>
> Alan DeKok.
>
>
> -
> List info/subscribe/unsubscribe? See
> http://www.freeradius.org/list/users.html
>
More information about the Freeradius-Users
mailing list