System frequently stops responding...

Mohamed Lrhazi Mohamed.Lrhazi at georgetown.edu
Tue Jul 28 19:00:59 CEST 2015


Can slowness of a remote server (eduroam in this case)  cause this behavior
am observing here? where the whole radiusd seems to freeze up for 30
seconds?
I've been looking at my LDAP server logs, and do not find signs of any
issues, no requests seem to have taken more than a second (no sub second
timestamps in the logs)....

Thanks,
Mohamed.

On Tue, Jul 28, 2015 at 10:28 AM, Mohamed Lrhazi <
Mohamed.Lrhazi at georgetown.edu> wrote:

> Here is the similar trace, with -T, for logging the time each syscall
> took.. and apparently none actually took the missing 30 seconds!!
>
>
> 18916 08:38:42.908889 write(3, "Tue Jul 28 08:38:42 2015 : Debug"..., 92)
> = 92 <0.000010>
> 18916 08:38:42.908937 write(3, "Tue Jul 28 08:38:42 2015 : Debug"..., 80)
> = 80 <0.000009>
> 18916 08:38:42.908982 write(3, "Tue Jul 28 08:38:42 2015 : Debug"..., 122)
> = 122 <0.000009>
> 18916 08:38:42.909023 sendmsg(26, {msg_name(16)={sa_family=AF_INET,
> sin_port=htons(32770), sin_addr=inet_addr("10.128.254.50")},
> msg_iov(1)=[{"\3\37\0\24\3558.7c\261yC\325\344+I\257\271\234\215", 20}],
> msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...},
> msg_flags=0}, 0) = 20 <0.000056>
> 18916 08:38:42.909134 write(3, "Tue Jul 28 08:38:42 2015 : Debug"..., 59)
> = 59 <0.000011>
> 18916 *08:38:42.909176* wait4(-1, NULL, WNOHANG, NULL) = -1 ECHILD (No
> child processes) <*0.000019*>
> 18916 *08:39:12.500099 *write(3, "Tue Jul 28 08:39:12 2015 : Debug"...,
> 87) = 87 <0.000017>
> 18916 08:39:12.500222 write(3, "Tue Jul 28 08:39:12 2015 : Proxy"..., 85)
> = 85 <0.000011>
> 18916 08:39:12.500274 write(3, "Tue Jul 28 08:39:12 2015 : Debug"..., 85)
> = 85 <0.000010>
>
>
> On Mon, Jul 27, 2015 at 11:07 PM, Mohamed Lrhazi <
> Mohamed.Lrhazi at georgetown.edu> wrote:
>
>> so, I run radiusd -Xx under strace... and it looks like I caught the
>> event... not sure what to make of strace though :
>>
>> This is from the radiusd log:
>>
>> Mon Jul 27 21:54:39 2015 : Debug: (31451)   } # Post-Auth-Type REJECT =
>> updated
>> Mon Jul 27 21:54:39 2015 : Debug: (31451) Sent Access-Reject Id 250 from
>> 10.212.11.14:1812 to 10.128.254.52:32770 length 0
>> *Mon Jul 27 21:54:39 2015 :* Debug: (31451) Finished request
>> *Mon Jul 27 21:55:09 2015 :* Debug: PING: Zombie period is over for home
>> server eduroam1
>> Mon Jul 27 21:55:09 2015 : Proxy: Marking home server 64.57.22.74 port
>> 1812 as dead.
>> Mon Jul 27 21:55:09 2015 : Debug: PING: Reviving home server eduroam1 in
>> 300 seconds
>> Mon Jul 27 21:55:09 2015 : Debug: (31664) <done>: Cleaning up request
>> packet ID 71 with timestamp +4030
>> Mon Jul 27 21:55:09 2015 : Debug: (31665) <done>: Cleaning up request
>> packet ID 72 with timestamp +4030
>> Mon Jul 27 21:55:09 2015 : Debug: (31666) <done>: Cleaning up request
>> packet ID 73 with timestamp +4030
>> Mon Jul 27 21:55:09 2015 : Debug: (31667) <done>: Cleaning up request
>> packet ID 74 with timestamp +4030
>> Mon Jul 27 21:55:09 2015 : Debug: (31759) <delay>: Sending delayed
>> response
>> Mon Jul 27 21:55:09 2015 : Debug: (31759) <delay>: Sent Access-Reject Id
>> 208 from 10.212.11.14:1812 to 10.128.254.50:32770 length 20
>>
>>
>> and this from the strace:
>>
>> 21:54:39.637272 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 64) = 64
>> 21:54:39.637307 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 64) = 64
>> 21:54:39.637343 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 92) = 92
>> 21:54:39.637379 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 80) = 80
>> 21:54:39.637426 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 123) = 123
>> 21:54:39.637476 sendmsg(26, {msg_name(16)={sa_family=AF_INET,
>> sin_port=htons(32770), sin_addr=inet_addr("10.128.254.52")},
>> msg_iov(1)=[{"\3\372\0\24\331\36sKR)\216\23?\263.\17\356#pR", 20}],
>> msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...},
>> msg_flags=0}, 0) = 20
>> 21:54:39.637593 write(3, "Mon Jul 27 21:54:39 2015 : Debug"..., 59) = 59
>> *21:54:39.637630 *wait4(-1, NULL, WNOHANG, NULL) = -1 ECHILD (No child
>> processes)
>> *21:55:09.500134* write(3, "Mon Jul 27 21:55:09 2015 : Debug"..., 87) =
>> 87
>> 21:55:09.500311 write(3, "Mon Jul 27 21:55:09 2015 : Proxy"..., 85) = 85
>> 21:55:09.500369 write(3, "Mon Jul 27 21:55:09 2015 : Debug"..., 85) = 85
>> 21:55:09.500431 write(3, "Mon Jul 27 21:55:09 2015 : Debug"..., 104) = 104
>>
>>
>> and it does look like each time this "freeze" happens, it takes exactly
>> 30 seconds?
>>
>>
>> Thanks,
>> Mohamed.
>>
>> On Sun, Jul 26, 2015 at 6:57 PM, Mohamed Lrhazi <
>> Mohamed.Lrhazi at georgetown.edu> wrote:
>>
>>> PsPageFaults spikes for radiusd process, coinciding with the issue.
>>>
>>> On Sun, Jul 26, 2015 at 6:52 PM, Mohamed Lrhazi <
>>> Mohamed.Lrhazi at georgetown.edu> wrote:
>>>
>>>> I installed collected on this system, and the one metric that seems to
>>>> spike at the same time as the issue, is PsPageFaults for radiusd process.
>>>>
>>>> On Sun, Jul 26, 2015 at 6:41 PM, Mohamed Lrhazi <
>>>> Mohamed.Lrhazi at georgetown.edu> wrote:
>>>>
>>>>> Oh, Alan. B suggested strace... I'll see how I can grab that... I am
>>>>> sure I could not possibly run radiusd under tracing the whole time right?
>>>>>
>>>>> On Sun, Jul 26, 2015 at 6:36 PM, Mohamed Lrhazi <
>>>>> Mohamed.Lrhazi at georgetown.edu> wrote:
>>>>>
>>>>>>
>>>>>> I can observe this while running in -Xx mode (via a bash script that
>>>>>>> kills it evey 1 minute, and starts it anew, also starts annd stops tcpdump
>>>>>>> captures)
>>>>>>
>>>>>>
>>>>>>
>>>>>> No, every 3600 seconds, actually. Sorry.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


More information about the Freeradius-Users mailing list