intermittent auth issue (proxy: request is no longer in proxy hash)

Fri Jul 4 01:55:55 CEST 2014

On Thu, Jul 3, 2014 at 6:30 PM, Alan DeKok <aland at deployingradius.com> wrote:
> Chris Knipe wrote:
>> Ok.  So what is the value of the timeout? Is it configurable, and if
>> not, can we get a option to configure the timeout value? I don't see
>> anything in proxy.conf, so my apologies if I missed this.
>
>   Then you deleted it.  The configuration items are in the default
> proxy.conf file, with documentation.
>
>   Also, the client may time out.  You can't control that.

I looked at the -default- proxy.conf, that comes with FR3.0.3 which is
what I am using.  There isn't anything there in terms of request
timeouts, just timeouts which controls when the proxy goes into a dead
and/or zombie state - which is not what we are discussing here (again
sorry if I missed it - a reference may thus be a good option). The
proxy is never marked as dead, nor as a zombie, otherwise the natural
thing for me WOULD HAVE been to look at and consider timeouts - that's
why zombie / dead server configurations are there, isn't it?

The client that sends the request, the FR server that does the
proxing, as well as the FR server that receives the proxied request is
under my control and neither one of them are timing out.

Request:
My Application -> FR Proxy -> FR Client, works fine
FR Client -> FR Proxy -> My Application, and the FR Proxy then gets
this weird behaviour when it decides on it's own when to, and when not
to execute post_proxy.

>>  All the
>> requests also generally come back to me at the same time (as posted
>> previously in the logs).  Why is post_proxy then executed for one
>> request, but not the other?  They arrive back at FR at the same time,
>> baring a few us...
>
>   That's what the debug log is for.

Please try and be helpful. I would appreciate it.  I *DID* post the
debug logs, both of a request that works, as well as a request that
does not work.  It is clear that for SOME reason, FR executes
post_proxy in one request, and not in another.  The attributes
received in both packets are the same.  This is NOT a configuration
issue, nor a timeout issue IMHO.  If I knew why FR is behaving like
this, I would have fixed it.  I am here however, to seek assistance to
try and find out WHY this is happening.

>> If you read the two responses posted to the packets, you will also see
>> that both responses gave the proxy hash error, yet the one is still
>> passed through post_proxy, and the other is not..
>
>   That shouldn't happen.  Please try the v3.0.x branch from git.

But it *IS* happening - that's the whole point of why I am here asking
'experts' as to WHY it is happening.  I will however try the latest
code and see whether the issue persist.

I have since modified my rlm_perl application to use the authorize
section to proxy requests to the parent radius server myself.  In
other words, I am no longer using FR's proxy implementation, but
rather do it myself using perl.  It's been running like that now for
close to 12 hours, and I have not had *one* single Authentication
error.  There is *something* in FR's proxy code causing this issue,
and it's related to high amounts of "concurrent" requests for the same
username. I am sending the exact same packet via authorize, instead of
the FR proxy modules, and it's working.  It is thus NOT a timeout
issue, it is NOT related to attributes received / not received, but
rather, related to SOMETHING that the proxy modules does / does not do
when a response is received.  Again, as Phil said, it's more than
likely too soon to call a bug, but it sure is leaning towards that and
looking like it.

Phil in terms of your questions, if I may rather just reply to
everything in one email...

Let's first get some background.

The authentication and accounting is used for user authentication
pertaining to NNTP server software - it is NOT NAS orientated.  The
NNTP Server is written by ourselves, and for local non proxied realms,
authentication works without any problem whatsoever.

Now, when a user connects, the user is entitled to x amount of
concurrent threads to the NNTP server (typically 30).  Generally, the
client applications used by the customer connects all threads at the
same time to the server.  Hence, in a VERY short period of time (sub
10uS) 30 odd identical authentication requests are sent off to FR.
The only attribute that is different in the authentication request, is
the NAS-Port-Id, which we assign the value of the pid of the thread
that the client has started on the NNTP servers.

A typical authentication request will look like this:
Fri Jul  4 00:30:25 2014
Packet-Type = Access-Request
Called-Station-Id = 'a.b.8.42'
Calling-Station-Id = 'c.d.96.24'
NAS-IP-Address = a.b.240.36
NAS-Port-Type = Virtual
NAS-Port-Id = '37713'
Service-Type = Authenticate-Only
User-Name = 'user at local.realm'
User-Password = 'password'

FR then accepts the packet, and almost exclusively relies on rlm_perl
to authenticate the user.  rlm_perl also configures various items such
as rate limits and does session verification on a custom hardware load
balancer.  Typically, rlm_perl returns Auth-Type Accept (with reason),
and a Configuration-Token attribute that gets sent back to the NNTP
server.  rlm_perl handles authentication generally under 0.010 seconds
(measured from where FR passes the request to rlm_perl, until when
rlm_perl executes the return back to FR).

A typical response to an access-request looks like this (this is
transmitted back to the NNTP software stack):
Fri Jul  4 00:34:54 2014
Packet-Type = Access-Accept
Configuration-Token =
'OK:37F46FA8-E538-497D-8B05-8D48FF2EC745:user at local.realm:T:1406584800:30:Welcome,
user at local.realm (1/30)'

This entire authentication process on local non-proxied realms works
absolutely flawlessly.  There is nothing wrong with this at all, it's
fast, the custom code is highly optimized and we're extremely happy
with this.

When we now get to proxied requests, we have resellers that use our
NNTP servers as well.  Our NNTP servers sends the same authentication
request to our FR server, the FR server sees that the realm needs to
go to a proxy server, and then does the following:

The packet passes through attr.filter, where we re-write a few
attributes, as well as remove various unneeded attributes:
my attr_filter.pre-proxy contains:
DEFAULT
  Called-Station-Id == a.b.8.42,
  Calling-Station-Id =* ANY,
  NAS-IP-Address := a.b.8.44,
  NAS-Port-Id =* ANY,
  NAS-Port-Type == Virtual,
  Service-Type == Authenticate-Only,
  User-Name =* ANY,
  User-Password =* ANY,
  Proxy-State =* ANY

FR then sends the proxy request off directly to the remote FR server
(let's leave radsecproxy out of this for now. It was added as I
thought the issue may have been caused by packet loss but the issue
remains the same, whether the proxied request goes direct to the
remote FR server or when it goes via radsecproxy).  The auth request
proxied generally looks like this:
Thu Jul  3 00:02:31 2014
Packet-Type = Access-Request
NAS-Port-Type = Virtual
Service-Type = Authenticate-Only
Calling-Station-Id = 'c.d.8.74'
Called-Station-Id = 'a.b.8.42'
User-Name = 'user at remote.realm'
User-Password = 'aquiaP1eec'
Realm = 'remote.realm'
NAS-IP-Address = a.b.240.36
Message-Authenticator := 0x00
Proxy-State = 0x3438

All the authentication requests transmitted, DOES make it to the
remote FR server, and the remote FR server is a STOCK STANDARD 2.1.12
install that comes with Ubuntu.  All usernames & passwords in the
users file.  There's NOTHING changed, and NOTHING special.  A stock
standard install with just the parent FR server sending it requests
configured in clients.conf.  The server obviously responds virtually
instantly with an access accept, as such:
Thu Jul  3 02:16:50 2014
Packet-Type = Access-Accept
Configuration-Token = 'OK:100:15:V:26843545600'
Reply-Message = 'Welcome'
Service-Type = Authenticate-Only
Proxy-State = 0x3339

Now, THIS is where I think the issue is coming in.  The accept packet
from the remote server, will *always* be identical.  The
Configuration-Token returned will change over time, but for all
initial 30 or 40 "simultaneous" authentication requests, the same
Configuration-Token is returned.  The main FR server thus sees 30
identical packets, to 30 different requests.

What happens now, is that FR will take the 1st response it receives,
and will push that through post_proxy.  The remaining responses for
some reason goes directly to post_auth, instead of what should be,
post_proxy.  The problem ONLY presents itself in requests sent to
proxied servers, and it ONLY presents itself when there is a high
amount of "concurrent" requests... Say, more than 10 within a very
short period of time.  This is essentially what I tried to show by
posting the radsecproxy log, together with the log from my perl
module.

The Configuration-Token returned by the proxied server, and the
Configuration-Token expected by our NNTP servers has different
formats.  The idea thus, is that the FR server *must* execute
post_proxy for each and every response it receives from a proxied
request, so that the sessions can be configured / confirmed on the
load balancers, before the Configuration-Token is then re-written
before being transmitted back to the application.  Again, for
intermediate requests, this works absolutely fine, and FR executes
post_proxy as we all expect.  This behaviour is *only* seen on a large
amount of concurrent requests, and I suspect, is triggered when
identical auth accepts are returned from the proxy server. This is
somewhat irrelevant though, as my post_proxy rlm_perl module is
useless seeing that post_proxy section is not executed in it's
*entirety* when FR does this, not even post_proxy_log is called...

--
Chris.

-- 

Regards,
Chris Knipe