Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond
Alan DeKok
aland at deployingradius.com
Thu Nov 12 13:44:44 CET 2020
On Nov 12, 2020, at 1:23 AM, Ignacio Arces <ignacio.arces at gmail.com> wrote:
>
> I'm running a containerized FreeRADIUS server v3.0.19 with a custom
> authentication module written in C that authenticates users
> through an HTTP API.
v3 has rlm_rest, which should be good enough for most purposes.
> We recently experienced an outage in the auth API and since we didn't have
> timeouts properly configured in the curl calls in our custom C module, the
> requests were hanging indefinitely.
Yes, that's the downside of a blocking design. :(
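As a rough illustration of bounding a blocking back-end call, here is a minimal sketch in Python rather than in the C module discussed above. The function name `call_backend` and its parameters are hypothetical. With libcurl, the corresponding options are CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT.

```python
import urllib.request


def call_backend(url, timeout_s):
    """Return the response body, or None if the back end fails or is too slow.

    Hypothetical helper, not part of FreeRADIUS or of the poster's module.
    """
    try:
        # Without a timeout, this call can block the worker thread forever.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # URLError, refused connections, and socket timeouts all derive
        # from OSError, so a slow or dead back end ends up here.
        return None
```

Note that in this sketch the timeout applies to each blocking socket operation, not to the request as a whole, so it is an upper bound on any single stall rather than a strict deadline.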
> When this happened, we also noticed
> that our containerized server was restarted by Docker as the container was
> set to "Unhealthy" state, so the health checks were failing.
> Troubleshooting the health checks we found that Status-Server requests were
> not responding while the auth request was hanging waiting for the auth API
> to respond.
Yes. That's how it works. The Status-Server packets are processed by the same threads which process the Access-Requests. So if all of those threads are blocked, then Status-Server packets are also blocked.
> Now that we have a 10s timeout properly configured in our curl requests, we
> have mitigated the undesired restarts but we still can't understand why even
> a single stuck auth request is impacting Status-Server requests.
If *one* Access-Request packet is blocked, then other threads can still process Status-Server. So no, you don't see a "single stuck auth request impacting Status-Server".
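The thread-pool behaviour described above can be sketched with a toy model. This is not FreeRADIUS code. The names `status_answered`, `pool_size`, and `stuck_requests` are made up for illustration, and the "back end" is just an event that never fires until the probe is over.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def status_answered(pool_size, stuck_requests, wait=0.5):
    """Return True if a status probe is answered within `wait` seconds
    while `stuck_requests` auth requests are blocked on the back end."""
    release = threading.Event()

    def access_request():
        release.wait()          # simulates a back-end call that never returns
        return "Access-Accept"

    def status_server():
        return "alive"          # trivial work, but it still needs a free thread

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(stuck_requests):
            pool.submit(access_request)
        probe = pool.submit(status_server)
        try:
            probe.result(timeout=wait)
            answered = True
        except TimeoutError:
            answered = False
        finally:
            release.set()       # unblock the workers so the pool can shut down
    return answered
```

In this model, `status_answered(4, 1)` is answered because three threads are still free, while `status_answered(4, 4)` times out because the probe is queued behind four blocked workers.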
The goal of Status-Server is to see if the server is up and *working*. Maybe the server is running, but is unable to process any packets. In that case, yes, you *do* want it to stop processing Status-Server.
This situation also falls into the standard design requirements for RADIUS: If the RADIUS server is critical, then _any_ system which is used by RADIUS is also critical. Make sure that those systems are (a) up, and (b) responsive.
It makes zero sense to have a back-end database (or REST API) take 10 seconds to respond to a request. The solution here isn't to hack up the RADIUS server to do something magical. The solution is to make the back-end system *not* crap.
Alan DeKok.