Status-Server requests are blocked if an Access-Request is waiting for downstream service to respond

Alan DeKok aland at deployingradius.com
Thu Nov 12 13:44:44 CET 2020


On Nov 12, 2020, at 1:23 AM, Ignacio Arces <ignacio.arces at gmail.com> wrote:
> 
> I'm running a containerized FreeRADIUS server v3.0.19 with a custom
> authentication module written in C that authenticates users
> through an HTTP API.

  v3 has rlm_rest, which should be good enough for most purposes.

> We recently experienced an outage in the auth API and since we didn't have
> timeouts properly configured in the curl calls in our custom C module, the
> requests were hanging indefinitely.

  Yes, that's the downside of a blocking design.  :(
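
  As a rough illustration of bounding a blocking back-end call, here is a minimal sketch. It is in Python rather than C, and uses the standard library instead of libcurl, purely to keep it short. In a libcurl-based module the corresponding knobs would be CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT. The function name check_backend and the 2 second default are assumptions made for the example, not anything from the original module.

```python
import http.client
import urllib.request


def check_backend(url, timeout_seconds=2.0):
    """Return True if the back end answers with a 2xx status in time.

    Hypothetical helper for illustration only. Note that the urllib
    timeout applies to each blocking socket operation, not to the whole
    transfer, so it is a looser bound than libcurl's CURLOPT_TIMEOUT.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection failures, timeouts, HTTP error statuses
        # (URLError and HTTPError are OSError subclasses), and
        # malformed replies.
        return False
```

  The point is only that the caller gets control back after a bounded wait, so a dead API costs a worker thread a couple of seconds instead of forever.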

> When this happened, we also noticed
> that our containerized server was restarted by Docker as the container was
> set to "Unhealthy" state, so the health checks were failing.
> Troubleshooting the health checks we found that Status-Server requests were
> not responding while the auth request was hanging waiting for the auth API
> to respond.

  Yes.  That's how it works.  The Status-Server packets are processed by the same threads which process the Access-Requests.  So if all of those threads are blocked, then Status-Server packets are also blocked.
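
  The effect can be reproduced with a toy model. The sketch below is not FreeRADIUS code. It uses a fixed-size Python thread pool as a stand-in for the server's worker threads, and the names run_demo, access_request and status_server are made up for the example. It measures how long a "Status-Server" job waits before any worker picks it up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_demo(pool_size, stuck_requests, block_seconds=0.5):
    """Return how long a status job waits behind blocked workers."""
    release = threading.Event()

    def access_request():
        # Stands in for a module blocked on a slow back-end call.
        release.wait(block_seconds)

    def status_server(submitted_at):
        # Runs only once a worker is free; the result is the queueing delay.
        return time.monotonic() - submitted_at

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(stuck_requests):
            pool.submit(access_request)
        delay = pool.submit(status_server, time.monotonic()).result()
        release.set()  # let any still-blocked workers finish promptly
    return delay


if __name__ == "__main__":
    print("one of two workers stuck:", run_demo(2, 1))
    print("both workers stuck:      ", run_demo(2, 2))
```

  With one worker stuck out of two, the status job is served almost immediately. With both stuck, it waits until one of the blocked calls returns, which is the behaviour described above.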

> Now that we have a 10s timeout properly configured in our curl requests, we
> have mitigated the undesired restarts but we still can't understand why even
> a single stuck auth request is impacting Status-Server requests.

  If *one* Access-Request packet is blocked, then other threads can still process Status-Server.  So no, you don't see a "single stuck auth request impacting Status-Server".

  The goal of Status-Server is to see if the server is up and *working*.  Maybe the server is running, but is unable to process any packets.  In that case, yes, you *do* want it to stop processing Status-Server.
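
  For a health check that exercises the real packet path, something along these lines could be used. This is a minimal sketch based on the packet layout in RFC 5997 (code 12, random Request Authenticator, mandatory Message-Authenticator). The function names, the timeout, and the choice to treat any well-formed reply as "healthy" are assumptions for the example. It does not validate the Response Authenticator, and the server must have Status-Server enabled for the client.

```python
import hashlib
import hmac
import os
import socket
import struct

STATUS_SERVER = 12          # RADIUS packet code (RFC 5997)
MESSAGE_AUTHENTICATOR = 80  # attribute type (RFC 2869)


def build_status_server(secret, packet_id=0):
    """Build a Status-Server packet signed with Message-Authenticator."""
    authenticator = os.urandom(16)   # random Request Authenticator
    length = 20 + 18                 # header plus one 18-byte attribute
    header = struct.pack("!BBH", STATUS_SERVER, packet_id, length)
    header += authenticator
    attr_head = struct.pack("!BB", MESSAGE_AUTHENTICATOR, 18)
    # HMAC-MD5 over the whole packet, with the attribute value zeroed.
    digest = hmac.new(secret, header + attr_head + bytes(16),
                      hashlib.md5).digest()
    return header + attr_head + digest


def probe(host, port, secret, timeout_seconds=1.0):
    """Return True if any RADIUS-sized reply arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_seconds)
        try:
            sock.sendto(build_status_server(secret), (host, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Timeout, ICMP port unreachable, or other socket failure.
            return False
    return len(reply) >= 20
```

  A check like this fails when the workers are all blocked, which is the intended signal: the process is up, but it is not doing useful work.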

  This situation also falls into the standard design requirements for RADIUS: If the RADIUS server is critical, then _any_ system which is used by RADIUS is also critical.  Make sure that those systems are (a) up, and (b) responsive.

  It makes zero sense to have a back-end database (or REST API) take 10 seconds to respond to a request.  The solution here isn't to hack up the RADIUS server to do something magical.  The solution is to make the back-end system *not* crap.

  Alan DeKok.




