mysql and utf8 handling in FreeRADIUS 3.2.x
Bjørn Mork
bjorn at mork.no
Thu Apr 3 11:35:59 UTC 2025
Not sure what to do about this, so I'm not proposing any solution. Just
wanted to describe the problem for future reference. And maybe someone
smarter knows how to "fix" it. Which is mostly about helping users
DTRT.
Our FR config does stupid(?) stuff like converting binary attributes to
strings and then saving the result to varchar fields in a mysql
database. The assumption is that the binary attribute (the actual
example is ADSL-Agent-Circuit-Id) really contains ascii strings. Which
it does most of the time. There are of course some exceptions, with an
assorted set of L2 devices on all the different access technologies you
can think of. Those bytes are escaped as quoted-printable chars by FR,
so we don't need to care.
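For context, the value ends up in SQL through an ordinary xlat expansion
in one of our queries, along these lines (the table name here is made up,
the column name is the one from the error further down; the '%{...}'
expansion is where sql_escape_func gets applied):

  INSERT INTO subscriber_lines (ADSLAgentCircuitId)
  VALUES ('%{ADSL-Agent-Circuit-Id}')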
For historical reasons, the mysql table we write this field to has
DEFAULT CHARSET=latin1
After an OS upgrade (to RHEL9, FWIW) I ended up with a mysql client
library defaulting to utf8 connections. Should not be a problem, since
mysql handles the translation. But then I hit some of those binary
sequences which FreeRADIUS happened to validate as multi-byte utf8, and
which therefore were NOT escaped.
However, mysql didn't know how to translate them into latin1 and bailed
out with:
rlm_sql_mysql: ERROR 1366 (Incorrect string value: '\xDF\xA4k=5C...' for column 'ADSLAgentCircuitId' at row 1): HY000
I know I shot myself in the foot. There are several workarounds for this
problem, and I have implemented one that works for me, as suggested in
the default configuration files:
Create a my.cnf file with
[freeradius]
default-character-set = latin1
and set MYSQL_HOME to the directory with this file before starting FR.
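As far as I understand it, this just makes the client library negotiate
latin1 for the connection, i.e. roughly the same effect as if every
session started like this (a sketch; 'bar' is the latin1 test table from
the demo further down):

  SET NAMES latin1;
  -- from here on the bytes FR sends are taken to be latin1, and since every
  -- byte value is a valid latin1 character nothing can fail to convert:
  INSERT INTO bar VALUES (_latin1 X'DFA4');   -- stored as the two characters 'ß¤'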
But the reason I am posting this is that the sql_escape_func() in
rlm_sql surprised me, and I'm not sure the logic there makes sense. By
default it escapes any character outside the "safe_characters" set:
"@abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_: /"
Except that it doesn't touch any byte sequence which validates as
multi-byte utf8:
/*
 *      Allow all multi-byte UTF8 characters.
 */
utf8_len = fr_utf8_char((uint8_t const *) in, -1);
if (utf8_len > 1) {
        if (outlen <= utf8_len) break;
        memcpy(out, in, utf8_len);
        in += utf8_len;
        out += utf8_len;
        outlen -= utf8_len;
        len += utf8_len;
        continue;
}
And fr_utf8_char() implements
https://datatracker.ietf.org/doc/html/rfc2279#section-2
correctly as far as I can see. Meaning that for example any
two-byte sequence matching the bit pattern
"110xxxxx 10xxxxxx"
is accepted as utf8 and therefore not escaped. My example above,
'\xDF\xA4', matches this pattern (0xDF = 11011111, 0xA4 = 10100100,
decoding to U+07E4), so the lack of escaping is expected.
This works as long as client and table use the same character set. But
as shown, it fails when there is a mismatch.
The failure can be easily demonstrated with a command line client using
a utf8 connection:
root at 127.0.0.1 (foo)> \s
--------------
mysql Ver 8.0.41 for Linux on x86_64 (MySQL Community Server - GPL)
Connection id: 12610
Current database: foo
Current user: root at localhost
SSL: Cipher in use is TLS_AES_256_GCM_SHA384
Current pager: less
Using outfile: ''
Using delimiter: ;
Server version: 8.0.41 MySQL Community Server - GPL
Protocol version: 10
Connection: 127.0.0.1 via TCP/IP
Server characterset: utf8mb4
Db characterset: utf8mb4
Client characterset: utf8mb4
Conn. characterset: utf8mb4
TCP port: 15042
Binary data as: Hexadecimal
Uptime: 8 days 23 hours 17 min 24 sec
Threads: 12 Questions: 160166 Slow queries: 0 Opens: 275 Flush tables: 3 Open tables: 192 Queries per second avg: 0.206
--------------
Using two tables that are identical except for their default charset:
root at 127.0.0.1 (foo)> show create table bar\G
*************************** 1. row ***************************
Table: bar
Create Table: CREATE TABLE `bar` (
  `baz` varchar(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
root at 127.0.0.1 (foo)> show create table bar2\G
*************************** 1. row ***************************
Table: bar2
Create Table: CREATE TABLE `bar2` (
  `baz` varchar(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
We can insert the bytes \xDF\xA4 (U+07E4 when read as utf8) into the
second table, but not into the first one:
root at 127.0.0.1 (foo)> insert into bar values ('ߤ');
ERROR 1366 (HY000): Incorrect string value: '\xDF\xA4' for column 'baz' at row 1
root at 127.0.0.1 (foo)> insert into bar2 values ('ߤ');
Query OK, 1 row affected (0.00 sec)
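To me the whole mess boils down to the same two bytes having two valid
but different interpretations, e.g. (hypothetical one-liner using
character set introducers):

  SELECT _latin1 X'DFA4' AS as_latin1, _utf8mb4 X'DFA4' AS as_utf8mb4;
  -- as_latin1:  'ß¤'  two characters, both representable in a latin1 column
  -- as_utf8mb4: 'ߤ'   one character, U+07E4, with no latin1 equivalent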
I have several questions after hitting this hard:
Should FR handle the situation better, detecting the character set
mismatch and automatically escaping utf8 multi-byte sequences when the
target table uses a different charset?
Should there be a way to configure rlm_sql to always escape utf8
multi-byte sequences?
Does the default aggressive ascii escaping really align with allowing
any multi-byte utf8? The end results look strange and unexpected IMHO.
Any thoughts outside "don't do that then"?
Bjørn