ISOCPP liaison List: [isocpp-wg14/wg21-liaison] In preparation for the Brno meeting

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Tue, 24 Jun 2025 11:05:30 +0000

Dear WG14,

I have just returned from my final WG21 meeting. I have been
replaced on nb-chairs, unsubscribed from all mailing lists bar this one,
and shortly I will be removed from WG21 on the ISO directory. I have
left the Boost C++ libraries mailing lists, and even removed WG21 from
my Reddit flair on /r/cpp!

I am therefore now fully divested from C++, and I have fully turned my
attention to C. You may have noticed the following 'big three' papers
from me recently:

- Standard secure networking
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3533.pdf

- Modern signals handling
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3540.pdf

- Lingua franca results
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3599.pdf

Which brings me to topics not really discussed in earnest since
probably covid times, one of which is 'what to do about null terminated
strings in C?'

That discussion was a long time ago now, so my apologies if I have
misremembered it, but that's why I'm writing this email now (also my
thanks to Bengt and Guy for reminding me about this during the WG21
meeting). I think a large majority of WG14 would like a replacement for
null terminated strings in new code, and POSIX has also indicated an
interest if WG14 can come up with something acceptable at the syscall
level. There have been several 'improved C string API' proposals going
around recently, but they are actually orthogonal to this problem:

1. How should we efficiently implement variably lengthed sequences of
octets other than putting a sentinel value at the end?

2. How, additionally, should we allow treating that variably lengthed
sequence of octets - or a subset thereof - as a variably lengthed UTF-8
codepoint sequence?

At the time of that discussion many years ago, I remember various
members of WG14 having strong opinions that:

1. This replacement for `const char *` must work everywhere standard C
works i.e. it cannot use malloc, and it cannot require UTF tables to be
bundled with each binary.

2. This replacement for `const char *` must have a minimum possible
space overhead over null terminated strings.

3. This replacement for `const char *` must have identical format across
architectures so program A can work with strings serialised by program B.

4. This replacement for `const char *` must not damage the useful
properties of a UTF-8 sequence i.e. self synchronisation whereby if
given a pointer into an arbitrary sequence of octets, the current UTF-8
codepoint can always be found easily by scanning backwards or forwards
by up to three octets.
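
As an illustration of the self synchronisation property in point 4, here
is a minimal sketch of finding the start of the current codepoint from an
arbitrary position. The helper names are hypothetical, and the code
assumes the buffer holds well-formed UTF-8 and that `pos` is in range.

```c
#include <stddef.h>

/* Continuation octets in UTF-8 always have the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char c)
{
    return (c & 0xC0u) == 0x80u;
}

/* Hypothetical helper: returns the index of the lead octet of the
 * codepoint containing buf[pos], stepping back at most three octets.
 * Assumes well-formed UTF-8 and pos within the buffer. */
static size_t utf8_sync_back(const unsigned char *buf, size_t pos)
{
    size_t steps = 0;
    while (pos > 0 && steps < 3 && utf8_is_continuation(buf[pos])) {
        --pos;
        ++steps;
    }
    return pos;
}
```

Any header placed in front of the data would need to leave this kind of
scan working for the octets that follow it.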

In other words, 'how should we upgrade `const char *` in a backwards
compatible way not getting in the way of UTF-8 parsing?'

At the time of that discussion, I postulated that the unused encoding
space within UTF-8 could be used to describe a maximally space efficient
header to describe the length of a following sequence of octets. So, to
be clear:

1. C would gain a new type `varoctet_t *` which points at the header of
a variably lengthed array of octets.

2. The header would be encoded in a way guaranteed to be invalid UTF-8.

3. The header would describe the length of the octets following the
header and would be one of these sizes: 1 octet, 2 octets, 4 octets or
8 octets. This ensures that the array data would always begin on an
aligned boundary. The smallest header length would be chosen where possible.

4. Following the header octets, N octets of data would follow and N
should be able to exceed 4 GiB.

5. Null termination of `varoctet_t *` is optional, and can be easily
tested by inspecting the value at arr[length].
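
The header encoding itself is left open above. As a starting point for
point 2, the sketch below only identifies the octet values that can never
occur anywhere in well-formed UTF-8 (0xC0, 0xC1 and 0xF5 to 0xFF), which
makes them candidates for the first octet of a header. The function names
are hypothetical and no particular header layout is implied.

```c
#include <stddef.h>

/* True for octet values that never appear in well-formed UTF-8:
 * 0xC0 and 0xC1 could only start overlong encodings, and 0xF5..0xFF
 * would encode values beyond U+10FFFF or are not lead octets at all. */
static int utf8_never_valid(unsigned char c)
{
    return c == 0xC0u || c == 0xC1u || c >= 0xF5u;
}

/* Counts how many of the 256 octet values are never valid. */
static size_t utf8_never_valid_count(void)
{
    size_t n = 0;
    for (unsigned v = 0; v <= 0xFFu; ++v)
        n += (size_t)utf8_never_valid((unsigned char)v);
    return n;
}
```

Thirteen octet values qualify. Continuation octets (0x80 to 0xBF) are
excluded because they do occur inside valid multi-octet sequences.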

Would WG14 like to see a paper for Brno which investigates header
encoding options to see what the tradeoffs and overheads would be over
null terminated strings?

To be clear, null terminated strings will always be the most storage
efficient (this is why they were chosen!) but they add _considerable_
runtime inefficiency in exchange, never mind security issues. `varoctet`
will always be less storage efficient, but you can avoid scanning the
whole array in many places and it helps prevent a whole class of
security attack.
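
A minimal sketch of the runtime difference, assuming a hypothetical
decoded view type `struct octets_view` (not defined anywhere in this
email): with a sentinel the length is only known after a full scan,
whereas with a stored length a comparison can reject unequal lengths
without reading the data at all.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical decoded view of a length-carrying octet array. */
struct octets_view {
    const unsigned char *data;
    size_t length;
};

/* Null terminated: the length requires scanning to the sentinel. */
static size_t cstr_length(const char *s)
{
    return strlen(s); /* O(n) in the length of the string */
}

/* Length-carrying: unequal lengths are rejected in constant time,
 * and the data comparison is bounded by the stored length. */
static int octets_equal(struct octets_view a, struct octets_view b)
{
    if (a.length != b.length)
        return 0;
    return a.length == 0 || memcmp(a.data, b.data, a.length) == 0;
}
```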

Niall

Received on 2025-06-24 11:05:32