Re: [isocpp-sg16] Thoughts on P4030R0: Endian Views

From: JF Bastien <cxx_at_[hidden]>
Date: Mon, 22 Jun 2026 06:05:00 +0900
On Mon, Jun 22, 2026 at 1:39 AM Corentin Jabot via SG16 <
sg16_at_[hidden]> wrote:

> Let's add mandates for CHAR_BIT == 8 everywhere UTF-related. UTF is
> simply not defined in other scenarios.
> The vote on P3477 does not change the fact that the maintenance of a C++
> compiler for such a platform that is less than 2 decades old has not been
> demonstrated. P3477 was certainly aiming to prevent these discussions!
>

Indeed. There’s been no movement in compilers and libraries to support
non-8-bit bytes since the vote either.



> I think the other concerns are orthogonal. Automagic BOM recognition
> certainly doesn't belong here, even for char8_t: it's not free and it's
> error-prone. An adaptor could be constructed to provide that functionality.
> (There are a couple of application-level choices here that the standard
> should not try to answer even if we provide more convenience: do you
> recognize the BOM? Do you preserve it?)
>
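
A minimal sketch of the kind of application-level BOM adaptor mentioned above, assuming the input is a buffer of octets. The names bom_kind, bom_result, and detect_bom are hypothetical and do not come from P4030R0. Note that FF FE 00 00 is treated as a UTF-32LE BOM here, although it could equally be a UTF-16LE BOM followed by U+0000; that is one of the choices an application has to make for itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, for illustration only.
enum class bom_kind { none, utf8, utf16_be, utf16_le, utf32_be, utf32_le };

struct bom_result {
    bom_kind kind;       // which BOM was recognized, if any
    std::size_t length;  // number of octets the BOM occupies (0 if none)
};

// Inspects the first octets of a buffer and reports a recognized BOM.
// The caller decides whether to skip result.length octets or preserve them.
inline bom_result detect_bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    // The UTF-32 patterns are tested first because FF FE 00 00 begins with
    // the UTF-16LE pattern FF FE.
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) {
        return {bom_kind::utf32_be, 4};
    }
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) {
        return {bom_kind::utf32_le, 4};
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return {bom_kind::utf8, 3};
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return {bom_kind::utf16_be, 2};
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return {bom_kind::utf16_le, 2};
    }
    return {bom_kind::none, 0};
}
```

Returning the BOM length rather than stripping it leaves the "do you preserve it?" decision with the caller.
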
> The examples in the paper show the use of as_char16_t, which I think would
> address your concerns about BE/LE.
> I do think having a sizeof(char16_t) == 2 Mandates somewhere is a good
> idea though. Maybe in as_char16_t?
> Furthermore, Eddie's examples use uintN_t to model encoding schemes,
> which is what we should encourage - charN_t is meant to represent
> sequences of code units, not serialized data.
>
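
As a rough illustration of what such Mandates could look like in code, the sketch below uses static_assert inside a conversion step from a serialized 16-bit value to a code unit. The function to_char16 is a hypothetical name and is not the as_char16_t adaptor from the paper, whose specification is not reproduced here.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical conversion step, for illustration only. It refuses to compile
// on implementations where the Mandates-style conditions do not hold.
constexpr char16_t to_char16(std::uint16_t raw) {
    static_assert(CHAR_BIT == 8, "UTF encoding schemes assume 8-bit bytes");
    static_assert(sizeof(char16_t) == 2, "char16_t must occupy exactly two bytes");
    return static_cast<char16_t>(raw);
}

// Usage: serialized data is modeled with std::uint16_t and code units with
// char16_t, following the split described above.
inline void to_char16_example() {
    const std::uint16_t serialized = 0xFEFF;
    assert(to_char16(serialized) == u'\xFEFF');
}
```
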
> I like the idea of an endian view in that, if we want to care about
> UTF-16LE/BE, we should do it in exactly one place and design it so people
> don't use it unless they really meant to do so.
> The Unicode recommendation of defaulting to BE is not followed, so in
> practice, most Unicode content is LE and unless you target a BE platform
> (IBM strikes again), you can and should ignore all of that.
>
> On Sun, Jun 21, 2026 at 8:55 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> SG9 discussed P4030R0 (Endian Views) <https://wg21.link/p4030r0> in Brno
>> (notes <https://wiki.isocpp.org/NotesSG9P4030R0>, GH issue with polls
>> <https://github.com/cplusplus/papers/issues/2662#issuecomment-4678412530>)
>> this week and polled to forward it to LEWG. I was the sole dissenter on the
>> poll for reasons described below. SG16 has not yet reviewed the proposal.
>>
>> To be clear, I support the paper, but would like to see additional
>> analysis completed to gain more confidence in the design.
>>
>> My concerns are listed below. Some of these were not discussed in SG9
>> because they didn't occur to me until later.
>>
>> 1. Behavior for implementations for which CHAR_BIT is not 8.
>> 2. Behavior for std::[u]int_least*N*_t and similar type aliases (
>> [cstdint.syn]p3 <https://eel.is/c++draft/cstdint.syn#3>) that alias a
>> type with a value representation that exceeds the bits needed for *N*.
>> Note that the underlying types of char16_t and char32_t are
>> std::uint_least16_t and std::uint_least32_t respectively (
>> [basic.fundamental]p9 <https://eel.is/c++draft/basic.fundamental#9>).
>> 3. Lack of support for all of the Unicode encoding schemes
>> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G19273>
>> (byte oriented as in UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE; code unit
>> oriented as in UTF-16 and UTF-32 where the code units are either big-endian
>> or little-endian) or discussion of how the encoding schemes that are not
>> directly supported could be provided separately.
>> 4. Lack of support for the Unicode Byte Order Mark (BOM)
>> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G9354> or
>> discussion of how support for a BOM could be provided separately.
>>
>> P3477 (There are exactly 8 bits in a byte) <https://wg21.link/p3477> was
>> extensively discussed in EWG and LEWG in 2024-2025 but failed to gain
>> consensus across both groups. The status quo is therefore that bytes may
>> have more than 8 bits and such implementations are known to exist (though
>> none are supported by Clang, GCC, or MSVC; at least not in their upstream
>> repositories). std::byteswap() ([bit.byteswap]
>> <https://eel.is/c++draft/bit.byteswap#lib:byteswap>) does not have
>> differentiated behavior based on CHAR_BIT, but parts of the
>> std::text_encoding identification library do (see [locale.members]p6
>> <https://eel.is/c++draft/locale.members#6> and [text.encoding.members]p11
>> <https://eel.is/c++draft/text.encoding.members#11>,p13
>> <https://eel.is/c++draft/text.encoding.members#13>,p17
>> <https://eel.is/c++draft/text.encoding.members#17>). We *could* follow
>> suit and add CHAR_BIT == 8 mandates to the proposed endian views. If we
>> did, I would encourage adding a similar mandate to std::byteswap() as a
>> DR.
>>
>> std::byteswap() constrains the types it operates on to those that model
>> integral ([bit.byteswap]p1 <https://eel.is/c++draft/bit.byteswap#1>)
>> with an additional mandate for the lack of padding bits ([bit.byteswap]p2
>> <https://eel.is/c++draft/bit.byteswap#2>, but see LWG 4583 (std::byteswap
>> can make sense for some types with padding bytes)
>> <https://cplusplus.github.io/LWG/issue4583>). It has no accommodations
>> for type aliases like std::uint_least16_t that alias a type with a range
>> of values that exceeds that required for the alias. As a result, given a
>> char16_t object *C* with value 0xFEFF, an implementation for which CHAR_BIT
>> == 8 and std::uint_least16_t aliases a 32-bit type will produce a value
>> of 0xFFFE0000 rather than 0xFFFE for the expression std::byteswap(C).
>> This clearly fails the desired behavior for UTF-16 endian conversions.
>> There are several ways this can be addressed:
>>
>> 1. A mandate can be added that types with a specified value range
>> like char16_t have an underlying type of the exact size (e.g., sizeof(char16_t)
>> * CHAR_BIT == 16). That will suffice for common implementations, but
>> excludes support for some known implementations.
>> 2. An additional constant template parameter can be added to specify
>> the number of bytes of the object representation to operate on. A default
>> argument could then supply the correct value (based on a type trait) for
>> types like char16_t while allowing the user to specify an appropriate
>> value for type aliases like std::uint_least16_t. If we choose this
>> approach, I would encourage adding a similar constant template parameter
>> with a suitable default argument to std::byteswap(), though that
>> might constitute a breaking change (depending on how the default argument
>> is specified).
>> 3. The endian views could be specified such that they produce the
>> desired results without deference to std::byteswap() (though common
>> implementations could still implement them using std::byteswap()).
>>
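
The 0xFFFE0000 result described above can be reproduced on a common implementation by modeling the wide std::uint_least16_t with std::uint32_t. This is a minimal sketch under that assumption and with CHAR_BIT == 8. The helper reverse_object_bytes is a hypothetical stand-in for std::byteswap, written out by hand so the example does not require C++23.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for std::byteswap: reverses all sizeof(T) bytes of an
// unsigned integer, with no knowledge of how many bits the value needs.
template <class T>
constexpr T reverse_object_bytes(T value) {
    static_assert(std::is_unsigned<T>::value, "sketch covers unsigned types only");
    T result = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        // Move the lowest remaining byte of value into the low end of result.
        result = static_cast<T>((result << CHAR_BIT) |
                                (value & static_cast<T>(UCHAR_MAX)));
        value = static_cast<T>(value >> CHAR_BIT);
    }
    return result;
}

// Assumption for illustration: std::uint_least16_t aliases a 32-bit type.
using wide_code_unit = std::uint32_t;

inline void reverse_object_bytes_example() {
    // All four bytes are reversed, so the BOM lands in the upper half.
    assert(reverse_object_bytes(wide_code_unit{0xFEFF}) == 0xFFFE0000u);
    // With an exact-width 16-bit type the result is the expected one.
    assert(reverse_object_bytes(std::uint16_t{0xFEFF}) == 0xFFFE);
}
```
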
>> Note that, for UTF endian conversions, both CHAR_BIT and the number of
>> bytes in the object representation must be correctly handled in order for
>> byte swapping to produce expected results. An implementation with CHAR_BIT
>> == 9 and sizeof(char16_t) == 4 should still be expected to convert 0xFEFF
>> to 0xFFFE to satisfy Unicode expectations. This result differs from a
>> generic byte endian conversion (which is what std::byteswap() implements
>> and what is proposed by P4030R0 for endian views) and therefore implies
>> that UTF endian conversion should use a distinct algorithm that is
>> independent of the value of CHAR_BIT, the presence of padding bits, and
>> excess value range representation.
>>
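
One way to picture a distinct UTF algorithm of this kind is a swap that operates on the value of the code unit rather than on its object representation. The sketch below is an assumption about how that could look, not wording from P4030R0; swap_utf16_octets and swap_utf32_octets are hypothetical names. Only shifts and masks on the low 16 or 32 bits are used, so the result does not depend on CHAR_BIT, sizeof, or padding bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: exchanges the two octets of a UTF-16 code unit by
// value. Bits above the low 16 are discarded.
template <class CodeUnit>
constexpr CodeUnit swap_utf16_octets(CodeUnit unit) {
    const std::uint_least32_t v =
        static_cast<std::uint_least32_t>(unit) & 0xFFFFu;
    return static_cast<CodeUnit>(((v & 0x00FFu) << 8) | (v >> 8));
}

// Hypothetical helper: reverses the four octets of a UTF-32 code unit by
// value. Bits above the low 32 are discarded.
template <class CodeUnit>
constexpr CodeUnit swap_utf32_octets(CodeUnit unit) {
    const std::uint_least32_t v =
        static_cast<std::uint_least32_t>(unit) & 0xFFFFFFFFu;
    return static_cast<CodeUnit>(((v & 0x000000FFu) << 24) |
                                 ((v & 0x0000FF00u) << 8) |
                                 ((v >> 8) & 0x0000FF00u) |
                                 (v >> 24));
}

inline void swap_utf_octets_example() {
    // The same result is obtained whether the storage is 16 or 32 bits wide.
    assert(swap_utf16_octets(char16_t{0xFEFF}) == 0xFFFE);
    assert(swap_utf16_octets(std::uint32_t{0xFEFF}) == 0xFFFEu);
    assert(swap_utf32_octets(char32_t{0xFEFF}) == 0xFFFE0000u);
}
```
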
>> File streams and network interfaces typically provide a sequence of bytes
>> as opposed to a sequence of (possibly endian swapped) 16-bit or 32-bit
>> values. The primary motivation offered in P4030R0 is to support UTF
>> encodings, but it doesn't address sequences of bytes other than in its
>> (non-UTF) examples involving cipher suites. The Unicode Standard
>> specifies three encoding forms
>> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G7404>
>> and seven encoding schemes
>> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G28070>.
>> Encoding forms are never byte swapped and never contain a BOM; they
>> correspond to text in memory that is ready to be operated on as Unicode
>> text (e.g., text held in sequences of char8_t, char16_t, or char32_t).
>> Encoding schemes are byte oriented and may contain a BOM; they correspond
>> to data read from a file, network, or other device that is in an
>> interchange format and not necessarily ready to be directly operated on as
>> text (e.g., data in sequences of unsigned char, std::byte, or
>> std::uint16_t). The endian views proposed by P4030R0 are useful for the
>> UTF-16 and UTF-32 encoding schemes (when a BOM isn't present or is handled
>> separately), but do not provide direct support for UTF-16BE, UTF-16LE,
>> UTF-32BE, or UTF-32LE; at least not without a separate transform that
>> aggregates bytes into exact-width code unit size values before applying an
>> endian view.
>>
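
The separate transform that aggregates bytes into code units could be sketched as follows for UTF-16BE and UTF-16LE. The names utf16_scheme and assemble_utf16 are hypothetical, and the sketch assumes each input element carries exactly one octet and that the input holds an even number of octets. Because the code units are built with shifts, no endian view or byte swap is needed afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names, for illustration only.
enum class utf16_scheme { big_endian, little_endian };

// Assembles UTF-16 code units from a byte-oriented stream (for example, data
// read from a file or socket) according to the given encoding scheme.
inline std::vector<char16_t> assemble_utf16(
        const std::vector<unsigned char>& octets, utf16_scheme scheme) {
    assert(octets.size() % 2 == 0 && "input ends in a truncated code unit");
    std::vector<char16_t> units;
    units.reserve(octets.size() / 2);
    for (std::size_t i = 0; i + 1 < octets.size(); i += 2) {
        // Mask to 8 bits so the result is the same even if CHAR_BIT > 8.
        const unsigned first = octets[i] & 0xFFu;
        const unsigned second = octets[i + 1] & 0xFFu;
        const unsigned value = (scheme == utf16_scheme::big_endian)
                                   ? ((first << 8) | second)
                                   : ((second << 8) | first);
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

For example, the octets 00 41 D8 3D read as UTF-16BE and the octets 41 00 3D D8 read as UTF-16LE both yield the code units 0x0041 and 0xD83D.
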
>> Lack of support for byte oriented streams and a BOM implies an incomplete
>> solution for the primary motivation stated in P4030R0. I think it could be
>> reasonable to provide such support in a different paper, but I think we
>> should have a clear vision for how that support would work with the
>> proposed endian views before we proceed to standardize them. Per comments
>> above, it seems that generic endian views using the C++ definition of a
>> byte and its code unit types are at odds with the intent of the Unicode
>> Standard with its expectations of 8-bit bytes and exact-width code unit
>> types.
>>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>> Link to this post: http://lists.isocpp.org/sg16/2026/06/4734.php
>
>
>> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> Link to this post: http://lists.isocpp.org/sg16/2026/06/4736.php
>

Received on 2026-06-21 21:05:17