ISOCPP sg16 List: Re: [isocpp-sg16] Thoughts on P4030R0: Endian Views

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 21 Jun 2026 18:39:10 +0200

Let's add mandates for CHAR_BIT == 8 everywhere UTF related. UTF is simply
not defined in other scenarios.
The vote on P3477 does not change the fact that the maintenance of a C++
compiler for such a platform that is less than 2 decades old has not been
demonstrated. P3477 was certainly aiming to prevent these discussions!

I think the other concerns are orthogonal. Automagic BOM recognition
certainly doesn't belong here - even for char8_t - it's not free and it's
error prone. An adaptor could be constructed to provide that functionality.
(There are a couple of application-level choices here that the standard
should not try to answer even if we provide more convenience: do you
recognize the BOM? Do you preserve it?)
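
For illustration only, a rough sketch of how such an adaptor could keep the
two choices (recognize? preserve?) in the caller's hands. Every name here
(bom_kind, bom_policy, inspect_bom, text_units) is made up for this example
and is not taken from P4030R0. It assumes the input has already been
aggregated into 16-bit units and that std::uint16_t exists.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; none of this is from the paper.
enum class bom_kind { none, native, swapped };
enum class bom_policy { preserve, strip };

struct bom_result {
    bom_kind kind;           // what the first unit looked like
    std::size_t first_unit;  // index where the text starts under the policy
};

// Looks only at the first 16-bit unit.
// 0xFEFF read as-is: the data is already in the reader's byte order.
// 0xFFFE read as-is: every unit needs its two octets exchanged.
inline bom_result inspect_bom(const std::vector<std::uint16_t>& units,
                              bom_policy policy) {
    if (units.empty()) return {bom_kind::none, 0};
    bom_kind kind = bom_kind::none;
    if (units[0] == 0xFEFF) kind = bom_kind::native;
    else if (units[0] == 0xFFFE) kind = bom_kind::swapped;
    const bool skip = kind != bom_kind::none && policy == bom_policy::strip;
    return {kind, skip ? std::size_t{1} : std::size_t{0}};
}

// Returns the units that remain once the policy has been applied.
inline std::vector<std::uint16_t> text_units(
        const std::vector<std::uint16_t>& units, const bom_result& r) {
    assert(r.first_unit <= units.size());
    const auto first = units.begin() + static_cast<std::ptrdiff_t>(r.first_unit);
    return std::vector<std::uint16_t>(first, units.end());
}
```

The point of the split is that detection has a cost and a policy, and both
stay visible at the call site instead of being baked into an endian view.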

The examples in the paper show the use of as_char16_t, which I think would
address your concerns about BE/LE.
I do think having a sizeof(char16_t) == 2 Mandates somewhere is a good
idea though. Maybe in as_char16_t?
Furthermore, Eddie's examples use uintN_t to model encoding schemes, which
is what we should encourage - charN_t is meant to represent sequences of
code units, not serialized data.
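
To make the case for the Mandates concrete, here is a small sketch of the
0xFEFF example from the quoted message. reverse_bytes is a hand-written
stand-in for a generic byte reversal over the whole object (so the sketch
does not need C++23's std::byteswap), and std::uint32_t stands in for a
std::uint_least16_t that aliases a 32-bit type. swap_utf16_octets and
swapped_to_native are hypothetical names, not the paper's API.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Generic reversal of all bytes in the object representation of T.
template <class T>
constexpr T reverse_bytes(T v) {
    static_assert(std::is_unsigned<T>::value, "sketch handles unsigned types only");
    T out = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out = static_cast<T>((out << CHAR_BIT) | (v & static_cast<T>(UCHAR_MAX)));
        v = static_cast<T>(v >> CHAR_BIT);
    }
    return out;
}

// Exchanges the two octets of a 16-bit code unit value, no matter how wide
// the type holding it is.
constexpr std::uint_least32_t swap_utf16_octets(std::uint_least32_t v) {
    return ((v & 0x00FFu) << 8) | ((v >> 8) & 0x00FFu);
}

// Hypothetical helper; the static_asserts play the role of the Mandates.
constexpr char16_t swapped_to_native(char16_t c) {
    static_assert(CHAR_BIT == 8, "UTF encoding schemes assume 8-bit bytes");
    static_assert(sizeof(char16_t) == 2, "char16_t must be exactly two bytes");
    return static_cast<char16_t>(reverse_bytes(static_cast<std::uint16_t>(c)));
}

// A 16-bit value held in 32-bit storage ends up in the wrong half.
static_assert(reverse_bytes(std::uint32_t{0xFEFF}) == 0xFFFE0000u, "");
// With exactly two 8-bit bytes, the generic reversal is what Unicode expects.
static_assert(reverse_bytes(std::uint16_t{0xFEFF}) == 0xFFFEu, "");
// The value-based swap does not care about the storage width.
static_assert(swap_utf16_octets(0xFEFFu) == 0xFFFEu, "");
```

With the Mandates in place, the generic reversal and the value-based swap
agree, which is why rejecting the exotic cases at compile time looks
simpler than specifying a second algorithm.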

I like the idea of an endian view in that, if we want to care about
UTF16LE/BE, we should do it in exactly one place and design it so people
don't use it unless they really mean to do so.
The Unicode recommendation of defaulting to BE is not followed, so in
practice, most Unicode content is LE and unless you target a BE platform
(IBM strikes again), you can and should ignore all of that.
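
As a sketch of what "exactly one place" could look like for byte-oriented
input (not something the paper proposes), the function below assembles
UTF-16BE or UTF-16LE octets into char16_t code units with arithmetic only,
so it does not depend on the host's endianness. byte_order and
decode_utf16_octets are hypothetical names, and rejecting odd-length input
with an assert is an arbitrary choice made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class byte_order { big, little };  // hypothetical name

// Assembles pairs of octets into UTF-16 code units. The caller states the
// byte order explicitly; nothing is guessed and no BOM is interpreted.
inline std::vector<char16_t> decode_utf16_octets(
        const std::vector<unsigned char>& octets, byte_order order) {
    assert(octets.size() % 2 == 0 && "caller handles a trailing odd octet");
    std::vector<char16_t> units;
    units.reserve(octets.size() / 2);
    for (std::size_t i = 0; i + 1 < octets.size(); i += 2) {
        const bool big = order == byte_order::big;
        // Mask to 8 bits so only the octet value contributes.
        const unsigned hi = (big ? octets[i] : octets[i + 1]) & 0xFFu;
        const unsigned lo = (big ? octets[i + 1] : octets[i]) & 0xFFu;
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}
```

Both byte orders go through the same function, so code that only ever sees
LE data passes byte_order::little once and never thinks about it again.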

On Sun, Jun 21, 2026 at 8:55 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> SG9 discussed P4030R0 (Endian Views) <https://wg21.link/p4030r0> in Brno (
> notes <https://wiki.isocpp.org/NotesSG9P4030R0>, GH issue with polls
> <https://github.com/cplusplus/papers/issues/2662#issuecomment-4678412530>)
> this week and polled to forward it to LEWG. I was the sole dissenter on the
> poll for reasons described below. SG16 has not yet reviewed the proposal.
>
> To be clear, I support the paper, but would like to see additional
> analysis completed to gain more confidence in the design.
>
> My concerns are listed below. Some of these were not discussed in SG9
> because they didn't occur to me until later.
>
> 1. Behavior for implementations for which CHAR_BIT is not 8.
> 2. Behavior for std::[u]int_least*N*_t and similar type aliases (
> [cstdint.syn]p3 <https://eel.is/c++draft/cstdint.syn#3>) that alias a
> type with a value representation that exceeds the bits needed for *N*.
> Note that the underlying types of char16_t and char32_t are
> std::uint_least16_t and std::uint_least32_t respectively (
> [basic.fundamental]p9 <https://eel.is/c++draft/basic.fundamental#9>).
> 3. Lack of support for all of the Unicode encoding schemes
> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G19273>
> (byte oriented as in UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE; code unit
> oriented as in UTF-16 and UTF-32 where the code units are either big-endian
> or little-endian) or discussion of how the encoding schemes that are not
> directly supported could be provided separately.
> 4. Lack of support for the Unicode Byte Order Mark (BOM)
> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G9354> or
> discussion of how support for a BOM could be provided separately.
>
> P3477 (There are exactly 8 bits in a byte) <https://wg21.link/p3477> was
> extensively discussed in EWG and LEWG in 2024-2025 but failed to gain
> consensus across both groups. The status quo is therefore that bytes may
> have more than 8 bits and such implementations are known to exist (though
> none are supported by Clang, GCC, or MSVC; at least not in their upstream
> repositories). std::byteswap() ([bit.byteswap]
> <https://eel.is/c++draft/bit.byteswap#lib:byteswap>) does not have
> differentiated behavior based on CHAR_BIT, but parts of the
> std::text_encoding identification library do (see [locale.members]p6
> <https://eel.is/c++draft/locale.members#6> and [text.encoding.members]p11
> <https://eel.is/c++draft/text.encoding.members#11>,p13
> <https://eel.is/c++draft/text.encoding.members#13>,p17
> <https://eel.is/c++draft/text.encoding.members#17>). We *could* follow
> suit and add CHAR_BIT == 8 mandates to the proposed endian views. If we
> did, I would encourage adding a similar mandate to std::byteswap() as a
> DR.
>
> std::byteswap() constrains the types it operates on to those that model
> integral ([bit.byteswap]p1 <https://eel.is/c++draft/bit.byteswap#1>) with
> an additional mandate for the lack of padding bits ([bit.byteswap]p2
> <https://eel.is/c++draft/bit.byteswap#2>, but see LWG 4583 (std::byteswap
> can make sense for some types with padding bytes)
> <https://cplusplus.github.io/LWG/issue4583>). It has no accommodations for
> type aliases like std::uint_least16_t that alias a type with a range of
> values that exceeds that required for the alias. As a result, given a
> char16_t object *C* with value 0xFEFF, an implementation for which CHAR_BIT
> == 8 and std::uint_least16_t aliases a 32-bit type will produce a value
> of 0xFFFE0000 rather than 0xFFFE for the expression std::byteswap(C).
> This clearly fails the desired behavior for UTF-16 endian conversions.
> There are several ways this can be addressed:
>
> 1. A mandate can be added that types with a specified value range like
> char16_t have an underlying type of the exact size (e.g., sizeof(char16_t)
> * CHAR_BIT == 16). That will suffice for common implementations, but
> excludes support for some known implementations.
> 2. An additional constant template parameter can be added to specify
> the number of bytes of the object representation to operate on. A default
> argument could then supply the correct value (based on a type trait) for
> types like char16_t while allowing the user to specify an appropriate
> value for type aliases like std::uint_least16_t. If we choose this
> approach, I would encourage adding a similar constant template parameter
> with a suitable default argument to std::byteswap(), though that might
> constitute a breaking change (depending on how the default argument is
> specified).
> 3. The endian views could be specified such that they produce the
> desired results without deference to std::byteswap() (though common
> implementations could still implement them using std::byteswap()).
>
> Note that, for UTF endian conversions, both CHAR_BIT and the number of
> bytes in the object representation must be correctly handled in order for
> byte swapping to produce expected results. An implementation with CHAR_BIT
> == 9 and sizeof(char16_t) == 4 should still be expected to convert 0xFEFF
> to 0xFFFE to satisfy Unicode expectations. This result differs from a
> generic byte endian conversion (which is what std::byteswap() implements
> and what is proposed by P4030R0 for endian views) and therefore implies
> that UTF endian conversion should use a distinct algorithm that is
> independent of the value of CHAR_BIT, the presence of padding bits, and
> excess value range representation.
>
> File streams and network interfaces typically provide a sequence of bytes
> as opposed to a sequence of (possibly endian swapped) 16-bit or 32-bit
> values. The primary motivation offered in P4030R0 is to support UTF
> encodings, but it doesn't address sequences of bytes other than in its
> (non-UTF) examples involving cipher suites. The Unicode Standard
> specifies three encoding forms
> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G7404>
> and seven encoding schemes
> <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G28070>.
> Encoding forms are never byte swapped and never contain a BOM; they
> correspond to text in memory that is ready to be operated on as Unicode
> text (e.g., text held in sequences of char8_t, char16_t, or char32_t).
> Encoding schemes are byte oriented and may contain a BOM; they correspond
> to data read from a file, network, or other device that is in an
> interchange format and not necessarily ready to be directly operated on as
> text (e.g., data in sequences of unsigned char, std::byte, or
> std::uint16_t). The endian views proposed by P4030R0 are useful for the
> UTF-16 and UTF-32 encoding schemes (when a BOM isn't present or is handled
> separately), but do not provide direct support for UTF-16BE, UTF-16LE,
> UTF-32BE, or UTF-32LE; at least not without a separate transform that
> aggregates bytes into exact-width code unit size values before applying an
> endian view.
>
> Lack of support for byte oriented streams and a BOM implies an incomplete
> solution for the primary motivation stated in P4030R0. I think it could be
> reasonable to provide such support in a different paper, but I think we
> should have a clear vision for how that support would work with the
> proposed endian views before we proceed to standardize them. Per comments
> above, it seems that generic endian views using the C++ definition of a
> byte and its code unit types are at odds with the intent of the Unicode
> Standard with its expectations of 8-bit bytes and exact-width code unit
> types.
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> Link to this post: http://lists.isocpp.org/sg16/2026/06/4734.php
>

Received on 2026-06-21 16:39:37