ISOCPP sg16 List: [isocpp-sg16] Thoughts on P4030R0: Endian Views

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 21 Jun 2026 02:54:48 -0400

SG9 discussed P4030R0 (Endian Views) <https://wg21.link/p4030r0> in Brno
(notes <https://wiki.isocpp.org/NotesSG9P4030R0>, GH issue with polls
<https://github.com/cplusplus/papers/issues/2662#issuecomment-4678412530>)
this week and polled to forward it to LEWG. I was the sole dissenter on
the poll for reasons described below. SG16 has not yet reviewed the
proposal.

To be clear, I support the paper, but would like to see additional
analysis completed to gain more confidence in the design.

My concerns are listed below. Some of these were not discussed in SG9
because they didn't occur to me until later.

1. Behavior for implementations for which CHAR_BIT is not 8.
2. Behavior for std::[u]int_least/N/_t and similar type aliases
    ([cstdint.syn]p3 <https://eel.is/c++draft/cstdint.syn#3>) that alias
    a type with a value representation that exceeds the bits needed for
    /N/. Note that the underlying types of char16_t and char32_t are
    std::uint_least16_t and std::uint_least32_t respectively
    ([basic.fundamental]p9 <https://eel.is/c++draft/basic.fundamental#9>).
3. Lack of support for all of the Unicode encoding schemes
    <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G19273>
    (byte oriented as in UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE;
    code unit oriented as in UTF-16 and UTF-32 where the code units are
    either big-endian or little-endian) or discussion of how the
    encoding schemes that are not directly supported could be provided
    separately.
4. Lack of support for the Unicode Byte Order Mark (BOM)
    <https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/#G9354> or
    discussion of how support for a BOM could be provided separately.

P3477 (There are exactly 8 bits in a byte) <https://wg21.link/p3477> was
extensively discussed in EWG and LEWG in 2024-2025 but failed to gain
consensus across both groups. The status quo is therefore that bytes may
have more than 8 bits and such implementations are known to exist
(though none are supported by Clang, GCC, or MSVC; at least not in their
upstream repositories). std::byteswap() ([bit.byteswap]
<https://eel.is/c++draft/bit.byteswap#lib:byteswap>) does not have
differentiated behavior based on CHAR_BIT, but parts of the
std::text_encoding identification library do (see [locale.members]p6
<https://eel.is/c++draft/locale.members#6> and
[text.encoding.members]p11
<https://eel.is/c++draft/text.encoding.members#11>,p13
<https://eel.is/c++draft/text.encoding.members#13>,p17
<https://eel.is/c++draft/text.encoding.members#17>). We /could/ follow
suit and add CHAR_BIT == 8 mandates to the proposed endian views. If we
did, I would encourage adding a similar mandate to std::byteswap() as a DR.
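
As a rough illustration of what such a mandate amounts to, the sketch
below guards a 16-bit swap with a static_assert on CHAR_BIT. The name
swap16_octets_only is hypothetical; it is not taken from P4030R0 or the
standard library, and std::uint16_t is assumed to exist.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helper for illustration only; not from P4030R0.
constexpr std::uint16_t swap16_octets_only(std::uint16_t value) {
    // Stand-in for a "Mandates: CHAR_BIT == 8" clause: compilation fails
    // on implementations whose bytes are not octets.
    static_assert(CHAR_BIT == 8, "endian conversion assumes 8-bit bytes");
    // Exchange the two octets; the cast discards the bits above bit 15
    // that appear after integral promotion.
    return static_cast<std::uint16_t>((value << 8) | (value >> 8));
}

static_assert(swap16_octets_only(0xFEFF) == 0xFFFE, "octets exchanged");
```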

std::byteswap() constrains the types it operates on to those that model
integral ([bit.byteswap]p1 <https://eel.is/c++draft/bit.byteswap#1>)
with an additional mandate that the type have no padding bits
([bit.byteswap]p2 <https://eel.is/c++draft/bit.byteswap#2>, but see LWG
4583 (std::byteswap can make sense for some types with padding bytes)
<https://cplusplus.github.io/LWG/issue4583>). It has no accommodations
for type aliases like std::uint_least16_t that alias a type with a range
of values that exceeds that required for the alias. As a result, given a
char16_t object /C/ with value 0xFEFF, an implementation for which
CHAR_BIT == 8 and std::uint_least16_t aliases a 32-bit type will produce
a value of 0xFFFE0000 rather than 0xFFFE for the expression
std::byteswap(C). This clearly fails to provide the behavior desired for
UTF-16 endian conversions. There are several ways this can be addressed:

1. A mandate can be added that types with a specified value range like
    char16_t have an underlying type of the exact size (e.g.,
    sizeof(char16_t) * CHAR_BIT == 16). That will suffice for common
    implementations, but excludes support for some known implementations.
2. An additional constant template parameter can be added to specify
    the number of bytes of the object representation to operate on. A
    default argument could then supply the correct value (based on a
    type trait) for types like char16_t while allowing the user to
    specify an appropriate value for type aliases like
    std::uint_least16_t. If we choose this approach, I would encourage
    adding a similar constant template parameter with a suitable default
    argument to std::byteswap(), though that might constitute a breaking
    change (depending on how the default argument is specified).
3. The endian views could be specified such that they produce the
    desired results without deferring to std::byteswap() (though common
    implementations could still implement them using std::byteswap()).
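
The failure mode and the second option can be sketched together. The
code below is an illustration under stated assumptions, not the
interface proposed in P4030R0: wide_least16 is a hypothetical stand-in
for a std::uint_least16_t that aliases a 32-bit type, swap_octets is a
hypothetical function that works on values rather than on the object
representation, and 8-bit bytes are assumed. With the default count it
reverses every byte of the type, which is the result attributed to
std::byteswap(C) above; with an explicit count of 2 it yields the value
that UTF-16 needs.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Assumption: models a std::uint_least16_t that aliases a 32-bit type.
using wide_least16 = std::uint32_t;

// Hypothetical sketch of option 2: reverse the low N octets of value.
// N defaults to every byte of T; octets above the low N are discarded.
template <typename T, std::size_t N = sizeof(T)>
constexpr T swap_octets(T value) {
    static_assert(std::is_unsigned<T>::value, "unsigned integer types only");
    static_assert(N <= sizeof(T), "cannot swap more bytes than T holds");
    T result = 0;
    for (std::size_t i = 0; i < N; ++i) {
        // Shift the octets gathered so far up by one position and place
        // octet i of the input below them.
        result = static_cast<T>((result << 8) | ((value >> (8 * i)) & 0xFF));
    }
    return result;
}

// Default count: all four bytes are reversed.
static_assert(swap_octets(wide_least16{0xFEFF}) == 0xFFFE0000u, "generic swap");
// Explicit count of 2: only the octets that carry the code unit move.
static_assert(swap_octets<wide_least16, 2>(0xFEFF) == 0xFFFEu, "UTF-16 swap");
```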

Note that, for UTF endian conversions, both CHAR_BIT and the number of
bytes in the object representation must be correctly handled in order
for byte swapping to produce expected results. An implementation with
CHAR_BIT == 9 and sizeof(char16_t) == 4 should still be expected to
convert 0xFEFF to 0xFFFE to satisfy Unicode expectations. This result
differs from a generic byte endian conversion (which is what
std::byteswap() implements and what is proposed by P4030R0 for endian
views) and therefore implies that UTF endian conversion should use a
distinct algorithm that is independent of the value of CHAR_BIT, the
presence of padding bits, and excess value range representation.
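
One possible shape for such an algorithm is to swap by arithmetic on the
code unit value rather than by reordering bytes of the object
representation. The function names below are hypothetical. Because the
sketch only masks and shifts values, it does not depend on CHAR_BIT, on
padding bits, or on how wide the underlying type of char16_t or char32_t
happens to be.

```cpp
#include <cstdint>

// Hypothetical: exchange the two octets of a UTF-16 code unit by value.
constexpr char16_t swap_utf16_code_unit(char16_t unit) {
    // Widen, then keep only the 16 bits that Unicode assigns meaning to.
    const std::uint_least32_t v =
        static_cast<std::uint_least32_t>(unit) & 0xFFFFu;
    return static_cast<char16_t>(((v & 0xFFu) << 8) | (v >> 8));
}

// Hypothetical: reverse the four octets of a UTF-32 code unit by value.
constexpr char32_t swap_utf32_code_unit(char32_t unit) {
    const std::uint_least32_t v =
        static_cast<std::uint_least32_t>(unit) & 0xFFFFFFFFu;
    return static_cast<char32_t>(((v & 0xFFu) << 24) | ((v & 0xFF00u) << 8) |
                                 ((v >> 8) & 0xFF00u) | (v >> 24));
}

static_assert(swap_utf16_code_unit(char16_t{0xFEFF}) == 0xFFFE, "UTF-16 BOM");
static_assert(swap_utf32_code_unit(char32_t{0xFEFF}) == 0xFFFE0000u, "UTF-32 BOM");
```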

File streams and network interfaces typically provide a sequence of
bytes as opposed to a sequence of (possibly endian swapped) 16-bit or
32-bit values. The primary motivation offered in P4030R0 is to support
UTF encodings, but it doesn't address sequences of bytes other than in
its (non-UTF) examples involving cipher suites. The Unicode Standard
specifies three encoding forms
<https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G7404> and seven
encoding schemes
<https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G28070>.
Encoding forms are never byte swapped and never contain a BOM; they
correspond to text in memory that is ready to be operated on as Unicode
text (e.g., text held in sequences of char8_t, char16_t, or char32_t).
Encoding schemes are byte oriented and may contain a BOM; they
correspond to data read from a file, network, or other device that is in
an interchange format and not necessarily ready to be directly operated
on as text (e.g., data in sequences of unsigned char, std::byte, or
std::uint16_t). The endian views proposed by P4030R0 are useful for the
UTF-16 and UTF-32 encoding schemes (when a BOM isn't present or is
handled separately), but do not provide direct support for UTF-16BE,
UTF-16LE, UTF-32BE, or UTF-32LE; at least not without a separate
transform that aggregates bytes into exact-width code unit size values
before applying an endian view.
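
A separate transform of that kind might look like the following sketch,
which assembles UTF-16BE or UTF-16LE bytes into char16_t code units. The
names byte_order and aggregate_utf16 are hypothetical, a fixed-size
std::array stands in for a real byte stream, each input element is
assumed to carry one octet, and a trailing odd byte is silently dropped.
Since the byte order is applied while aggregating, no later swap is
needed in this arrangement.

```cpp
#include <array>
#include <cstddef>

enum class byte_order { big, little };

// Hypothetical: combine pairs of octets into UTF-16 code units.
template <std::size_t N>
constexpr std::array<char16_t, N / 2>
aggregate_utf16(const std::array<unsigned char, N>& bytes, byte_order order) {
    std::array<char16_t, N / 2> units{};
    for (std::size_t i = 0; i < N / 2; ++i) {
        // Mask to 8 bits so the result is the same even if unsigned char
        // is wider than an octet.
        const unsigned first = bytes[2 * i] & 0xFFu;
        const unsigned second = bytes[2 * i + 1] & 0xFFu;
        units[i] = static_cast<char16_t>(order == byte_order::big
                                             ? (first << 8) | second
                                             : (second << 8) | first);
    }
    return units;
}

constexpr std::array<unsigned char, 4> sample{{0xFE, 0xFF, 0x00, 0x41}};
static_assert(aggregate_utf16(sample, byte_order::big)[0] == 0xFEFF, "BE");
static_assert(aggregate_utf16(sample, byte_order::little)[0] == 0xFFFE, "LE");
```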

Lack of support for byte oriented streams and a BOM implies an
incomplete solution for the primary motivation stated in P4030R0. I
think it could be reasonable to provide such support in a different
paper, but I think we should have a clear vision for how that support
would work with the proposed endian views before we proceed to
standardize them. Per comments above, it seems that generic endian views
using the C++ definition of a byte and its code unit types are at odds
with the intent of the Unicode Standard with its expectations of 8-bit
bytes and exact-width code unit types.
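
To make the BOM concern concrete, here is a minimal sketch of a
detection step that could run in front of an endian view for the UTF-16
encoding scheme. The names utf16_bom and detect_utf16_bom are
hypothetical and not part of P4030R0. UTF-32 is left out; its
little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian
BOM, so a real implementation would need more context to tell them
apart.

```cpp
enum class utf16_bom { none, big_endian, little_endian };

// Hypothetical: classify the first two octets of a byte stream.
constexpr utf16_bom detect_utf16_bom(unsigned char first,
                                     unsigned char second) {
    return (first == 0xFE && second == 0xFF) ? utf16_bom::big_endian
         : (first == 0xFF && second == 0xFE) ? utf16_bom::little_endian
                                             : utf16_bom::none;
}

static_assert(detect_utf16_bom(0xFE, 0xFF) == utf16_bom::big_endian, "BE");
static_assert(detect_utf16_bom(0xFF, 0xFE) == utf16_bom::little_endian, "LE");
```

When no BOM is found, the Unicode Standard treats the UTF-16 encoding
scheme as big-endian unless a higher-level protocol says otherwise, so a
caller of this sketch would still need a policy for the none case.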

Tom.

Received on 2026-06-21 06:54:58