Date: Sat, 4 Dec 2021 12:05:45 +0100
On 04/12/2021 09.26, Corentin Jabot wrote:
>
>
> On Sat, Dec 4, 2021, 01:04 Tom Honermann <tom_at_[hidden]> wrote:
> I believe this requirement is already the status quo. Let me provide a better example than I did previously.
>
> std::format("<text>");
>
> If the literal encoding is not self-synchronizing, then <text> may contain code units that match the (single) code unit for '{' but that do not encode the '{' character. This can happen with DBCS or shift-state encodings. An implementation needs to be able to recognize this case (for affected encodings) in order to avoid incorrectly interpreting the text as containing an introducer for a replacement field.
>
>
> I am well aware.
> I wonder if we understood that fully (compile-time support and code point semantics were decisions taken at about the same time, independently of one another). I do not recall realizing that we were asking for full-blown constexpr code point decoding.
If we impose a requirement for a code unit -> code point decoder for the
literal encoding at compile-time, we should make such a facility generally
available instead of hiding it in the guts of the std::format parser.
That probably means at least a constexpr mbrlen and/or mbrtowc
with an assumed locale fitting the literal encoding.
Hm... It seems mbrlen returning 1 might not be as helpful as it
seems, because we could be in a foreign shift state where
code unit value == '}' does not actually mean a '}' is encoded.
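
To make the shift-state concern concrete, here is a minimal sketch, assuming
an ISO-2022-JP-like literal encoding in which ESC $ B switches to a
double-byte set and ESC ( B switches back to ASCII. The function names are
hypothetical and only these two escape sequences are handled; this is not
how any existing std::format implementation works, and it assumes the
compiler itself uses an ASCII-compatible encoding for '}'.

```cpp
#include <cstddef>
#include <string_view>

// Naive scan: counts every code unit whose value equals '}'.
constexpr std::size_t count_close_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == '}') ++n;
    }
    return n;
}

// Shift-state-aware scan (hypothetical helper, ISO-2022-JP subset only).
// Tracks whether we are inside the double-byte set and ignores code units
// there, so a 0x7D byte inside a double-byte character is not counted.
constexpr std::size_t count_close_stateful(std::string_view s) {
    std::size_t n = 0;
    bool double_byte = false;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '\x1B' && i + 2 < s.size()) {
            if (s[i + 1] == '$' && s[i + 2] == 'B') {  // ESC $ B: enter double-byte set
                double_byte = true;
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && s[i + 2] == 'B') {  // ESC ( B: back to ASCII
                double_byte = false;
                i += 3;
                continue;
            }
        }
        if (double_byte) {  // skip both bytes of a double-byte character
            i += 2;
            continue;
        }
        if (s[i] == '}') ++n;
        ++i;
    }
    return n;
}

// ESC $ B, one double-byte character whose second byte is 0x7D,
// ESC ( B, then a real '}'.
constexpr std::string_view sample{"\x1B$B" "\x30\x7D" "\x1B(B" "}"};

static_assert(count_close_naive(sample) == 2, "naive scan sees two 0x7D code units");
static_assert(count_close_stateful(sample) == 1, "only one of them encodes '}'");
```

The point of the sketch is that a per-code-unit comparison (or an mbrlen-style
length of 1) is not enough on its own; the scanner has to carry the shift
state across the whole string.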
> I think I'd like to get input from implementers.
> If I understand this MSVC PR correctly, compile-time support for non-UTF-8 multibyte encodings is limited:
> https://github.com/microsoft/STL/pull/2221
>
> To be clear, I am not opposed to the direction, but I am reluctant to go further down this road without implementers' support. We are asking a lot.
Yes.
> For example, the work to add EBCDIC to clang has a home-grown encoder, because clang cares about environments where iconv is not present.
> In addition to adding constexpr builtins, this direction would likely mandate that someone write an EBCDIC -> UTF decoder in clang or libc++.
>
> It makes me wonder if some of these features should be restricted to u8 formatting strings 😅
We even have a "u8" string literal prefix to indicate UTF-8 string literals.
What foresight.
Jens
Received on 2021-12-04 05:05:54