sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Tue, 7 Dec 2021 22:58:00 +0000

Charlie’s Opinion:

I would love it if some of these features were restricted to self-synchronizing encodings; it gets _really_ nasty once you bring in the others (Shift-JIS, GBK, and GB18030, which is GBK-derived; I think Big5 might also suffer from this affliction). My work on implementing P2216 only really works for self-synchronizing encodings, falling back to runtime validation for the others. Currently it actually only supports UTF-8, but I plan to add a list of “OK” encodings that would include the usual legacy ones before it ships. Even when it has to fall back, it still enforces that the format string itself is a compile-time constant.

Implementing robust parsing for all the shift encodings at compile time really does seem like a _whole_ extra project, and does increase the implementation difficulty quite a lot. Our implementation already treats UTF-8 (and soon other self-synchronizing encodings) as the good path, and is over an order of magnitude slower if we need to deal with shift encodings (it’s actually slower than iostreams, by a factor of 1.5-2, on that slow path).

I think allowing implementations to limit features to self-synchronizing encodings is sufficient. If the encoding has that property, then we never have to decode it except to compute width via grapheme clusters, and presumably we have some way of skipping over one code unit without fully decoding it (that’s also not a compile-time thing).

For fill characters in particular, allowing implementations to just slurp up “one character” and then throw an error if the next character isn’t the following valid format control character should be sufficient. I’m not sure we have wording for this (it might be tricky, especially if we need to decide on how to handle bogus fill characters). This also only works easily for self-synchronizing encodings (although that might not even be strong enough).
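To make the “slurp one character” idea concrete, here is a minimal sketch, assuming the literal encoding is UTF-8. The helper names (utf8_seq_len, fill_length) are hypothetical and are not taken from any shipping implementation. The length of the fill is read off the lead byte alone, the continuation bytes are checked only for their bit pattern, and the fill is accepted only if an alignment character follows.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length implied by a UTF-8 lead byte; 0 if the byte cannot start a sequence.
constexpr std::size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

constexpr bool is_align(char c) { return c == '<' || c == '>' || c == '^'; }

// Hypothetical helper: number of code units occupied by the fill character
// at the start of a format-spec, or 0 if the spec does not begin with
// "fill + align". Assumes UTF-8; does not fully decode the code point.
constexpr std::size_t fill_length(std::string_view spec) {
    if (spec.empty() || spec[0] == '{' || spec[0] == '}') return 0;
    const std::size_t n = utf8_seq_len(static_cast<unsigned char>(spec[0]));
    if (n == 0 || spec.size() <= n) return 0;   // bogus lead or truncated
    for (std::size_t i = 1; i < n; ++i) {
        // Every trailing code unit must look like 10xxxxxx.
        if ((static_cast<unsigned char>(spec[i]) & 0xC0) != 0x80) return 0;
    }
    return is_align(spec[n]) ? n : 0;           // fill must be followed by align
}

static_assert(fill_length("*>4") == 1);
static_assert(fill_length(">4") == 0);                   // align only, no fill
static_assert(fill_length("\xF0\x9F\xA4\xA1>4") == 4);   // 4-byte fill, then '>'
```

A real implementation would report an error for a bogus fill instead of returning 0; the sketch only shows that no full decode is needed when the encoding is self-synchronizing.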

In fact, when implementing the grapheme based width computation I ended up having to implement my own UTF-8 decoding routine, since I wanted to handle invalid UTF-8 in a consistent way (after all, there’s no guarantee the string you’re computing the width of has any particular encoding in the real world), and I didn’t want to make a real indirect call into libicu for every single character in that string.

From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Corentin Jabot via SG16
Sent: Saturday, December 4, 2021 12:27 AM
To: Tom Honermann <tom_at_[hidden]>
Cc: Corentin Jabot <corentinjabot_at_[hidden]>; SG16 <sg16_at_[hidden]>; Barry Revzin <barry.revzin_at_[hidden]>
Subject: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

On Sat, Dec 4, 2021, 01:04 Tom Honermann <tom_at_[hidden]> wrote:
On 12/3/21 4:47 PM, Corentin Jabot wrote:

On Fri, Dec 3, 2021, 22:03 Tom Honermann <tom_at_[hidden]> wrote:
On 12/1/21 2:28 PM, Corentin Jabot wrote:

On Wed, Dec 1, 2021 at 8:13 PM Tom Honermann <tom_at_[hidden]> wrote:
On 11/28/21 5:22 AM, Jens Maurer wrote:

On 28/11/2021 10.42, Corentin Jabot via SG16 wrote:

On Sun, Nov 28, 2021, 01:31 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:

     2. If the estimated width of the fill character is greater than 1, then alignment to the end of the available space might not be possible. The choice here is whether to under-fill or over-fill the available space. This possibility is avoided if fill characters are restricted to characters with an estimated width of exactly 1.

        std::format("{:🤡>4}", 123);
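The under-fill versus over-fill choice comes down to a rounding direction. The following is a rough sketch of the arithmetic, assuming the fill character has an estimated width of 2 and the formatted value "123" has width 3. The type fill_count and the function fill_choices are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how many fill characters each policy would emit.
struct fill_count {
    std::size_t under;  // never exceeds the requested field width
    std::size_t over;   // always reaches at least the requested field width
};

constexpr fill_count fill_choices(std::size_t field_width,
                                  std::size_t content_width,
                                  std::size_t fill_width) {
    if (fill_width == 0 || content_width >= field_width) return {0, 0};
    const std::size_t gap = field_width - content_width;
    return {gap / fill_width,                      // round down: under-fill
            (gap + fill_width - 1) / fill_width};  // round up: over-fill
}

// Width 4, content width 3, fill width 2: one column is left over.
static_assert(fill_choices(4, 3, 2).under == 0);  // output is 3 columns wide
static_assert(fill_choices(4, 3, 2).over == 1);   // output is 5 columns wide
// With a width-1 fill the two policies always agree.
static_assert(fill_choices(4, 3, 1).under == 1 && fill_choices(4, 3, 1).over == 1);
```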

Is there value in specifying it? Neither solution is great or terrible; I think saying unspecified would be fine, and so would underfilling, I guess.

Hopefully, we are consistent and choose option 1 among those specified in the LWG issue.

    For P2286R3 <https://wg21.link/p2286r3>, LEWG requested <https://lists.isocpp.org/sg16/2021/11/2845.php> that SG16 consider the ramifications for support of user defined delimiters. We should also discuss the "?" specifier proposed to explicitly opt in to quoted and escaped formats for std::string, std::string_view, and arrays of char/wchar_t.

Not sure the quoted thing is in our purview.

For the delimiter, we should support codepoints, to be consistent with everything else. The issue is that we don't have experience with that, afaik.

But the compile-time format string parser might not necessarily understand the details of the literal encoding, so it's unclear how codepoints map to code units. Or are you saying that the rest of std::format already requires detailed understanding, anyway?

I believe the compile-time format string parser is already required to understand such details. For example, if the literal encoding is Shift-JIS, then the parser would need to be able to differentiate byte values that appear as lead code units vs trailing code units (since, for example, a 0x5C code unit denotes the '\' character if it is a lead code unit, but that value may also appear as a trailing code unit for a double byte character).
I think Jens is right. MSVC does handle Shift-JIS specifically, but I'm not sure we can/should mandate something that works universally (the burden on implementations could be high).

Are you suggesting that we should revisit the consensus for the proposed resolution for LWG3576 <https://cplusplus.github.io/LWG/issue3576> from our 2021-08-25 telecon <https://github.com/sg16-unicode/sg16-meetings#august-25th-2021>?

I am concerned about implementability.
The current resolution calls for a compile-time mechanism to read a codepoint for an arbitrary encoding.
Such a mechanism currently doesn't exist.
For an implementation like GCC, the generic solution would be to expose iconv facilities through builtins (the equivalent of mblen or mbrtocX at least, I think, as Hubert pointed out).
This seems... a lot to ask in an issue resolution.
I don't remember if that was considered last time or if it constitutes new information in any way, but we might want to bring that up again.

I believe this requirement is already the status quo. Let me provide a better example than I did previously.

std::format("<text>");

If the literal encoding is not self-synchronizing then <text> may contain code units that correspond to the (single) code unit for '{' but that do not encode the '{' character. This can happen due to DBCS or shift-state encoding. An implementation needs to be able to recognize this case (for affected encodings) in order to avoid incorrectly interpreting the text as containing an introducer for a replacement field.
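As an illustration of that hazard, here is a minimal sketch, assuming a Shift-JIS literal encoding in which lead bytes fall in 0x81-0x9F or 0xE0-0xFC and '{' is the single byte 0x7B. The function names are hypothetical. A byte-wise search finds a false '{' inside a double-byte character, while a scan that tracks lead bytes does not.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr std::size_t npos = std::string_view::npos;

// Assumed Shift-JIS lead-byte ranges; other DBCS encodings use different ones.
constexpr bool sjis_is_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Byte-wise search: correct only for self-synchronizing encodings.
constexpr std::size_t find_brace_naive(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == '{') return i;
    return npos;
}

// Lead-byte-aware search: skips the trail byte of every double-byte character.
constexpr std::size_t find_brace_sjis(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (sjis_is_lead(static_cast<unsigned char>(s[i]))) {
            ++i;        // the next code unit is a trail byte, never a '{'
            continue;
        }
        if (s[i] == '{') return i;
    }
    return npos;
}

// 0x83 0x7B is one double-byte character; the real replacement field follows.
constexpr std::string_view text = "\x83\x7B{}";
static_assert(find_brace_naive(text) == 1);  // false positive on the trail byte
static_assert(find_brace_sjis(text) == 2);   // the actual '{'
```

The scan has to start from a known character boundary, which is exactly what a non-self-synchronizing encoding forces on the parser.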

I am well aware.
I wonder if we understood that fully (compile-time support and codepoint semantics were decisions taken at about the same time, independently of one another). I do not recall realizing that we were asking for full-blown constexpr codepoint decoding.

I think I'd like to get input from implementers.
If I understand this MSVC PR, support for compile-time non-UTF-8 multibyte encodings is limited:
https://github.com/microsoft/STL/pull/2221

I am not opposed to the direction, to be clear, but I am reluctant to go further down this road without implementers' support. We are asking a lot.

For reasons, the work to add EBCDIC to clang has a home-grown encoder, for example, as clang cares about environments where iconv is not present.
This direction would likely, in addition to adding constexpr builtins, mandate that someone write an EBCDIC -> UTF decoder in clang or libc++.

It makes me wonder if some of these features should be restricted to u8 formatting strings 😅

It might turn out to be a non-issue, but it's worth making sure we are all on the same page.

Charlie, opinion?

Tom.

Tom.

I agree with Corentin that delimiters should be restricted to code points. That is consistent with the direction we have already advocated for fill characters in LWG3576 <https://cplusplus.github.io/LWG/issue3576>.

Tom.

Received on 2021-12-07 16:58:07