On Mon, Oct 10, 2022 at 6:01 AM Tom Honermann <tom@honermann.net> wrote:

On 10/2/22 5:45 PM, Corentin Jabot via SG16 wrote:

On the second poll, I'll copy the message I sent before the meeting:

--

There are further issues here.
The width of a grapheme is independent of encoding.
We are just not forcing implementations not to decode. Is that what we want?
I don't think it is useful.
Most encodings cannot represent any of the wide code points, and the wideness of code points in Shift-JIS can be derived without doing a full decoding.

Suggested resolution:
For a string decoded to a sequence of Unicode code points, its width is the sum of the estimated widths of the first code points in its extended grapheme clusters.
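
To make the suggested wording concrete, here is a minimal C++ sketch of that computation. It is not taken from the thread or from any implementation: the names `Range`, `kWideRanges`, `estimated_width`, and `string_width` are hypothetical, the range table is an illustrative subset rather than the normative list, and the input is assumed to be already decoded and already segmented into extended grapheme clusters (segmentation itself is out of scope here).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: an inclusive range of code points.
struct Range { char32_t first, last; };

// Assumption: an illustrative subset of code point ranges treated as
// width 2. The normative list lives in the standard's wording.
constexpr Range kWideRanges[] = {
    {0x1100, 0x115F},   {0x2E80, 0x303E},   {0x3040, 0xA4CF},
    {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},   {0xFF00, 0xFF60},
    {0x1F300, 0x1F64F}, {0x20000, 0x2FFFD},
};

// Estimated width of a single code point: 2 if it falls in one of the
// ranges above, otherwise 1.
constexpr int estimated_width(char32_t cp) {
    for (const Range& r : kWideRanges)
        if (cp >= r.first && cp <= r.last) return 2;
    return 1;
}

// Width of a string given as a sequence of extended grapheme clusters.
// Assumption: the caller has already decoded the text to code points and
// split it into clusters; only the first code point of each cluster counts.
std::size_t string_width(const std::vector<std::u32string>& clusters) {
    std::size_t width = 0;
    for (const std::u32string& cluster : clusters) {
        assert(!cluster.empty() && "a cluster holds at least one code point");
        width += static_cast<std::size_t>(estimated_width(cluster.front()));
    }
    return width;
}
```

Note that nothing in this sketch depends on the source encoding: once the text is a sequence of code points, the same table applies whether the bytes were UTF-8, Shift-JIS, or anything else.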

If the intent is for implementers to throw their hands in the air when the encoding is not "a Unicode encoding", then surely we want to support UTF-8/16/32 and that's it. UTF-EBCDIC isn't more important or special than Shift-JIS, and there is no reason for one encoding to have privileged handling over the other.

I think that is where we ended up; the intent is only to specify behavior for UTF-8, UTF-16, and UTF-32. I think the best we could do for encodings that are not defined by the C++ standard or one of its normative references would be to add normative guidance to do likewise for all implementation-defined encodings; in which case there would be no need to restrict guidance to Unicode encodings; we could simply specify widths for characters independently of how they are mapped to any specific encoding.

More generally, any encoding that can round trip through Unicode should qualify as a Unicode encoding, but I don't think we have a definition of that anywhere.
Unicode defines "Unicode encoding form" as:
> A character encoding form that assigns each Unicode scalar value to a unique code unit sequence

--

I.e., I don't think the poll solves anything; it just uses different terminology to describe the same thing (nothing in ISO/IEC 10646 leads me to believe that "UCS encoding scheme" can only designate UCS encoding forms specified in ISO/IEC 10646, in addition to being obscure terminology).

It solves the issue that there is no definition for "Unicode encoding". "UCS encoding scheme" at least has a definition. If the definition in ISO/IEC 10646 is not clear, then I would argue that is a concern to raise with WG2.

It's clear, just not limited in the way that we want.

I think we would be much better off by saying "For UTF-8, UTF-16 and UTF-32". It doesn't leave much room for confusion.

On GB18030, it's a different character set, with its own set of encodings.

I've been under the impression that, as of GB 18030-2022, use of the PUA is no longer required because all GB 18030 specified characters are now represented in Unicode. In other words, the Unicode repertoire is a superset of the GB 18030 repertoire. Is that not correct? Its specified encodings are, of course, distinct.

Yes, I do believe that, as of this year, all characters representable in GB 18030 can be represented in Unicode.

But the fact that Unicode is a superset makes any of the GB18030 encodings not suitable to represent Unicode.

Maybe I'm being overly pedantic here.

Tom.

On Sun, Oct 2, 2022 at 10:51 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

The summary for the SG16 meeting held September 28th, 2022 is now available. For those that attended, please review and suggest corrections.

https://github.com/sg16-unicode/sg16-meetings/#september-28th-2022

Two polls were taken during this meeting.

The first was for LWG #3767 (codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale) to establish consensus on whether the codecvt facets mentioned in the issue are intended to be locale sensitive. The established position has been conveyed to LWG via GitHub issue 1310.

The second was for LWG #3412 (§[format.string.std] references to "Unicode encoding" unclear) to establish consensus on a direction for a proposed resolution.

Tom.
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16