Date: Tue, 11 Oct 2022 09:43:37 +0200
On Mon, Oct 10, 2022 at 6:01 AM Tom Honermann <tom_at_[hidden]> wrote:
> On 10/2/22 5:45 PM, Corentin Jabot via SG16 wrote:
>
> On the second poll, I'll copy the message I sent before the meeting
>
> --
>
> There are further issues here.
> The width of grapheme is independent of encodings.
> We are just not forcing implementation not to decode. Is that what we want?
> I don't think it is useful.
> Most encodings cannot represent any of the wide code points, and the
> wideness of code points in Shift JIS can be derived without doing a full
> decoding.
>
> Suggested resolution:
> For a string decoded to a sequence of unicode codepoints, its width is the
> sum of estimated widths of the first code points in its extended grapheme
> clusters.
>
> If the intent is for implementers to throw their hands in the air when the
> encoding is not "a unicode encoding", then surely
> we want to support UTF-8/16/32 and that's it. UTF-EBCDIC isn't more
> important or special than shift-jis and there is no reason for one encoding
> to have privileged handling over the other.
>
> I think that is where we ended up; the intent is only to specify behavior
> for UTF-8, UTF-16, and UTF-32. I think the best we could do for encodings
> that are not defined by the C++ standard or one of its normative references
> would be to add normative guidance to do likewise for all
> implementation-defined encodings; in which case there would be no need to
> restrict guidance to Unicode encodings; we could simply specify widths for
> characters independently of how they are mapped to any specific encoding.
>
>
>
> More generally, any encoding that can round trip through Unicode should
> qualify as a Unicode encoding, but I don't think we have a definition of
> that anywhere.
> Unicode defines Unicode Encoding Form
> > A character encoding form that assigns each Unicode scalar value to a
> unique code unit sequence
>
> --
>
> I.e., I don't think the poll solves anything; it just uses a different
> terminology to describe the same thing (nothing in ISO 10646 leads me to
> believe that "UCS encoding scheme" can only designate UCS encoding forms
> specified in ISO 10646, in addition to being obscure terminology).
>
> It solves the issue that there is no definition for "Unicode encoding".
> "UCS encoding scheme" at least has a definition. If the definition in
> ISO/IEC 10646 is not clear, then I would argue that is a concern to raise
> with WG2.
>
It's clear, just not limited in the way that we want.
I think we would be much better off by saying "For UTF-8, UTF-16 and
UTF-32". It doesn't leave much room for confusion.
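
To illustrate why the width estimate does not depend on the encoding once
the text is decoded, here is a minimal sketch. The names (Range,
code_point_width, string_width) are hypothetical, the ranges are the
width-2 ranges listed in C++20 [format.string.std], and the input is
assumed to already hold only the first code point of each extended
grapheme cluster (decoding and segmentation are not shown):

```cpp
#include <cstddef>
#include <string_view>

// Inclusive code point range (hypothetical helper type).
struct Range { char32_t first, last; };

// Width-2 ranges as listed in C++20 [format.string.std];
// every other code point is estimated as width 1.
constexpr Range wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Estimated width of a single code point.
constexpr int code_point_width(char32_t cp) {
    for (Range r : wide_ranges)
        if (cp >= r.first && cp <= r.last) return 2;
    return 1;
}

// Assumption: `cluster_starts` contains only the first code point of each
// extended grapheme cluster; the segmentation step is out of scope here.
constexpr std::size_t string_width(std::u32string_view cluster_starts) {
    std::size_t width = 0;
    for (char32_t cp : cluster_starts) width += code_point_width(cp);
    return width;
}
```

Nothing in this sketch refers to UTF-8, Shift JIS, or any other encoding;
only the (omitted) decoding step would.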
>
> On GB18030, it's a different character set, with its own set of encodings
>
> I've been under the impression that, as of GB 18030-2022, use of the PUA
> is no longer required because all GB 18030 specified characters are now
> represented in Unicode. In other words, the Unicode repertoire is a
> superset of the GB 18030 repertoire. Is that not correct? Its specified
> encodings are, of course, distinct.
>
Yes, I do believe that, as of this year, all characters representable in GB
18030 can be represented in Unicode.
But the fact that Unicode is a superset makes any of the GB18030 encodings
not suitable to represent Unicode.
Maybe I'm being overly pedantic here.
Tom.
>
>
> On Sun, Oct 2, 2022 at 10:51 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> The summary for the SG16 meeting held September 28th, 2022 is now
>> available. For those that attended, please review and suggest corrections.
>>
>> - https://github.com/sg16-unicode/sg16-meetings/#september-28th-2022
>>
>> Two polls were taken during this meeting.
>>
>> The first was for LWG #3767 (codecvt<charN_t, char8_t, mbstate_t>
>> incorrectly added to locale <https://cplusplus.github.io/LWG/issue3767>)
>> to establish consensus on whether the codecvt facets mentioned in the
>> issue are intended to be locale sensitive. The established position has
>> been conveyed to LWG via GitHub issue 1310
>> <https://github.com/cplusplus/papers/issues/1310>.
>>
>> The second was for LWG #3412 (§[format.string.std] references to
>> "Unicode encoding" unclear <https://cplusplus.github.io/LWG/issue3412>)
>> to establish consensus on a direction for a proposed resolution.
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
>
Received on 2022-10-11 07:43:49