Date: Mon, 10 Oct 2022 00:01:06 -0400
On 10/2/22 5:45 PM, Corentin Jabot via SG16 wrote:
> On the second poll, I'll copy the message I sent before the meeting
>
> --
>
> There are further issues here.
> The width of a grapheme is independent of encodings.
> We are just not forcing implementations not to decode. Is that what we
> want?
> I don't think it is useful.
> Most encodings cannot represent any of the wide code points; the
> wideness of code points in Shift JIS can be derived without doing a
> full decoding.
>
> Suggested resolution:
> For a string decoded to a sequence of Unicode code points, its width is
> the sum of the estimated widths of the first code points in its extended
> grapheme clusters.
>
> If the intent is for implementers to throw their hands in the air when
> the encoding is not "a Unicode encoding", then surely
> we want to support UTF-8/16/32 and that's it. UTF-EBCDIC isn't more
> important or special than Shift JIS and there is no reason for one
> encoding to have privileged handling over the other.
I think that is where we ended up; the intent is only to specify
behavior for UTF-8, UTF-16, and UTF-32. I think the best we could do for
encodings that are not defined by the C++ standard or one of its
normative references would be to add normative guidance to do likewise
for all implementation-defined encodings; in which case there would be
no need to restrict guidance to Unicode encodings; we could simply
specify widths for characters independently of how they are mapped to
any specific encoding.
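
To make the encoding-independent formulation concrete, here is a minimal
sketch of the suggested rule, assuming the text has already been decoded and
segmented into extended grapheme clusters. The names code_point_width and
text_width are hypothetical, and the table of wide ranges is only a small
illustrative subset, not the list the standard specifies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: estimated display width of a single code point.
// ASSUMPTION: only a small illustrative subset of wide ranges is listed.
int code_point_width(char32_t cp) {
    struct Range { char32_t first, last; };
    static constexpr Range wide[] = {
        {0x1100, 0x115F},  // Hangul Jamo initial consonants
        {0x3040, 0xA4CF},  // Hiragana through Yi, including CJK ideographs
        {0xAC00, 0xD7A3},  // Hangul syllables
        {0xFF00, 0xFF60},  // Fullwidth forms
    };
    for (const Range& r : wide) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}

// Estimated width of text that is already split into extended grapheme
// clusters: the sum of the estimated widths of the first code point of
// each cluster. Segmentation itself is out of scope for this sketch.
std::size_t text_width(const std::vector<std::u32string>& clusters) {
    std::size_t width = 0;
    for (const std::u32string& cluster : clusters) {
        if (!cluster.empty()) {
            width += static_cast<std::size_t>(code_point_width(cluster.front()));
        }
    }
    return width;
}
```

Nothing in this sketch depends on how the code points were encoded before
decoding, which is the point of specifying widths per character.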
>
>
> More generally, any encoding that can round trip through Unicode should
> qualify as a Unicode encoding, but I don't think we have a definition of
> that anywhere.
> Unicode defines Unicode Encoding Form
> > A character encoding form that assigns each Unicode scalar value to
> a unique code unit sequence
>
> --
>
> I.e., I don't think the poll solves anything; it just uses a different
> terminology to describe the same thing (nothing in ISO 10646 leads me
> to believe that "UCS encoding scheme" can only designate UCS encoding
> forms specified in ISO 10646, in addition to being obscure terminology).
It solves the issue that there is no definition for "Unicode encoding".
"UCS encoding scheme" at least has a definition. If the definition in
ISO/IEC 10646 is not clear, then I would argue that is a concern to
raise with WG2.
>
> On GB18030, it's a different character set, with its own set of encodings
I've been under the impression that, as of GB 18030-2022, use of the PUA
is no longer required because all GB 18030 specified characters are now
represented in Unicode. In other words, the Unicode repertoire is a
superset of the GB 18030 repertoire. Is that not correct? Its specified
encodings are, of course, distinct.
Tom.
>
> On Sun, Oct 2, 2022 at 10:51 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> The summary for the SG16 meeting held September 28th, 2022 is now
> available. For those that attended, please review and suggest
> corrections.
>
> * https://github.com/sg16-unicode/sg16-meetings/#september-28th-2022
>
> Two polls were taken during this meeting.
>
> The first was for LWG #3767 (codecvt<charN_t, char8_t, mbstate_t>
> incorrectly added to locale
> <https://cplusplus.github.io/LWG/issue3767>) to establish
> consensus on whether the codecvt facets mentioned in the issue are
> intended to be locale sensitive. The established position has been
> conveyed to LWG via GitHub issue 1310
> <https://github.com/cplusplus/papers/issues/1310>.
>
> The second was for LWG #3412 (§[format.string.std] references to
> "Unicode encoding" unclear
> <https://cplusplus.github.io/LWG/issue3412>) to establish
> consensus on a direction for a proposed resolution.
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
Received on 2022-10-10 04:01:10