Date: Sat, 18 Dec 2021 09:36:00 -0500
On 12/18/21 3:54 AM, Jens Maurer wrote:
> On 18/12/2021 00.18, Tom Honermann via SG16 wrote:
>> 8. Specify how invalid code unit sequences are to be handled. This includes specifying, at least for self synchronizing encodings like UTF-8, UTF-16, and UTF-32, how such sequences are delimited. References to the Unicode standard (as indicated in the editor notes in the linked meeting summary) and/or WhatWG Encoding Standard are advised. This also includes specifying how wide strings are handled; presumably each wchar_t value in an ill-formed code unit sequence would be formatted as a single hex escape.
> Since C++ is an ISO standard, normative references to other ISO standards
> (e.g. ISO 10646) as opposed to third-party standards (e.g. Unicode) are
> preferred, per ISO policy.
Thank you for the reminder, Jens. My comment about providing a reference
was intended for the paper prose, not the wording.
If a wording reference is needed, the following from the terms and
definitions section of ISO/IEC 10646:2020
<https://www.iso.org/standard/76835.html> (freely available from here
<https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html>)
may be helpful. Unfortunately, the document does not appear to include
any discussion of these terms.
* *3.32, ill-formed code unit sequence*
  "UCS code unit sequence that purports to be in a UCS encoding form
  that does not conform to the specification of that encoding form"
* *3.33, ill-formed code unit subsequence*
  "non-empty subsequence of a code unit sequence X that does not
  contain any code units that also belong to a minimal well-formed
  code unit subsequence of X"
* *3.39, minimal well-formed code unit sequence*
  "well-formed code unit sequence that maps to a single UCS scalar value"
* *3.59, well-formed code unit sequence*
  "UCS code unit sequence that purports to be in a UCS encoding form
  that conforms to the specification of that encoding form and
  contains no ill-formed code unit subsequence"
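
To illustrate how these definitions might drive the escaping discussed in item 8, here is a rough sketch for UTF-8 only. It copies every minimal well-formed code unit sequence through unchanged and formats each remaining code unit as its own hex escape. The function names (well_formed_length, escape_ill_formed) and the \x{..} escape spelling are assumptions made for this example, not proposed wording, and the byte ranges follow my reading of the well-formed UTF-8 byte sequences.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: length (1-4) of the minimal well-formed UTF-8 code
// unit sequence that starts at s[i], or 0 if no such sequence starts there.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    // Out-of-range positions yield 0x100, which matches no byte range below.
    auto byte = [&](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto in = [](unsigned b, unsigned lo, unsigned hi) {
        return b >= lo && b <= hi;
    };
    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (in(b0, 0xC2, 0xDF))
        return in(byte(i + 1), 0x80, 0xBF) ? 2 : 0;
    if (in(b0, 0xE0, 0xEF)) {
        const unsigned lo = b0 == 0xE0 ? 0xA0 : 0x80;  // rejects overlong forms
        const unsigned hi = b0 == 0xED ? 0x9F : 0xBF;  // rejects surrogates
        return in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF) ? 3 : 0;
    }
    if (in(b0, 0xF0, 0xF4)) {
        const unsigned lo = b0 == 0xF0 ? 0x90 : 0x80;  // rejects overlong forms
        const unsigned hi = b0 == 0xF4 ? 0x8F : 0xBF;  // rejects > U+10FFFF
        return in(byte(i + 1), lo, hi) && in(byte(i + 2), 0x80, 0xBF) &&
               in(byte(i + 3), 0x80, 0xBF) ? 4 : 0;
    }
    return 0;  // 0x80-0xC1 and 0xF5-0xFF never start a well-formed sequence
}

// Hypothetical formatter: well-formed sequences pass through; every code
// unit outside one is written as a single \x{hh} escape (assumed spelling).
std::string escape_ill_formed(std::string_view s) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            out.append(s.substr(i, n));
            i += n;
        } else {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0xF];
            out += '}';
            ++i;
        }
    }
    return out;
}
```

With this approach a truncated sequence such as E2 82 comes out as two escapes rather than one, which matches the "one escape per code unit" presumption. How maximal ill-formed subsequences should be delimited is a separate question that the sketch does not address.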
Tom.
Received on 2021-12-18 08:36:07