ISOCPP sg16 List: Re: Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 13 Sep 2023 15:06:01 -0400

On 9/13/23 2:01 PM, Zach Laine wrote:
> On Wed, Sep 13, 2023 at 11:58 AM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>> The following reflects some of my personal thoughts regarding this paper and is intended to be independent of my role as SG16 chair.
>>
>> A future LEWG review will evaluate the paper for the following concerns. As indicated below, not all of them are addressed by the paper. In order to ease LEWG review, I recommend the paper be updated to add new sections to cover the missing concerns.
>>
>> Examples?
>>
>> Yes, in section 4 and in a few other places throughout the paper.
>>
>> Field experience?
>>
>> Yes, in section 6.
>>
>> Performance considerations?
>>
>> No, the word "performance" does not appear in the paper.
>>
>> Discussion of prior art?
>>
>> No, the paper does not discuss existing transcoding facilities like iconv, MultiByteToWideChar, or those provided by ICU.
>>
>> Changes Library Evolution previously requested?
>>
>> N/A.
>>
>> Wording?
>>
>> No. I think the paper presentation would be improved by moving the wording-like synopses to a wording section.
>>
>> Breaking changes?
>>
>> No. The proposal doesn't include any breaking changes, but this isn't stated explicitly.
>>
>> Feature test macro?
>>
>> Yes, in section 5.8.
>>
>> Freestanding considered?
>>
>> No.
>>
>> I find it challenging to determine what error handling semantics are intended to be supported. There is no error handling section and discussion of it is spread throughout the paper. I think it would be helpful to add an error handling section in section 5 and to consolidate discussion of that topic there. This should include a discussion of possible error handling semantics and a description of transcoding_error_handler and use_replacement_character. The semantics discussion should cover things that, as proposed, can't be done (e.g., an error handler can't control how a sequence of ill-formed code units is substituted; it can only provide the character to be substituted).
> This seems like a good idea.
>
>> I think this section should also make it clear exactly how substitutions are performed; the current prose states, 'should use the “substitution of maximal subparts” approach'; I think we want to ensure portable behavior.
> I agree with this, but I want to see an SG-16 vote before making this change.
That is fair. The error handling limitations constrain the options that
are available.
>
>> I think the error handling section should also discuss the consequences of error handling as it relates to implementation of utf_iterator. That is, when ill-formed code units are encountered, decoding must continue until a valid character is decoded (which must then be cached, at least for an underlying input iterator) or until the end of the range is encountered (this continuation is necessary to ensure the "maximal subparts" substitution). The following sequence of dereference and advance operations must then return the code units for the substituted character followed by the code units for the cached decoded character. This implies that the transcoded code unit buffer in the iterator must be large enough to store sequences for two characters. I don't think the paper currently captures this subtlety.
> Nothing goes into that paper without being implemented. I implemented
> utf_iterator, and never needed to keep more than one code point
> around. I don't understand why you think otherwise; the explanation
> above does not make sense to me. Could you rephrase perhaps?

I went back and refreshed my memory of the maximal subpart substitution
algorithm and discovered I had misremembered how it works (I was
thinking more along the lines of policy 1 from PR-121
<http://unicode.org/review/pr-121.html>). So, sorry, what I stated above
is nonsense.

However, lookahead of one code unit is required to match a maximal
subpart of an ill-formed subsequence. That implies caching for that code
unit lookahead is needed, at least for an underlying input iterator. I
don't see that requirement reflected in the paper.

Consider the example input sequence from PR-121. Recognition that the
subsequences starting with F1, E1, and C2 are ill-formed (truncated)
requires observing the first code unit for the next subsequence.

61     | F1 80 80 | E1 80  | C2     | 62
U+0061 | U+FFFD   | U+FFFD | U+FFFD | U+0062
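
To make the lookahead concrete, here is a minimal sketch of a UTF-8 decoder that substitutes U+FFFD for each maximal subpart of an ill-formed subsequence. It is not the paper's utf_iterator; the function name decode_utf8 and the vector-based interface are assumptions chosen for illustration. The second-byte ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard. The point to notice is the check on in[i] before i is advanced: the decoder has to look at the next code unit to decide whether the current sequence is complete, and it must not consume that unit when the check fails. Over an underlying input iterator, that peeked unit would have to be cached somewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only (hypothetical name, not from P2728).
// Decodes UTF-8 and emits U+FFFD once per maximal subpart of an
// ill-formed subsequence.
std::vector<char32_t> decode_utf8(const std::vector<std::uint8_t>& in) {
    const char32_t replacement = 0xFFFD;
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t lead = in[i++];
        if (lead < 0x80) {  // single code unit (ASCII)
            out.push_back(lead);
            continue;
        }
        int need = 0;                        // trailing code units expected
        std::uint8_t lo = 0x80, hi = 0xBF;   // allowed range for the next unit
        char32_t cp = 0;
        if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1; cp = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2; cp = lead & 0x0F;
            if (lead == 0xE0) lo = 0xA0;     // excludes overlong forms
            if (lead == 0xED) hi = 0x9F;     // excludes surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3; cp = lead & 0x07;
            if (lead == 0xF0) lo = 0x90;     // excludes overlong forms
            if (lead == 0xF4) hi = 0x8F;     // excludes values above U+10FFFF
        } else {
            out.push_back(replacement);      // 80..C1 and F5..FF never lead
            continue;
        }
        bool ok = true;
        for (; need > 0; --need) {
            // One code unit of lookahead: inspect in[i] without consuming it.
            // If it cannot continue this sequence, it starts the next one.
            if (i == in.size() || in[i] < lo || in[i] > hi) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (in[i++] & 0x3F);
            lo = 0x80; hi = 0xBF;            // only the second unit is restricted
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

Feeding this the PR-121 sequence 61 F1 80 80 E1 80 C2 62 yields five code points: U+0061, three U+FFFD, and U+0062, matching the table above. In each truncated case the decoder stops only after seeing E1, C2, or 62 respectively, which is the lookahead in question.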

I hope I'm making more sense this time.

Tom.

>
> Zach

Received on 2023-09-13 19:06:02