ISOCPP sg16 List: Re: Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 29 Sep 2023 12:50:31 -0400

On 9/13/23 3:31 PM, Zach Laine via SG16 wrote:
> On Wed, Sep 13, 2023 at 2:06 PM Tom Honermann <tom_at_[hidden]> wrote:
>> On 9/13/23 2:01 PM, Zach Laine wrote:
>>
>> On Wed, Sep 13, 2023 at 11:58 AM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> The following reflects some of my personal thoughts regarding this paper and is intended to be independent of my role as SG16 chair.
>>
>> A future LEWG review will evaluate the paper for the following concerns. As indicated below, not all of them are addressed by the paper. In order to ease LEWG review, I recommend the paper be updated to add new sections to cover the missing concerns.
>>
>> Examples?
>>
>> Yes, in section 4 and in a few other places throughout the paper.
>>
>> Field experience?
>>
>> Yes, in section 6.
>>
>> Performance considerations?
>>
>> No, the word "performance" does not appear in the paper.
>>
>> Discussion of prior art?
>>
>> No, the paper does not discuss existing transcoding facilities like iconv, MultiByteToWideChar, or those provided by ICU.
>>
>> Changes Library Evolution previously requested?
>>
>> N/A.
>>
>> Wording?
>>
>> No. I think the paper presentation would be improved by moving the wording-like synopses to a wording section.
>>
>> Breaking changes?
>>
>> No. The proposal doesn't include any breaking changes, but this isn't stated explicitly.
>>
>> Feature test macro?
>>
>> Yes, in section 5.8.
>>
>> Freestanding considered?
>>
>> No.
>>
>> I find it challenging to determine what error handling semantics are intended to be supported. There is no error handling section and discussion of it is spread throughout the paper. I think it would be helpful to add an error handling section in section 5 and to consolidate discussion of that topic there. This should include a discussion of possible error handling semantics and a description of transcoding_error_handler and use_replacement_character. The semantics discussion should cover things that, as proposed, can't be done (e.g., an error handler can't control how a sequence of ill-formed code units is substituted; it can only provide the character to be substituted).
>>
>> This seems like a good idea.
>>
>> I think this section should also make it clear exactly how substitutions are performed; the current prose states, 'should use the “substitution of maximal subparts” approach'; I think we want to ensure portable behavior.
>>
>> I agree with this, but I want to see an SG-16 vote before making this change.
>>
>> That is fair. The error handling limitations constrain the options that are available.
>>
>> I think the error handling section should also discuss the consequences of error handling as it relates to implementation of utf_iterator. That is, when ill-formed code units are encountered, decoding must continue until a valid character is decoded (which must then be cached, at least for an underlying input iterator) or until the end of the range is encountered (this continuation is necessary to ensure the "maximal subparts" substitution). The following sequence of dereference and advance operations must then return the code units for the substituted character followed by the code units for the cached decoded character. This implies that the transcoded code unit buffer in the iterator must be large enough to store sequences for two characters. I don't think the paper currently captures this subtlety.
>>
>> Nothing goes into that paper without being implemented. I implemented
>> utf_iterator, and never needed to keep more than one code point
>> around. I don't understand why you think otherwise; the explanation
>> above does not make sense to me. Could you rephrase perhaps?
>>
>> I went back and refreshed my memory of the maximal subpart substitution algorithm and discovered I had misremembered how it works (I was thinking more along the lines of policy 1 from PR-121). So, sorry, what I stated above is nonsense.
>>
>> However, lookahead of one code unit is required to match a maximal subpart of an ill-formed subsequence. That implies caching for that code unit lookahead is needed, at least for an underlying input iterator. I don't see that requirement reflected in the paper.
>>
>> Consider the example input sequence from PR-121. Recognition that the subsequences starting with F1, E1, and C2 are ill-formed (truncated) requires observing the first code unit for the next subsequence.
>>
>> 61 | F1 80 80 | E1 80 | C2 | 62
>> U+0061 | U+FFFD | U+FFFD | U+FFFD | U+0062
>>
>> I hope I'm making more sense this time.
> It does. That's how it's implemented, BTW. The iterator position for
> the input iterator case is that the iterator is always
> one-past-the-current-code-point. The current code point may be the
> replacement character.

This issue came up in discussion on Mattermost, so I'm reviewing this
conversation.

Zach, I'm not sure I understand your response. My expectation is that,
given a well-formed sequence like F0 90 80 80 61, upon decoding the
first code point, the next read of the base input iterator would return
61. However, if the input was instead an ill-formed sequence like F0 90
80 61, then, upon decoding the first code point (for which a replacement
character is substituted), the input iterator would be at the end of the
input and the buffer would already be holding the last code unit (61)
which represents a character still to be produced. Does that sound
right? If so, the one-past-the-current-code-point description doesn't
seem to always be correct; the buffer can hold both the current
character and (part of) the next character.
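
To make the scenario concrete, here is a minimal C++17 sketch of the
behavior I have in mind. It is not the paper's utf_iterator; every name
in it (SinglePassSource, Utf8Decoder, pending, decode_all) is
hypothetical and exists only for illustration. It decodes UTF-8 from a
source that can be read only once and applies "substitution of maximal
subparts" using the well-formed byte ranges from Table 3-7 of the
Unicode Standard. The truncated-sequence case below uses the input
F0 90 80 61 because a code unit in the range 90..BF must follow F0 for
the prefix to be valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical single-pass source: every code unit can be read only once,
// which mimics the constraint of an underlying input iterator.
struct SinglePassSource {
    std::vector<std::uint8_t> units;
    std::size_t pos = 0;

    std::optional<std::uint8_t> read() {
        if (pos == units.size()) return std::nullopt;
        return units[pos++];
    }
    bool exhausted() const { return pos == units.size(); }
};

// Hypothetical decoder (an assumption for illustration, not the proposed API).
struct Utf8Decoder {
    SinglePassSource src;
    // A code unit that was read from src but belongs to the next character.
    std::optional<std::uint8_t> pending;

    explicit Utf8Decoder(std::vector<std::uint8_t> units)
        : src{std::move(units)} {}

    // Returns the buffered code unit if there is one, else reads from src.
    std::optional<std::uint8_t> take() {
        if (pending) {
            std::uint8_t u = *pending;
            pending.reset();
            return u;
        }
        return src.read();
    }

    // Decodes one code point; returns nullopt at the end of the input.
    std::optional<char32_t> next() {
        std::optional<std::uint8_t> first = take();
        if (!first) return std::nullopt;
        assert(!pending);  // take() always leaves the buffer empty
        const std::uint8_t b = *first;
        if (b < 0x80) return static_cast<char32_t>(b);

        int trailing = 0;
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the 2nd unit
        char32_t cp = 0;
        if (b >= 0xC2 && b <= 0xDF) {
            trailing = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
            trailing = 2; cp = b & 0x0F;
            if (b == 0xE0) lo = 0xA0;  // excludes overlong forms
            if (b == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            trailing = 3; cp = b & 0x07;
            if (b == 0xF0) lo = 0x90;  // excludes overlong forms
            if (b == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            return kReplacement;       // 80..C1 and F5..FF cannot lead
        }
        for (int i = 0; i < trailing; ++i) {
            std::optional<std::uint8_t> u = src.read();
            if (!u) return kReplacement;  // truncated at end of input
            if (*u < lo || *u > hi) {
                pending = *u;  // lookahead unit starts the next character
                return kReplacement;
            }
            cp = (cp << 6) | (*u & 0x3F);
            lo = 0x80; hi = 0xBF;  // later units are always 80..BF
        }
        return cp;
    }
};

std::vector<char32_t> decode_all(std::vector<std::uint8_t> units) {
    Utf8Decoder dec(std::move(units));
    std::vector<char32_t> out;
    while (auto cp = dec.next()) out.push_back(*cp);
    return out;
}
```

With this sketch, decode_all on the PR-121 input 61 F1 80 80 E1 80 C2 62
yields U+0061 U+FFFD U+FFFD U+FFFD U+0062. For F0 90 80 61, after the
first call to next() returns U+FFFD, src is already exhausted and
pending holds 61; for the well-formed F0 90 80 80 61, pending is empty
after the first call and the next read of src returns 61.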

Tom.

>
> Zach

Received on 2023-09-29 16:50:32