Re: Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Wed, 13 Sep 2023 13:01:40 -0500
On Wed, Sep 13, 2023 at 11:58 AM Tom Honermann via SG16
<sg16_at_[hidden]> wrote:
>
> The following reflects some of my personal thoughts regarding this paper and is intended to be independent of my role as SG16 chair.
>
> A future LEWG review will evaluate the paper for the following concerns. As indicated below, not all of them are addressed by the paper. In order to ease LEWG review, I recommend the paper be updated to add new sections to cover the missing concerns.
>
> Examples?
>
> Yes, in section 4 and in a few other places throughout the paper.
>
> Field experience?
>
> Yes, in section 6.
>
> Performance considerations?
>
> No, the word "performance" does not appear in the paper.
>
> Discussion of prior art?
>
> No, the paper does not discuss existing transcoding facilities like iconv, MultiByteToWideChar, or those provided by ICU.
>
> Changes Library Evolution previously requested?
>
> N/A.
>
> Wording?
>
> No. I think the paper presentation would be improved by moving the wording-like synopses to a wording section.
>
> Breaking changes?
>
> No. The proposal doesn't include any breaking changes, but this isn't stated explicitly.
>
> Feature test macro?
>
> Yes, in section 5.8.
>
> Freestanding considered?
>
> No.
>
> I find it challenging to determine what error handling semantics are intended to be supported. There is no error handling section and discussion of it is spread throughout the paper. I think it would be helpful to add an error handling section in section 5 and to consolidate discussion of that topic there. This should include a discussion of possible error handling semantics and a description of transcoding_error_handler and use_replacement_character. The semantics discussion should cover things that, as proposed, can't be done (e.g., an error handler can't control how a sequence of ill-formed code units is substituted; it can only provide the character to be substituted).

This seems like a good idea.
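
To make the limitation in the quoted paragraph concrete, here is a minimal sketch. It assumes an error handler is a callable that receives a diagnostic message and returns the single code point to substitute. The names `replacement_handler`, `question_mark_handler`, and `on_ill_formed` are hypothetical and invented for illustration; the paper's actual `transcoding_error_handler` and `use_replacement_character` may be specified differently.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical handler (not the paper's type): ignores the diagnostic
// and always yields U+FFFD REPLACEMENT CHARACTER.
struct replacement_handler {
    constexpr char32_t operator()(std::string_view) const noexcept {
        return U'\uFFFD';
    }
};

// Hypothetical handler that substitutes '?' instead.
struct question_mark_handler {
    constexpr char32_t operator()(std::string_view) const noexcept {
        return U'?';
    }
};

// Hypothetical hook a decoder might call on ill-formed input.
// The handler only chooses WHICH code point is emitted. How many code
// units one substitution covers is decided by the decoder, not here.
template <class Handler>
constexpr char32_t on_ill_formed(Handler h, std::string_view what) {
    assert(!what.empty());  // a diagnostic is always supplied in this sketch
    return h(what);
}

static_assert(on_ill_formed(replacement_handler{}, "truncated sequence") == U'\uFFFD');
```

Under these assumptions, a handler has no way to say "replace this whole run with one character" or "skip these code units"; that policy lives entirely in the transcoding code that calls it.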

> I think this section should also make it clear exactly how substitutions are performed; the current prose states, 'should use the “substitution of maximal subparts” approach'; I think we want to ensure portable behavior.

I agree with this, but I want to see an SG16 vote before making this change.
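
For readers unfamiliar with the term, the following is a minimal sketch of the "substitution of maximal subparts" practice described in the Unicode Standard, applied to UTF-8 to UTF-32 decoding. The function name `decode_utf8` is hypothetical. This is not the paper's `utf_iterator` and makes no claim about how that iterator is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 to UTF-32, emitting one U+FFFD per
// maximal subpart of an ill-formed subsequence.
std::u32string decode_utf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) { out.push_back(b0); continue; }  // ASCII

        // Lead byte fixes the number of trailing bytes and the allowed
        // range of the FIRST trailing byte (well-formed UTF-8 byte table).
        int need = 0;
        unsigned char lo = 0x80, hi = 0xBF;
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            out.push_back(0xFFFD);      // byte that can never start a sequence
            continue;
        }

        bool ok = true;
        for (int k = 0; k < need; ++k) {
            if (i >= in.size()) { ok = false; break; }  // truncated at end
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; }  // offending byte is NOT consumed
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;       // later trailing bytes use the full range
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

With the byte sequence `61 F1 80 80 E1 80 C2 62 80 63 80 BF 64`, this sketch produces `U+0061 U+FFFD U+FFFD U+FFFD U+0062 U+FFFD U+0063 U+FFFD U+FFFD U+0064`: the truncated prefixes `F1 80 80`, `E1 80`, and `C2` each become one replacement character, and each stray trailing byte becomes its own.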

> I think the error handling section should also discuss the consequences of error handling as it relates to implementation of utf_iterator. That is, when ill-formed code units are encountered, decoding must continue until a valid character is decoded (which must then be cached, at least for an underlying input iterator) or until the end of the range is encountered (this continuation is necessary to ensure the "maximal subparts" substitution). The following sequence of dereference and advance operations must then return the code units for the substituted character followed by the code units for the cached decoded character. This implies that the transcoded code unit buffer in the iterator must be large enough to store sequences for two characters. I don't think the paper currently captures this subtlety.

Nothing goes into that paper without being implemented. I implemented
utf_iterator, and never needed to keep more than one code point
around. I don't understand why you think otherwise; the explanation
above does not make sense to me. Could you rephrase perhaps?

Zach

Received on 2023-09-13 18:01:53