sg16

Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 13 Sep 2023 12:58:45 -0400
The following reflects some of my personal thoughts regarding this paper
and is intended to be independent of my role as SG16 chair.

A future LEWG review will evaluate the paper for the following concerns.
As indicated below, not all of them are addressed by the paper. In order
to ease LEWG review, I recommend the paper be updated to add new
sections to cover the missing concerns.

  * Examples?
      o Yes, in section 4 and in a few other places throughout the paper.
  * Field experience?
      o Yes, in section 6.
  * Performance considerations?
      o No, the word "performance" does not appear in the paper.
  * Discussion of prior art?
      o No, the paper does not discuss existing transcoding facilities
        like iconv, MultiByteToWideChar, or those provided by ICU.
  * Changes Library Evolution previously requested?
      o N/A.
  * Wording?
      o No. I think the paper presentation would be improved by moving
        the wording-like synopses to a wording section.
  * Breaking changes?
      o No. The proposal doesn't include any breaking changes, but this
        isn't stated explicitly.
  * Feature test macro?
      o Yes, in section 5.8.
  * Freestanding considered?
      o No.

I find it challenging to determine what error handling semantics are
intended to be supported. There is no error handling section and
discussion of it is spread throughout the paper. I think it would be
helpful to add an error handling section in section 5 and to consolidate
discussion of that topic there. This should include a discussion of
possible error handling semantics and a description of
transcoding_error_handler and use_replacement_character. The semantics
discussion should cover things that, as proposed, can't be done (e.g.,
an error handler can't control how a sequence of ill-formed code units
is substituted; it can only provide the character to be substituted). I
think this section should also make it clear exactly how substitutions
are performed. The current prose states, '*should* use the “substitution
of maximal subparts” approach'; a "should" leaves the number of
substituted characters up to the implementation, and I think we want to
ensure portable behavior.
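To make the portability concern concrete, here is a minimal sketch of UTF-8 decoding with "substitution of maximal subparts" as described in the Unicode Standard. The function name decode_utf8_maximal_subparts is hypothetical and is not part of P2728; the sketch only illustrates how many U+FFFD characters this policy produces for a given ill-formed input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2728): decodes UTF-8 to UTF-32 and replaces
// each maximal subpart of an ill-formed subsequence with one U+FFFD.
std::u32string decode_utf8_maximal_subparts(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) {                  // ASCII
            out.push_back(b0);
            continue;
        }
        int need = 0;                     // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF; // valid range for the second byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;    // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;    // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;    // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;    // reject values above U+10FFFF
        } else {
            out.push_back(U'\uFFFD');     // byte can never start a sequence
            continue;
        }
        bool ok = true;
        for (int k = 0; k < need; ++k) {
            if (i >= in.size()) { ok = false; break; }
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; } // b is not consumed
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;         // later bytes use the full range
        }
        out.push_back(ok ? cp : U'\uFFFD');
    }
    return out;
}
```

With this policy, a truncated three-byte prefix such as F1 80 80 yields a single U+FFFD, whereas a policy that replaces every ill-formed code unit individually would yield three. That difference is what a normative specification would pin down.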

I think the error handling section should also discuss the consequences
of error handling as it relates to implementation of utf_iterator. That
is, when ill-formed code units are encountered, decoding must continue
until a valid character is decoded (which must then be cached, at least
for an underlying input iterator) or until the end of the range is
encountered (this continuation is necessary to ensure the "maximal
subparts" substitution). The following sequence of dereference and
advance operations must then return the code units for the substituted
character followed by the code units for the cached decoded character.
This implies that the transcoded code unit buffer in the iterator must
be large enough to store sequences for two characters. I don't think the
paper currently captures this subtlety.
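The buffer size implication can be sketched as follows, assuming UTF-8 output and the scenario described above (a substituted character followed by a cached, successfully decoded character). The names utf8_out_buffer and buffer_after_error are hypothetical and do not come from P2728; a real utf_iterator may organize its state differently.

```cpp
#include <array>
#include <cstddef>

// Hypothetical output buffer (not from P2728), sized for two characters of
// up to four UTF-8 code units each.
struct utf8_out_buffer {
    std::array<unsigned char, 8> units{};
    std::size_t size = 0;

    void put(char32_t v) { units[size++] = static_cast<unsigned char>(v); }

    // Append the UTF-8 encoding of cp, assumed to be a Unicode scalar value.
    void append(char32_t cp) {
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }
};

// Buffer contents after an ill-formed subsequence that was followed by a
// valid character: U+FFFD first, then the cached decoded character.
utf8_out_buffer buffer_after_error(char32_t cached) {
    utf8_out_buffer b;
    b.append(0xFFFD);
    b.append(cached);
    return b;
}
```

In this sketch, a cached character outside the Basic Multilingual Plane needs seven code units in total (three for U+FFFD plus four), which would not fit in a buffer sized for a single character.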

Tom.

Received on 2023-09-13 16:58:47