ISOCPP sg16 List: Agenda for the 2023-04-12 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 7 Apr 2023 15:01:20 -0400

SG16 will hold a telecon on Wednesday, April 12th, at 19:30 UTC
(timezone conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20230412T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).

*For those in central Europe, please note that daylight saving time
has begun since we last met, so this telecon will begin one hour later
relative to the last telecon.*

The agenda follows.

  * P2728R0 <https://wg21.link/p2728r0>: Unicode in the Library, Part 1:
    UTF Transcoding
      o Continue discussion.

Discussion during the 2023-03-22 SG16 telecon
<https://github.com/sg16-unicode/sg16-meetings#march-22nd-2023> included
the following topics:

  * Use of CTAD vs use of factory functions.
  * View adapters that place constraints on the underlying range but
    don't otherwise apply any adaptation (e.g., as_utf8()).
  * Lack of error handling policies for the transcoding algorithms.
  * Lack of convenient interfaces for handling code unit sequences that
    straddle a buffer boundary (due to network provided or segmented data).
  * Whether or how to expose the transcoding iterator type unpacking
    functionality.
  * Use of char32_t vs other types for holding Unicode code point values.
  * Whether and how to optimize the design for types historically used
    for character data vs the charN_t types.
  * The lack of standard library support for charN_t types and the
    impact on charN_t adoption.
  * Designing for composability through the use of elementary building
    blocks.
  * The possibility of removing the front, back, and insert iterators in
    favor of an iterator adapter.
  * The possibility of removing the full set of UTF converting iterators.
  * The need for first class support of UTF-8 data in char-based
    storage, possibly contingent on the choice of literal encoding.
  * Locale considerations and Python's move to C.UTF-8 as its default
    locale.

Note that many of these topics are more LEWG concerns than they are SG16
concerns. I think that is ok; the designs we forward should be guided by
our expectations of what LEWG will find agreeable.

My impression of current consensus based on recent discussion is that we
wish to be forward looking and focus on support for charN_t types with
support for other types provided by wrappers, adapters, casts, etc.
I'd like to poll this.

With regard to segmented data and handling of partial code unit
sequences at the end of a segment, there are at least two concerns: 1)
how to transition the boundary without treating the partial sequence as
an error, and 2) how to handle the transition efficiently. Network
buffers or data structure segments may provide contiguous data that can
be processed optimally, but such optimizations cannot be applied to the
entire sequence due to the segmentation. JeanHeyd's work in WG14 N3095
(Restartable and Non-Restartable Functions for Efficient Character
Conversions)
<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm> enables
such data to be optimally processed by storing partial sequences in
mbstate_t instances and allowing for continuation with another buffer;
these are not iterator-based interfaces. The interfaces proposed in
P2728R0 <https://wg21.link/p2728r0> cannot support such optimizations;
at least not until support for segmented data concepts is added to the
ranges library to allow for the identification of contiguous segments
(we could recognize range-of-ranges designs, but not range designs where
segmentation is an internal iterator detail). I'd like to discuss
whether we are comfortable with these limitations or whether we would
prefer to wait for a partially-contiguous range specification so that
maximally performant functionality can be provided in a range-based
interface.
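
To make the first concern concrete, here is a minimal sketch of carrying
a partial UTF-8 sequence across a segment boundary. The names
utf8_decode_state and decode_segment are hypothetical; they are not
interfaces from P2728R0 or N3095. The sketch assumes ill-formed input is
replaced with U+FFFD, and it omits the checks for overlong encodings,
surrogates, and values above U+10FFFF that a conforming decoder needs. It
says nothing about the second concern, since it handles one code unit at
a time.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration; not part of P2728R0 or N3095.
struct utf8_decode_state {
    char32_t partial = 0;  // payload bits gathered from the sequence so far
    int remaining = 0;     // continuation code units still expected
};

constexpr char32_t replacement_character = 0xFFFD;

// Decodes one contiguous segment and appends code points to `out`.
// An incomplete sequence at the end of the segment is kept in `st`
// instead of being reported as an error, so the next call can finish it.
// CodeUnit may be char, unsigned char, or (in C++20) char8_t.
template <class CodeUnit>
void decode_segment(utf8_decode_state& st, const CodeUnit* data,
                    std::size_t size, std::u32string& out) {
    for (std::size_t i = 0; i < size; ++i) {
        const unsigned char b = static_cast<unsigned char>(data[i]);
        if (st.remaining > 0) {
            if ((b & 0xC0) == 0x80) {  // continuation code unit
                st.partial = (st.partial << 6) | (b & 0x3F);
                if (--st.remaining == 0) out.push_back(st.partial);
                continue;
            }
            // The pending sequence was cut short by a non-continuation
            // code unit: replace it, then treat `b` as a new lead below.
            out.push_back(replacement_character);
            st = utf8_decode_state{};
        }
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));
        } else if ((b & 0xE0) == 0xC0) {
            st.partial = b & 0x1F;
            st.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st.partial = b & 0x0F;
            st.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st.partial = b & 0x07;
            st.remaining = 3;
        } else {
            out.push_back(replacement_character);  // stray continuation or invalid lead
        }
    }
}
```

Feeding the code units E2 82 in one segment and AC in the next yields
U+20AC once the second segment arrives. After the first call,
st.remaining is 1 and no error has been reported.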

I'd like to spend time discussing the viability of transcoding output
iterators like utf_8_to_32_out_iterator and utf_16_to_32_out_iterator.
The issue is that writing a partial code unit sequence to them doesn't
produce an output, so it isn't clear what happens if no further input is
ever provided. Is the partial sequence silently lost? Does the
iterator's destructor throw an exception or otherwise signal an error?
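
The question can be reproduced with a small stand-in. The class below,
utf8_to_32_out_sketch, is hypothetical and is not the
utf_8_to_32_out_iterator proposed in P2728R0; it only assumes the general
shape of a stateful transcoding output iterator. The pending() member is
likewise an invention for observing the buffered state, and validation of
overlong encodings, surrogates, and out-of-range values is omitted.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical stand-in for a UTF-8 -> UTF-32 transcoding output iterator.
template <class Out>
class utf8_to_32_out_sketch {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_to_32_out_sketch(Out out) : out_(out) {}

    // Accepts one UTF-8 code unit; writes to the wrapped iterator only
    // when a whole code point has been assembled.
    utf8_to_32_out_sketch& operator=(unsigned char b) {
        if (remaining_ > 0) {
            if ((b & 0xC0) == 0x80) {
                partial_ = (partial_ << 6) | (b & 0x3F);
                if (--remaining_ == 0) emit(partial_);
                return *this;
            }
            emit(0xFFFD);  // sequence interrupted by a non-continuation unit
            remaining_ = 0;
        }
        if (b < 0x80) emit(b);
        else if ((b & 0xE0) == 0xC0) { partial_ = b & 0x1F; remaining_ = 1; }
        else if ((b & 0xF0) == 0xE0) { partial_ = b & 0x0F; remaining_ = 2; }
        else if ((b & 0xF8) == 0xF0) { partial_ = b & 0x07; remaining_ = 3; }
        else emit(0xFFFD);
        return *this;
    }

    utf8_to_32_out_sketch& operator*() { return *this; }
    utf8_to_32_out_sketch& operator++() { return *this; }
    // Returns a reference so that `*it++ = b` updates this object's state.
    utf8_to_32_out_sketch& operator++(int) { return *this; }

    // True while a partial sequence is buffered and nothing was written for it.
    bool pending() const { return remaining_ > 0; }

private:
    void emit(char32_t cp) { *out_ = cp; ++out_; }

    Out out_;
    char32_t partial_ = 0;
    int remaining_ = 0;
};
```

Writing E2 and 82 through this iterator produces no output and leaves
pending() true. If the iterator is then destroyed, the sketch silently
drops the partial sequence, which is exactly the behavior in question.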

  Candidate polls:

1. UTF transcoding interfaces provided by the C++ standard library
    should operate on charN_t types with support for other types
    provided by adapters.
2. The association of a UTF-8 encoding with a sequence of char must be
    explicit in the source code unless the literal encoding is UTF-8.
3. The association of a UTF-16 or UTF-32 encoding with a sequence of
    wchar_t must be explicit in the source code unless the wide literal
    encoding is UTF-16 or UTF-32.
4. char32_t should be used as the Unicode code point type within the
    C++ standard library.
5. Low level transcoding facilities (WG14 N3095
    <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm>)
    suffice for high speed handling of segmented data structures with
    contiguous segments; high level facilities can rely on iterators to
    abstract such structures.
6. /M/x/N/ conversions where /M/ is larger than /N/ (e.g., UTF-8 ->
    UTF-32) shall be performed by view/iterator input adapters, not by
    output adapters.

Tom.

Received on 2023-04-07 19:01:24