Re: Agenda for the 2023-04-12 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 11 Apr 2023 15:49:21 -0400
This is your friendly reminder that this meeting is taking place tomorrow.

Tom.

On 4/7/23 3:01 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, April 12th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20230412T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).
>
> *For those in central Europe, please note that daylight saving time
> has begun since we last met, so this telecon will begin one hour later
> relative to the last telecon.*
>
> The agenda follows.
>
> * P2728R0 <https://wg21.link/p2728r0>: Unicode in the Library, Part
> 1: UTF Transcoding
> o Continue discussion.
>
> Discussion during the 2023-03-22 SG16 telecon
> <https://github.com/sg16-unicode/sg16-meetings#march-22nd-2023>
> included the following topics:
>
> * Use of CTAD vs use of factory functions.
> * View adapters that place constraints on the underlying range but
> don't otherwise apply any adaptation (e.g., as_utf8()).
> * Lack of error handling policies for the transcoding algorithms.
> * Lack of convenient interfaces for handling code unit sequences
> that straddle a buffer boundary (due to network provided or
> segmented data).
> * Whether or how to expose the transcoding iterator type unpacking
> functionality.
> * Use of char32_t vs other types for holding Unicode code point values.
> * Whether and how to optimize the design for types historically used
> for character data vs the charN_t types.
> * The lack of standard library support for charN_t types and the
> impact to charN_t adoption.
> * Designing for composability through the use of elementary building
> blocks.
> * The possibility of removing the front, back, and insert iterators
> in favor of an iterator adapter.
> * The possibility of removing the full set of UTF converting iterators.
> * The need for first class support of UTF-8 data in char-based
> storage, possibly contingent on the choice of literal encoding.
> * Locale considerations and Python's move to C.UTF-8 as its default
> locale.
>
> Note that many of these topics are more LEWG concerns than they are
> SG16 concerns. I think that is ok; the designs we forward should be
> guided by our expectations of what LEWG will find agreeable.
>
> My impression of current consensus based on recent discussion is that
> we wish to be forward looking and focus on support for charN_t types
> with support for other types provided by wrappers, adapters, casts,
> etc. I'd like to poll this.
>
> With regard to segmented data and handling of partial code unit
> sequences at the end of a segment, there are at least two concerns: 1)
> how to transition the boundary without treating the partial sequence
> as an error, and 2) how to handle the transition efficiently. Network
> buffers or data structure segments may provide contiguous data that
> can be processed optimally, but such optimizations cannot be applied
> to the entire sequence due to the segmentation. JeanHeyd's work in
> WG14 N3095 (Restartable and Non-Restartable Functions for Efficient
> Character Conversions)
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm> enables
> such data to be optimally processed by storing partial sequences in
> mbstate_t instances and allowing for continuation with another buffer;
> these are not iterator-based interfaces. The interfaces proposed in
> P2728R0 <https://wg21.link/p2728r0> cannot support such optimizations;
> at least not until support for segmented data concepts is added to the
> ranges library to allow for the identification of contiguous segments
> (we could recognize range-of-ranges designs, but not range designs
> where segmentation is an internal iterator detail). I'd like to
> discuss whether we are comfortable with these limitations or whether
> we would prefer to wait for a partially-contiguous range specification
> so that maximally performant functionality can be provided in a
> range-based interface.
>
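To make the first concern concrete, here is a minimal sketch of restartable decoding across a segment boundary. It is similar in spirit to keeping a partial sequence in an mbstate_t, but it uses none of the N3095 or P2728R0 interfaces. The names utf8_decode_state, decode_segment, and finish are hypothetical, bytes are held in char rather than char8_t only to keep the sketch compilable before C++20, and validation is deliberately simplified.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical state object (illustration only): carries an incomplete
// UTF-8 sequence from the end of one segment into the next.
struct utf8_decode_state {
    char32_t partial = 0;  // payload bits accumulated so far
    int remaining = 0;     // continuation bytes still expected
};

constexpr char32_t replacement_character = U'\uFFFD';

// Decodes one contiguous segment and appends code points to `out`.
// A sequence cut off by the end of the segment is saved in `st` and is
// not reported as an error. Simplified: overlong forms, surrogates, and
// values above U+10FFFF are not rejected.
inline void decode_segment(std::string_view segment, utf8_decode_state& st,
                           std::u32string& out) {
    assert(st.remaining >= 0 && st.remaining <= 3);
    for (const char ch : segment) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (st.remaining > 0) {
            if ((b & 0xC0) == 0x80) {  // continuation byte
                st.partial = (st.partial << 6) | (b & 0x3F);
                if (--st.remaining == 0) out.push_back(st.partial);
                continue;
            }
            // The pending sequence was interrupted: report it, then
            // reinterpret `b` as the start of a new sequence.
            out.push_back(replacement_character);
            st = {};
        }
        if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0) {
            st.partial = b & 0x1F;
            st.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st.partial = b & 0x0F;
            st.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st.partial = b & 0x07;
            st.remaining = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(replacement_character);
        }
    }
}

// Call once after the last segment. Only at this point is a pending
// partial sequence an error. Returns true if the input ended cleanly.
inline bool finish(utf8_decode_state& st, std::u32string& out) {
    const bool clean = (st.remaining == 0);
    if (!clean) out.push_back(replacement_character);
    st = {};
    return clean;
}
```

A caller would invoke decode_segment once per network buffer with the same state object and call finish after the final buffer. The sketch addresses only the first concern (crossing the boundary without a spurious error). It does nothing for the second concern, since it processes one byte at a time and does not exploit contiguity within a segment.
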
> I'd like to spend time discussing the viability of transcoding output
> iterators like utf_8_to_32_out_iterator and utf_16_to_32_out_iterator.
> The issue is that writing a partial code unit sequence to them doesn't
> produce an output, so it isn't clear what happens if no further input
> is ever provided. Is the partial sequence silently lost? Does the
> iterator's destructor throw an exception or otherwise signal an error?
>
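The following sketch illustrates the problem and one possible answer, an explicit finish() call. It is not the utf_8_to_32_out_iterator from P2728R0. The class name utf8_to_utf32_sink and its pending() and finish() members are hypothetical, char is used for the input code units only to avoid requiring C++20, and validation is simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical transcoding output iterator (illustration only). UTF-8
// code units are written to it; UTF-32 code points are forwarded to the
// wrapped output iterator once a sequence is complete.
template <class Out>
class utf8_to_utf32_sink {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_to_utf32_sink(Out out) : out_(out) {}

    // Dereference and increment are no-ops that return *this, so that
    // `*it++ = byte` updates this object and not a temporary copy.
    utf8_to_utf32_sink& operator*() { return *this; }
    utf8_to_utf32_sink& operator++() { return *this; }
    utf8_to_utf32_sink& operator++(int) { return *this; }

    // Writing a code unit may or may not produce output.
    utf8_to_utf32_sink& operator=(char ch) {
        assert(remaining_ >= 0 && remaining_ <= 3);
        const unsigned char b = static_cast<unsigned char>(ch);
        if (remaining_ > 0 && (b & 0xC0) == 0x80) {
            partial_ = (partial_ << 6) | (b & 0x3F);
            if (--remaining_ == 0) emit(partial_);
            return *this;
        }
        if (remaining_ > 0) {  // pending sequence was interrupted
            emit(U'\uFFFD');
            remaining_ = 0;
        }
        if (b < 0x80) {
            emit(b);
        } else if ((b & 0xE0) == 0xC0) {
            partial_ = b & 0x1F;
            remaining_ = 1;
        } else if ((b & 0xF0) == 0xE0) {
            partial_ = b & 0x0F;
            remaining_ = 2;
        } else if ((b & 0xF8) == 0xF0) {
            partial_ = b & 0x07;
            remaining_ = 3;
        } else {
            emit(U'\uFFFD');  // stray continuation or invalid lead byte
        }
        return *this;
    }

    // True while code units have been consumed without producing output.
    bool pending() const { return remaining_ > 0; }

    // One possible policy: the caller must say when input has ended. A
    // pending partial sequence is then replaced with U+FFFD and reported
    // through the return value (true means the input ended cleanly).
    bool finish() {
        const bool clean = (remaining_ == 0);
        if (!clean) {
            emit(U'\uFFFD');
            remaining_ = 0;
        }
        return clean;
    }

private:
    void emit(char32_t c) {
        *out_ = c;
        ++out_;
    }

    Out out_;
    char32_t partial_ = 0;
    int remaining_ = 0;
};
```

Without the finish() call, the truncated sequence is silently dropped when the iterator goes out of scope, which is the behavior the questions above are probing. Note also that algorithms such as std::copy take output iterators by value, so the pending state lives in the returned copy and is lost if the caller discards that return value.
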
> Candidate polls:
>
> 1. UTF transcoding interfaces provided by the C++ standard library
> should operate on charN_t types with support for other types
> provided by adapters.
> 2. The association of a UTF-8 encoding with a sequence of char must
> be explicit in the source code unless the literal encoding is UTF-8.
> 3. The association of a UTF-16 or UTF-32 encoding with a sequence of
> wchar_t must be explicit in the source code unless the wide
> literal encoding is UTF-16 or UTF-32.
> 4. char32_t should be used as the Unicode code point type within the
> C++ standard library.
> 5. Low level transcoding facilities (WG14 N3095
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm>)
> suffice for high speed handling of segmented data structures with
> contiguous segments; high level facilities can rely on iterators
> to abstract such structures.
> 6. /M/x/N/ conversions where /M/ is larger than /N/ (e.g., UTF-8 ->
> UTF-32) shall be performed by view/iterator input adapters, not by
> output adapters.
>
> Tom.
>
>

Received on 2023-04-11 19:49:23