On Tue, Apr 11, 2023 at 9:49 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

This is your friendly reminder that this meeting is taking place tomorrow.

Tom.

On 4/7/23 3:01 PM, Tom Honermann via SG16 wrote:

SG16 will hold a telecon on Wednesday, April 12th, at 19:30 UTC (timezone conversion).

For those in central Europe, please note that daylight saving time has begun since we last met, so this telecon will begin one hour later relative to the last telecon.

The agenda follows.

P2728R0: Unicode in the Library, Part 1: UTF Transcoding

Continue discussion.

Discussion during the 2023-03-22 SG16 telecon included the following topics:

Use of CTAD vs use of factory functions.

View adapters that place constraints on the underlying range but don't otherwise apply any adaptation (e.g., as_utf8()).

Lack of error handling policies for the transcoding algorithms.

Lack of convenient interfaces for handling code unit sequences that straddle a buffer boundary (due to network provided or segmented data).

Whether or how to expose the transcoding iterator type unpacking functionality.

Use of char32_t vs other types for holding Unicode code point values.

Whether and how to optimize the design for types historically used for character data vs the charN_t types.

The lack of standard library support for charN_t types and the impact to charN_t adoption.

Designing for composability through the use of elementary building blocks.

The possibility of removing the front, back, and insert iterators in favor of an iterator adapter.

The possibility of removing the full set of UTF converting iterators.

The need for first class support of UTF-8 data in char-based storage, possibly contingent on the choice of literal encoding.

Locale considerations and Python's move to C.UTF-8 as its default locale.

Note that many of these topics are more LEWG concerns than they are SG16 concerns. I think that is ok; the designs we forward should be guided by our expectations of what LEWG will find agreeable.

My impression of the current consensus, based on recent discussion, is that we wish to be forward looking and focus on support for the charN_t types, with support for other types provided by wrappers, adapters, casts, etc. I'd like to poll this.

With regard to segmented data and handling of partial code unit sequences at the end of a segment, there are at least two concerns: 1) how to transition the boundary without treating the partial sequence as an error, and 2) how to handle the transition efficiently. Network buffers or data structure segments may provide contiguous data that can be processed optimally, but such optimizations cannot be applied to the entire sequence due to the segmentation. JeanHeyd's work in WG14 N3095 (Restartable and Non-Restartable Functions for Efficient Character Conversions) enables such data to be optimally processed by storing partial sequences in mbstate_t instances and allowing for continuation with another buffer; these are not iterator-based interfaces. The interfaces proposed in P2728R0 cannot support such optimizations, at least not until support for segmented data concepts is added to the ranges library to allow for the identification of contiguous segments (we could recognize range-of-ranges designs, but not range designs where segmentation is an internal iterator detail). I'd like to discuss whether we are comfortable with these limitations or whether we would prefer to wait for a partially-contiguous range specification so that maximally performant functionality can be provided in a range-based interface.
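To make the boundary concern concrete, here is a minimal sketch of a restartable decoder that carries a partial UTF-8 sequence from one buffer to the next. The names utf8_decode_state and decode_segment are hypothetical; they are not taken from N3095 or P2728R0. The state struct only plays the role that mbstate_t plays in the N3095 design. The sketch assumes simplified validation: overlong forms and surrogates are not rejected.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical state object (an assumption for illustration): it remembers
// a partial code unit sequence between calls, as mbstate_t would.
struct utf8_decode_state {
    char32_t partial = 0;  // bits accumulated so far
    int remaining = 0;     // continuation bytes still expected
};

// Decodes one contiguous segment and appends code points to `out`.
// An incomplete trailing sequence is kept in `st` instead of being
// reported as an error, so the next segment can complete it.
inline void decode_segment(std::string_view segment, utf8_decode_state& st,
                           std::vector<char32_t>& out) {
    assert(st.remaining >= 0 && st.remaining <= 3);
    const char32_t replacement = 0xFFFD;
    for (unsigned char b : segment) {
        if (st.remaining > 0) {
            if ((b & 0xC0) == 0x80) {
                st.partial = (st.partial << 6) | (b & 0x3F);
                if (--st.remaining == 0) out.push_back(st.partial);
                continue;
            }
            // The pending sequence was cut short by a non-continuation byte.
            out.push_back(replacement);
            st.remaining = 0;
        }
        if (b < 0x80)                { out.push_back(b); }
        else if ((b & 0xE0) == 0xC0) { st.partial = b & 0x1F; st.remaining = 1; }
        else if ((b & 0xF0) == 0xE0) { st.partial = b & 0x0F; st.remaining = 2; }
        else if ((b & 0xF8) == 0xF0) { st.partial = b & 0x07; st.remaining = 3; }
        else                         { out.push_back(replacement); }
    }
}
```

Each call can run a tight loop over a contiguous segment, and the caller decides what a non-zero st.remaining means once the final segment has been processed.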

I'd like to spend time discussing the viability of transcoding output iterators like utf_8_to_32_out_iterator and utf_16_to_32_out_iterator. The issue is that writing a partial code unit sequence to them doesn't produce an output, so it isn't clear what happens if no further input is ever provided. Is the partial sequence silently lost? Does the iterator's destructor throw an exception or otherwise signal an error?
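The following sketch illustrates the problem with a stand-in for such an output iterator. The class utf8_to_32_out_sketch and its has_pending() member are hypothetical and are not the P2728R0 classes; validation is simplified in the same way as a real implementation would not be.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in (an assumption for illustration) for a
// UTF-8 -> UTF-32 transcoding output iterator.
template <class Out>
class utf8_to_32_out_sketch {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_to_32_out_sketch(Out out) : out_(out) {}

    // Writing a code unit either completes a code point, which is then
    // forwarded to the wrapped iterator, or is only buffered.
    utf8_to_32_out_sketch& operator=(char cu) {
        assert(remaining_ >= 0 && remaining_ <= 3);
        const char32_t replacement = 0xFFFD;
        const auto b = static_cast<unsigned char>(cu);
        if (remaining_ > 0 && (b & 0xC0) == 0x80) {
            partial_ = (partial_ << 6) | (b & 0x3F);
            if (--remaining_ == 0) *out_++ = partial_;
            return *this;
        }
        if (remaining_ > 0) { *out_++ = replacement; remaining_ = 0; }
        if (b < 0x80)                { *out_++ = static_cast<char32_t>(b); }
        else if ((b & 0xE0) == 0xC0) { partial_ = b & 0x1F; remaining_ = 1; }
        else if ((b & 0xF0) == 0xE0) { partial_ = b & 0x0F; remaining_ = 2; }
        else if ((b & 0xF8) == 0xF0) { partial_ = b & 0x07; remaining_ = 3; }
        else                         { *out_++ = replacement; }
        return *this;
    }
    utf8_to_32_out_sketch& operator*() { return *this; }
    utf8_to_32_out_sketch& operator++() { return *this; }
    utf8_to_32_out_sketch& operator++(int) { return *this; }

    // Hypothetical query; without something like it, a truncated
    // sequence is simply dropped when the iterator goes away.
    bool has_pending() const { return remaining_ > 0; }

private:
    Out out_;
    char32_t partial_ = 0;
    int remaining_ = 0;
};
```

If the caller writes 'a', 0xE2, 0x82 and then stops, only 'a' reaches the wrapped iterator. The two buffered code units live in the iterator object (or in whichever copy an algorithm returned), and nothing in the output iterator requirements gives the adapter a point at which to flush or report them.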

Candidate polls:

UTF transcoding interfaces provided by the C++ standard library should operate on charN_t types with support for other types provided by adapters.

The association of a UTF-8 encoding with a sequence of char must be explicit in the source code unless the literal encoding is UTF-8.

The association of a UTF-16 or UTF-32 encoding with a sequence of wchar_t must be explicit in the source code unless the wide literal encoding is UTF-16 or UTF-32.

char32_t should be used as the Unicode code point type within the C++ standard library.

Low level transcoding facilities (WG14 N3095) suffice for high speed handling of segmented data structures with contiguous segments; high level facilities can rely on iterators to abstract such structures.

MxN conversions where M is larger than N (e.g., UTF-8 -> UTF-32) shall be performed by view/iterator input adapters, not by output adapters.
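As background for the last poll, here is a rough sketch of why the input side is easier: a pull-style decoder sees the end of the underlying range and can therefore report a truncated sequence. The helper decode_next is hypothetical and is not part of P2728R0; it assumes simplified validation (overlong forms and surrogates are not rejected).

```cpp
#include <cassert>
#include <string>

// Hypothetical pull-style helper (an assumption for illustration): reads
// one code point from [first, last) and advances `first` past it.
// Returns U+FFFD when the input ends mid-sequence or a byte is unexpected.
template <class It>
char32_t decode_next(It& first, It last) {
    assert(first != last);
    const char32_t replacement = 0xFFFD;
    const auto lead = static_cast<unsigned char>(*first++);
    if (lead < 0x80) return lead;

    char32_t cp = 0;
    int extra = 0;
    if ((lead & 0xE0) == 0xC0)      { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else return replacement;

    for (; extra > 0; --extra) {
        // Truncation is observable here because the adapter knows `last`.
        if (first == last) return replacement;
        const auto b = static_cast<unsigned char>(*first);
        // Leave an unexpected byte in place so the next call sees it.
        if ((b & 0xC0) != 0x80) return replacement;
        cp = (cp << 6) | (b & 0x3F);
        ++first;
    }
    return cp;
}
```

A view or input iterator built on such a helper can apply whatever error policy is chosen at the point where the range ends, which is exactly the point that an output adapter never observes.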

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16