Re: Agenda for the 2023-04-12 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 12 Apr 2023 21:36:44 +0200
This is happening now

On Tue, Apr 11, 2023 at 9:49 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> This is your friendly reminder that this meeting is taking place tomorrow.
> Tom.
> On 4/7/23 3:01 PM, Tom Honermann via SG16 wrote:
> SG16 will hold a telecon on Wednesday, April 12th, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20230412T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>
> ).
> *For those in central Europe, please note that daylight saving time has
> begun since we last met, so this telecon will begin one hour later in local
> time than the last telecon.*
> The agenda follows.
> - P2728R0 <https://wg21.link/p2728r0>: Unicode in the Library, Part 1:
> UTF Transcoding
> - Continue discussion.
> Discussion during the 2023-03-22 SG16 telecon
> <https://github.com/sg16-unicode/sg16-meetings#march-22nd-2023> included
> the following topics:
> - Use of CTAD vs use of factory functions.
> - View adapters that place constraints on the underlying range but
> don't otherwise apply any adaptation (e.g., as_utf8()).
> - Lack of error handling policies for the transcoding algorithms.
> - Lack of convenient interfaces for handling code unit sequences that
> straddle a buffer boundary (due to network provided or segmented data).
> - Whether or how to expose the transcoding iterator type unpacking
> functionality.
> - Use of char32_t vs other types for holding Unicode code point values.
> - Whether and how to optimize the design for types historically used
> for character data vs the charN_t types.
> - The lack of standard library support for charN_t types and the
> impact to charN_t adoption.
> - Designing for composability through the use of elementary building
> blocks.
> - The possibility of removing the front, back, and insert iterators in
> favor of an iterator adapter.
> - The possibility of removing the full set of UTF converting iterators.
> - The need for first class support of UTF-8 data in char-based
> storage, possibly contingent on the choice of literal encoding.
> - Locale considerations and Python's move to C.UTF-8 as its default
> locale.
> Note that many of these topics are more LEWG concerns than they are SG16
> concerns. I think that is ok; the designs we forward should be guided by
> our expectations of what LEWG will find agreeable.
> My impression of current consensus based on recent discussion is that we
> wish to be forward looking and focus on support for charN_t types with
> support for other types provided by wrappers, adapters, casts, etc. I'd
> like to poll this.
> With regard to segmented data and handling of partial code unit sequences
> at the end of a segment, there are at least two concerns: 1) how to
> transition the boundary without treating the partial sequence as an error,
> and 2) how to handle the transition efficiently. Network buffers or data
> structure segments may provide contiguous data that can be processed
> optimally, but such optimizations cannot be applied to the entire sequence
> due to the segmentation. JeanHeyd's work in WG14 N3095 (Restartable and
> Non-Restartable Functions for Efficient Character Conversions)
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm> enables such
> data to be optimally processed by storing partial sequences in mbstate_t
> instances and allowing for continuation with another buffer; these are not
> iterator-based interfaces. The interfaces proposed in P2728R0
> <https://wg21.link/p2728r0> cannot support such optimizations; at least
> not until support for segmented data concepts is added to the ranges
> library to allow for the identification of contiguous segments (we could
> recognize range-of-ranges designs, but not range designs where segmentation
> is an internal iterator detail). I'd like to discuss whether we are
> comfortable with these limitations or whether we would prefer to wait for a
> partially-contiguous range specification so that maximally performant
> functionality can be provided in a range-based interface.
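
To make the first concern concrete, here is a minimal sketch of carrying a partial UTF-8 sequence across a segment boundary. The names utf8_decode_state and decode_segment are hypothetical; they appear in neither P2728R0 nor N3095. The state object only plays the role that mbstate_t plays in the N3095 interfaces. The sketch assumes plain char storage for the bytes, replaces ill-formed input with U+FFFD, and does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical state carried between segments (illustration only).
struct utf8_decode_state {
    char32_t partial = 0;  // bits accumulated from the sequence so far
    int remaining = 0;     // continuation bytes still expected
};

// Decodes one segment of UTF-8 bytes and appends code points to `out`.
// An incomplete sequence at the end of the segment is kept in `st`
// rather than being reported as an error.
// Simplification: ill-formed input becomes U+FFFD; overlong forms and
// surrogates are not rejected.
inline void decode_segment(const std::string& segment,
                           utf8_decode_state& st,
                           std::u32string& out) {
    assert(st.remaining >= 0 && st.remaining <= 3);
    for (char c : segment) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (st.remaining > 0) {
            if ((b & 0xC0) == 0x80) {  // continuation byte
                st.partial = (st.partial << 6) | (b & 0x3F);
                if (--st.remaining == 0) out.push_back(st.partial);
                continue;
            }
            out.push_back(U'\uFFFD');  // sequence was cut short
            st = utf8_decode_state{};
            // fall through and reinterpret b as the start of a sequence
        }
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));
        } else if ((b & 0xE0) == 0xC0) {
            st.partial = b & 0x1F; st.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st.partial = b & 0x0F; st.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st.partial = b & 0x07; st.remaining = 3;
        } else {
            out.push_back(U'\uFFFD');  // stray continuation or invalid byte
        }
    }
}
```

Calling decode_segment once per network buffer with the same state object lets a sequence that straddles two buffers decode as one code point. It still visits every byte individually, so it addresses the first concern (no spurious error at the boundary) but not the second (efficient processing of each contiguous segment).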
> I'd like to spend time discussing the viability of transcoding output
> iterators like utf_8_to_32_out_iterator and utf_16_to_32_out_iterator.
> The issue is that writing a partial code unit sequence to them doesn't
> produce any output, so it isn't clear what happens if no further input is
> ever provided. Is the partial sequence silently lost? Does the iterator's
> destructor throw an exception or otherwise signal an error?
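
The question can be illustrated with a small sketch. The class utf8_to_utf32_sink below is hypothetical and is not the utf_8_to_32_out_iterator from P2728R0. It only shows that a transcoding sink has to buffer a partial sequence, and it sketches one possible end-of-input policy, an explicit finish() call, as an assumption for discussion. Validation of lead and continuation bytes is omitted.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical sink that wraps an output iterator of code points.
template <class Out>
class utf8_to_utf32_sink {
public:
    explicit utf8_to_utf32_sink(Out out) : out_(out) {}

    // Accepts one UTF-8 code unit. A code point is written only when a
    // sequence completes; until then nothing reaches the wrapped iterator.
    // (Byte validation is omitted for brevity.)
    void put(unsigned char b) {
        assert(remaining_ >= 0 && remaining_ <= 3);
        if (remaining_ == 0) {
            if (b < 0x80)       { *out_++ = static_cast<char32_t>(b); }
            else if (b >= 0xF0) { partial_ = b & 0x07; remaining_ = 3; }
            else if (b >= 0xE0) { partial_ = b & 0x0F; remaining_ = 2; }
            else                { partial_ = b & 0x1F; remaining_ = 1; }
        } else {
            partial_ = (partial_ << 6) | (b & 0x3F);
            if (--remaining_ == 0) *out_++ = partial_;
        }
    }

    // True while code units have been consumed without producing output.
    bool pending() const { return remaining_ != 0; }

    // Assumed policy: a dangling partial sequence becomes U+FFFD.
    // Returns true if the input ended on a sequence boundary.
    bool finish() {
        if (!pending()) return true;
        *out_++ = U'\uFFFD';
        partial_ = 0;
        remaining_ = 0;
        return false;
    }

private:
    Out out_;
    char32_t partial_ = 0;
    int remaining_ = 0;
};
```

If the caller never calls finish(), the buffered code units are silently dropped, which is the behavior the agenda item asks about. An iterator-shaped interface has no natural place for such a call, short of doing the work in a destructor.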
> Candidate polls:
> 1. UTF transcoding interfaces provided by the C++ standard library
> should operate on charN_t types with support for other types provided
> by adapters.
> 2. The association of a UTF-8 encoding with a sequence of char must be
> explicit in the source code unless the literal encoding is UTF-8.
> 3. The association of a UTF-16 or UTF-32 encoding with a sequence of
> wchar_t must be explicit in the source code unless the wide literal
> encoding is UTF-16 or UTF-32.
> 4. char32_t should be used as the Unicode code point type within the
> C++ standard library.
> 5. Low level transcoding facilities (WG14 N3095
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm>) suffice
> for high speed handling of segmented data structures with contiguous
> segments; high level facilities can rely on iterators to abstract such
> structures.
> 6. *M*x*N* conversions where *M* is larger than *N* (e.g., UTF-8 ->
> UTF-32) shall be performed by view/iterator input adapters, not by output
> adapters.
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2023-04-12 19:36:58