sg16: Re: [SG16] Agenda for the 2021-03-24 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 19 Mar 2021 09:47:55 +0100

On Thu, Mar 18, 2021 at 10:35 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 3/18/21 5:32 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Wed, Mar 17, 2021 at 2:22 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 3/17/21 5:23 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Tue, Mar 16, 2021 at 3:59 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> SG16 will hold a telecon on Wednesday, March 24th at 19:30 UTC (timezone
>>> conversion
>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210324T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cet>
>>> ).
>>>
>>> *For participants in North America, please note that daylight savings
>>> time went into effect this past weekend, so this telecon will start one
>>> hour later than our last telecon (Mexico doesn't observe DST until April
>>> 4th).*
>>>
>>> The agenda is:
>>>
>>> - Continue discussion from the last telecon concerning:
>>> - D2314R1: Character sets and encodings
>>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>
>>> - D2297R1: Wording improvements for encodings and character sets
>>> <https://isocpp.org/files/papers/D2297R1.pdf>
>>> - Discuss priorities and goals for C++23.
>>>
>>> For D2314R1
>>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html> and
>>> D2297R1 <https://isocpp.org/files/papers/D2297R1.pdf>, discussion will
>>> be limited to new information that might help to break the stalemate
>>> regarding use of an abstract character set or UCS scalar values as the
>>> specification tool for describing translation. If consensus is not
>>> reached, we'll poll forwarding D2314R1
>>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html> with
>>> direction that EWG and/or CWG choose the wording mechanism.
>>>
>>> Per P1000 <https://wg21.link/p1000>, papers targeting C++23 must be
>>> forwarded by EWG/LEWG to CWG/LWG by the February, 2022 meeting (Portland).
>>> However, the deadline for initial papers proposing new language features is
>>> ~November, 2021. Time is running short, and competition for time in
>>> EWG/LEWG will increase.
>>>
>>> The following lists the current state of SG16 related papers and our
>>> C++23 effort to date. This is presented as food for thought. What story
>>> does this tell? How will that story be received by the C++ community?
>>> What should we do with our remaining time to either strengthen or change
>>> that story? What can we realistically do to bring more direct benefits to
>>> the C++ community? It may be interesting to review what we were
>>> thinking about during our March 13th, 2019 telecon
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2019.md#march-13th-2019>
>>> .
>>>
>>> These papers have been accepted for C++23:
>>>
>>> - P2029 <https://wg21.link/p2029>: Proposed resolution for core
>>> issues 411, 1656, and 2333; numeric and universal character escapes in
>>> character and string literals
>>>
>>> These papers have been approved by EWG and are in the pipeline for CWG:
>>>
>>> - P1949 <https://wg21.link/p1949>: C++ Identifier Syntax using
>>> Unicode Standard Annex 31
>>> - P2201 <https://wg21.link/p2201>: Mixed string literal concatenation
>>> - P2223 <https://wg21.link/p2223>: Trimming whitespaces before line
>>> splicing
>>>
>>> These papers have been approved by SG16 and are in the pipeline for
>>> EWG/LEWG:
>>>
>>> - P1885 <https://wg21.link/p1885>: Naming Text Encodings to
>>> Demystify Them
>>> - P2093 <https://wg21.link/p2093>: Formatted output
>>> - P2246 <https://wg21.link/p2246>: Character encoding of diagnostic
>>> text
>>> - P2316 <https://wg21.link/p2316>: Consistent character literal
>>> encoding
>>>
>>> These papers are in the pipeline for EWG/LEWG, but require a revision to
>>> make progress:
>>>
>>> - P2071 <https://wg21.link/p2071>: Named universal character escapes
>>>
>>>
>> I would like us to make progress on that! Afaik there isn't a lot of work
>> remaining, right?
>>
>> I need to review notes, but from what I remember, only minor updates are
>> needed to the paper; doing that is on my plate and it is realistic that I
>> could get to it soon.
>>
>> Implementing it in a compiler would help to reduce some concerns. I'm
>> afraid I won't have time to do that for a while though.
>>
>>
>>
>>>
>>>
>>> These papers are currently active in SG16:
>>>
>>> - D2314R1
>>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>:
>>> Character sets and encodings
>>> - D2297R1 <https://isocpp.org/files/papers/D2297R1.pdf>: Wording
>>> improvements for encodings and character sets
>>>
>>> With that summary of what we have been doing above in mind, the
>>> following lists provide some options for what we could work on next.
>>>
>>> These are existing papers available for SG16 to prioritize: (Some of
>>> these, such as P1629, are awaiting revisions).
>>>
>>> - P1628 <https://wg21.link/p1628>: Unicode character properties
>>>
>>> As the author I do not expect to do further work on this in the 23 cycle
>>
>> That matches my expectations, thanks for confirming.
>>
>>
>>> - P1629 <https://wg21.link/p1629>: Standard Text Encoding
>>> - P1729 <https://wg21.link/p1729>: Text Parsing
>>> - P1859 <https://wg21.link/p1859>: Standard terminology for
>>> execution character set encodings
>>>
>> This is mostly superseded by 2314/2297 - we should make sure the
>> directions are consistent
>>
>>
>>>
>>> - P1953 <https://wg21.link/p1953>: Unicode Identifiers And Reflection
>>>
>>> This is ending progress in SG-7
>>
>> That isn't the effect I would expect this paper to have on SG-7.
>> "pending" on the other hand... ;)
>>
>>
>>
>>>
>>> - P2295 <https://wg21.link/p2295>: Correct UTF-8 handling during
>>> phase 1 of translation
>>>
>>> Expect a revision of that soon
>>
>>>
>>>
>>> And finally, here are some ideas that have been discussed, but that we
>>> do not currently have papers covering:
>>>
>>> - UTF-8 as a portable source file encoding (the paper Tom started
>>> and has long intended to complete).
>>>
>>> See also P2295
>>
>> Yes, clearly related.
>>
>>
>>
>>>
>>> - Requiring wchar_t to represent all members of the execution wide
>>> character set does not match existing practice
>>> <https://github.com/sg16-unicode/sg16/issues/9>
>>>
>>> Please let's investigate that!
>>
>> I had started a paper on this a while back. Yet another unfinished
>> paper. I'd like to see this done, but it will have no meaningful impact on
>> the C++ community, so we should consider that when prioritizing.
>>
>>
>>> - WG21 P1854: Source to Execution encoding conversion should not
>>> lead to loss of information
>>> <https://github.com/sg16-unicode/sg16/issues/50>
>>>
>>> Expect further work on that in the coming months
>>
>>>
>>> - Deprecate std::regex
>>> <https://github.com/sg16-unicode/sg16/issues/57>
>>> - Make wide multicharacter character literals ill-formed
>>> <https://github.com/sg16-unicode/sg16/issues/65>
>>>
>>> I'll write a paper
>>
>>>
>>> - Improve portable ingestion of command-line arguments
>>> <https://github.com/sg16-unicode/sg16/issues/66>
>>> - Alias barriers; a replacement for the ICU hack
>>> <https://github.com/sg16-unicode/sg16/issues/67>
>>>
>> This seems very important - the char8_t adoption story isn't great right
>> now.
>>
>> I agree, and providing this would be useful for the story we tell, but I
>> suspect it won't impact actual adoption.
>>
>
> Speaking of which, could we possibly support format with utf format
> strings in the 23 cycle? I don't think it would be that much work
>
> Ah, yes, I meant to include that in the list above, but forgot. I agree
> it shouldn't be a lot of work to specify. The (possibly) hard part will be
> finding consensus on what conversions should be performed.
>
> There are two distinct concerns:
>
> 1. If UTF strings are allowed as format strings, what conversions are
> performed on char and wchar_t based field arguments?
> std::string s = ...;
> std::format(u"{}", s);
> 2. If UTF strings are allowed as field arguments, what conversions are
> performed when the format string is char or wchar_t based?
> std::u16string s = ...;
> std::format("{}", s);
>
I think the first concern is much more tractable than the second one.
We just need to convince ourselves that any string literal can be
interpreted by the execution encoding :)

I am not sure 2. _ever_ makes sense: we decided that std::format("") operates
on bytes rather than text, so how would you represent text in bytes?
The conversion is also not lossless if we assume the format string is in the
execution encoding.
I think "if you want UTF, use the std::format(u8"{}") overload" is a nice
approach!
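
To illustrate why the UTF-to-UTF direction is the easy case: the conversion
is total and needs no locale. Below is a minimal sketch of UTF-16 to UTF-8
transcoding. The helper names (append_utf8, utf16_to_utf8) are hypothetical,
std::string is used to hold UTF-8 code units so the sketch does not depend on
char8_t, and replacing unpaired surrogates with U+FFFD is an assumed policy,
not something decided in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one Unicode scalar value to out.
// Hypothetical helper, for illustration only.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Transcodes UTF-16 to UTF-8 without consulting any locale.
// Assumed policy: an unpaired surrogate becomes U+FFFD.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool lead = cp >= 0xD800 && cp <= 0xDBFF;
        if (lead && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one scalar value.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Going the other way, from UTF to a non-UTF execution encoding, has no such
total mapping, which is where the loss of information comes from.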

>
>
> The answers to those questions may be dependent on:
>
> - The literal encoding selected at compile-time.
> - The locale dependent system encoding selected at run-time.
>
> I filed the following github issue to track this:
>
> - Support for UTF encodings in std::format() and std::print()
> <https://github.com/sg16-unicode/sg16/issues/68>
>
> The model I've been leaning towards is (brutally) detailed below. The
> goal is to avoid locale dependent encoding where literal encoding choices
> (reasonably) rule out locale dependent run-time encodings.
>
> - If the format string is UTF-based, then:
>   - char8_t, char16_t, and char32_t based field arguments are
>     converted to the (UTF) encoding of the format string (not locale
>     dependent).
>   - char based field arguments are converted as follows:
>     - If the literal encoding is UTF-8, conversion is from UTF-8 to
>       the (UTF) encoding of the format string (not locale dependent).
>     - Otherwise, conversion is as if by, for example, mbrtoc16()
>       (locale dependent).
>   - wchar_t based field arguments are converted as follows:
>     - If the wide literal encoding is a UTF encoding, conversion is
>       from that (UTF) encoding to the (UTF) encoding of the format
>       string (not locale dependent).
>     - Otherwise, conversion is as if by, for example, wcrtoc16() (if
>       such a conversion function existed; locale dependent).
> - Otherwise, if the format string is char based:
>   - If the literal encoding is a UTF encoding:
>     - char8_t, char16_t, and char32_t based field arguments are
>       converted to that (UTF) encoding (not locale dependent).
>     - wchar_t based field arguments are converted as follows:
>       - If the wide literal encoding is a UTF encoding, conversion
>         is from that (UTF) encoding to the (UTF) encoding of the
>         format string (not locale dependent).
>       - Otherwise, conversion is as if by, for example, a char
>         based wcrtoc8() (if such a conversion function existed;
>         locale dependent).
>   - Otherwise:
>     - char8_t, char16_t, and char32_t based field arguments are
>       converted as if by, for example, c16rtomb() (locale dependent).
>     - wchar_t based field arguments are converted as follows:
>       - If the wide literal encoding is a UTF encoding, conversion
>         is as if by, for example, a wchar_t based c16rtomb() (if such
>         a conversion function existed; locale dependent).
>       - Otherwise, conversion is as if by wcrtomb() (locale
>         dependent).
> - Otherwise (the format string is wchar_t based):
>   - If the wide literal encoding is a UTF encoding:
>     - char8_t, char16_t, and char32_t based field arguments are
>       converted to that (UTF) encoding (not locale dependent).
>     - char based field arguments are converted as follows:
>       - If the literal encoding is a UTF encoding, conversion is
>         from that (UTF) encoding to the (UTF) encoding of the format
>         string (not locale dependent).
>       - Otherwise, conversion is as if by, for example, a wchar_t
>         based mbrtoc16() (if such a conversion function existed;
>         locale dependent).
>   - Otherwise:
>     - char8_t, char16_t, and char32_t based field arguments are
>       converted as if by, for example, c16rtowc() (if such a
>       conversion function existed; locale dependent).
>     - char based field arguments are converted as follows:
>       - If the literal encoding is a UTF encoding, conversion is as
>         if by, for example, a char based c8rtowc() (if such a
>         conversion function existed; locale dependent).
>       - Otherwise, conversion is as if by mbrtowc() (locale
>         dependent).
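
The model above can be condensed into a small decision function. This is
only a sketch of one reading of the list: the names CharKind, Conversion,
LiteralEncodings, is_utf and classify are hypothetical and are not proposed
API. It groups char8_t, char16_t and char32_t into one "Utf" kind, and it
assumes that a char format string with char arguments (or wchar_t with
wchar_t) needs no conversion, a case the list does not spell out.

```cpp
#include <cassert>

// Hypothetical names for illustration; none of these are proposed API.
enum class CharKind { Char, WChar, Utf };  // Utf: char8_t/char16_t/char32_t

enum class Conversion {
    None,            // same non-UTF kind on both sides (assumed, not in the list)
    Transcode,       // UTF to UTF; not locale dependent
    LocaleDependent  // as if by mbrtoc16(), c16rtomb(), wcrtomb(), mbrtowc(), ...
};

struct LiteralEncodings {
    bool narrow_is_utf;  // ordinary literal encoding is a UTF encoding
    bool wide_is_utf;    // wide literal encoding is a UTF encoding
};

// True when strings of kind k are known at compile time to be UTF encoded.
constexpr bool is_utf(CharKind k, LiteralEncodings e) {
    return k == CharKind::Utf
        || (k == CharKind::Char && e.narrow_is_utf)
        || (k == CharKind::WChar && e.wide_is_utf);
}

// Locale dependence is avoided only when both sides are known to be UTF.
constexpr Conversion classify(CharKind fmt, CharKind arg, LiteralEncodings e) {
    return (fmt == arg && fmt != CharKind::Utf) ? Conversion::None
         : (is_utf(fmt, e) && is_utf(arg, e))   ? Conversion::Transcode
                                                : Conversion::LocaleDependent;
}

// UTF format string, char argument, UTF-8 literal encoding.
static_assert(classify(CharKind::Utf, CharKind::Char,
                       LiteralEncodings{true, false}) == Conversion::Transcode,
              "UTF-8 literal encoding avoids the locale");
// char format string, char16_t argument, non-UTF literal encoding.
static_assert(classify(CharKind::Char, CharKind::Utf,
                       LiteralEncodings{false, false}) == Conversion::LocaleDependent,
              "as if by c16rtomb()");
```

Written this way, every branch of the list reduces to one rule: the
conversion is locale independent exactly when both the format string and
the argument are known to be UTF encoded.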
>
> Tom.
>
>
>
>> Tom.
>>
>>
>>>
>>> Our efforts will need to be balanced with any effort expended to align
>>> C23 with changes made for C++20 and C++23:
>>>
>>> - WG14 N2231: char8_t: A type for UTF-8 characters and strings
>>> <https://github.com/sg16-unicode/sg16/issues/5>
>>> - WG14: Make char16_t/char32_t string literals be UTF-16/32
>>> <https://github.com/sg16-unicode/sg16/issues/54>
>>> - WG14: Improve support for Unicode characters in identifiers
>>> <https://github.com/sg16-unicode/sg16/issues/56>
>>> - WG14: numerical & universal character escapes in char & string
>>> literals <https://github.com/sg16-unicode/sg16/issues/63>
>>> - WG14: Trimming whitespace before line splicing
>>> <https://github.com/sg16-unicode/sg16/issues/64>
>>>
>>> Tom.
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>
>>
>
>

Received on 2021-03-19 03:48:10