sg16: Re: [SG16] Agenda for the 2021-03-24 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 18 Mar 2021 17:35:38 -0400

On 3/18/21 5:32 AM, Corentin Jabot via SG16 wrote:
>
>
> On Wed, Mar 17, 2021 at 2:22 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 3/17/21 5:23 AM, Corentin Jabot wrote:
>>
>>
>> On Tue, Mar 16, 2021 at 3:59 PM Tom Honermann via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>
>> SG16 will hold a telecon on Wednesday, March 24th at 19:30
>> UTC (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210324T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cet>).
>>
>> *For participants in North America, please note that daylight
>> savings time went into effect this past weekend, so this
>> telecon will start one hour later than our last telecon
>> (Mexico doesn't observe DST until April 4th).*
>>
>> The agenda is:
>>
>> * Continue discussion from the last telecon concerning:
>> o D2314R1: Character sets and encodings
>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>
>> o D2297R1: Wording improvements for encodings and
>> character sets
>> <https://isocpp.org/files/papers/D2297R1.pdf>
>> * Discuss priorities and goals for C++23.
>>
>> For D2314R1
>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>
>> and D2297R1 <https://isocpp.org/files/papers/D2297R1.pdf>,
>> discussion will be limited to new information that might help
>> to break the stalemate regarding use of an abstract character
>> set or UCS scalar values as the specification tool for
>> describing translation. If consensus is not reached, we'll
>> poll forwarding D2314R1
>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>
>> with direction that EWG and/or CWG choose the wording mechanism.
>>
>> Per P1000 <https://wg21.link/p1000>, papers targeting C++23
>> must be forwarded by EWG/LEWG to CWG/LWG by the February,
>> 2022 meeting (Portland). However, the deadline for initial
>> papers proposing new language features is ~November, 2021.
>> Time is running short, and competition for time in EWG/LEWG
>> will increase.
>>
>> The following lists the current state of SG16 related papers
>> and our C++23 effort to date. This is presented as food for
>> thought. What story does this tell? How will that story be
>> received by the C++ community? What should we do with our
>> remaining time to either strengthen or change that story?
>> What can we realistically do to bring more direct benefits to
>> the C++ community? It may be interesting to review what we
>> were thinking about during our March 13th, 2019 telecon
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2019.md#march-13th-2019>.
>>
>> These papers have been accepted for C++23:
>>
>> * P2029 <https://wg21.link/p2029>: Proposed resolution for
>> core issues 411, 1656, and 2333; numeric and universal
>> character escapes in character and string literals
>>
>> These papers have been approved by EWG and are in the
>> pipeline for CWG:
>>
>> * P1949 <https://wg21.link/p1949>: C++ Identifier Syntax
>> using Unicode Standard Annex 31
>> * P2201 <https://wg21.link/p2201>: Mixed string literal
>> concatenation
>> * P2223 <https://wg21.link/p2223>: Trimming whitespaces
>> before line splicing
>>
>> These papers have been approved by SG16 and are in the
>> pipeline for EWG/LEWG:
>>
>> * P1885 <https://wg21.link/p1885>: Naming Text Encodings to
>> Demystify Them
>> * P2093 <https://wg21.link/p2093>: Formatted output
>> * P2246 <https://wg21.link/p2246>: Character encoding of
>> diagnostic text
>> * P2316 <https://wg21.link/p2316>: Consistent character
>> literal encoding
>>
>> These papers are in the pipeline for EWG/LEWG, but require a
>> revision to make progress:
>>
>> * P2071 <https://wg21.link/p2071>: Named universal
>> character escapes
>>
>>
>> I would like us to make progress on that! Afaik there isn't a lot
>> of work remaining, right?
>
> I need to review notes, but from what I remember, only minor
> updates are needed to the paper; doing that is on my plate and it
> is realistic that I could get to it soon.
>
> Implementing it in a compiler would help to reduce some concerns.
> I'm afraid I won't have time to do that for a while though.
>
>>
>> These papers are currently active in SG16:
>>
>> * D2314R1
>> <https://wiki.edg.com/pub/Wg21telecons2021/SG16/d2314r1.html>:
>> Character sets and encodings
>> * D2297R1 <https://isocpp.org/files/papers/D2297R1.pdf>:
>> Wording improvements for encodings and character sets
>>
>> With that summary of what we have been doing above in mind,
>> the following lists provide some options for what we could
>> work on next.
>>
>> These are existing papers available for SG16 to prioritize:
>> (Some of these, such as P1629, are awaiting revisions).
>>
>> * P1628 <https://wg21.link/p1628>: Unicode character properties
>>
>> As the author I do not expect to do further work on this in the
>> 23 cycle
> That matches my expectations, thanks for confirming.
>>
>> * P1629 <https://wg21.link/p1629>: Standard Text Encoding
>> * P1729 <https://wg21.link/p1729>: Text Parsing
>> * P1859 <https://wg21.link/p1859>: Standard terminology for
>> execution character set encodings
>>
>> This is mostly superseded by 2314/2297 - we should make sure the
>> directions are consistent
>>
>> * P1953 <https://wg21.link/p1953>: Unicode Identifiers And
>> Reflection
>>
>> This is ending progress in SG-7
>
> That isn't the effect I would expect this paper to have on SG-7.
> "pending" on the other hand... ;)
>
>> * P2295 <https://wg21.link/p2295>: Correct UTF-8 handling
>> during phase 1 of translation
>>
>> Expect a revision of that soon
>>
>>
>> And finally, here are some ideas that have been discussed,
>> but that we do not currently have papers covering:
>>
>> * UTF-8 as a portable source file encoding (the paper Tom
>> started and has long intended to complete).
>>
>> See also P2295
> Yes, clearly related.
>>
>> * Requiring wchar_t to represent all members of the
>> execution wide character set does not match existing
>> practice <https://github.com/sg16-unicode/sg16/issues/9>
>>
>> Please let's investigate that!
> I had started a paper on this a while back. Yet another
> unfinished paper. I'd like to see this done, but it will have no
> meaningful impact on the C++ community, so we should consider that
> when prioritizing.
>>
>> * WG21 P1854: Source to Execution encoding conversion
>> should not lead to loss of information
>> <https://github.com/sg16-unicode/sg16/issues/50>
>>
>> Expect further work on that in the coming months
>>
>> * Deprecate std::regex
>> <https://github.com/sg16-unicode/sg16/issues/57>
>> * Make wide multicharacter character literals ill-formed
>> <https://github.com/sg16-unicode/sg16/issues/65>
>>
>> I'll write a paper
>>
>>
>> * Improve portable ingestion of command-line arguments
>> <https://github.com/sg16-unicode/sg16/issues/66>
>> * Alias barriers; a replacement for the ICU hack
>> <https://github.com/sg16-unicode/sg16/issues/67>
>>
>> This seems very important - the char8_t adoption story isn't
>> great right now.
>
> I agree, and providing this would be useful for the story we tell,
> but I suspect won't impact actual adoption.
>
>
> Speaking of which, could we possibly support format with utf format
> strings in the 23 cycle? I don't think it would be that much work

Ah, yes, I meant to include that in the list above, but forgot. I agree
it shouldn't be a lot of work to specify. The (possibly) hard part will
be finding consensus on what conversions should be performed.

There are two distinct concerns:

1. If UTF strings are allowed as format strings, what conversions are
    performed on char and wchar_t based field arguments?
    std::string s = ...;
    std::format(u"{}", s);
2. If UTF strings are allowed as field arguments, what conversions are
    performed when the format string is char or wchar_t based?
    std::u16string s = ...;
    std::format("{}", s);

The answers to those questions may be dependent on:

  * The literal encoding selected at compile-time.
  * The locale dependent system encoding selected at run-time.

I filed the following github issue to track this:

  * Support for UTF encodings in std::format() and std::print()
    <https://github.com/sg16-unicode/sg16/issues/68>

The model I've been leaning towards is (brutally) detailed below. The
goal is to avoid locale dependent encoding where literal encoding
choices (reasonably) rule out locale dependent run-time encodings.

  * If the format string is UTF-based, then:
      o char8_t, char16_t, and char32_t based field arguments are
        converted to the (UTF) encoding of the format string (not locale
        dependent).
      o char based field arguments are converted as follows:
          + If the literal encoding is UTF-8, conversion is from UTF-8
            to the (UTF) encoding of the format string (not locale
            dependent).
          + Otherwise, conversion is as if by, for example, mbrtoc16()
            (locale dependent).
      o wchar_t based field arguments are converted as follows:
          + If the wide literal encoding is a UTF encoding, conversion
            is from that (UTF) encoding to the (UTF) encoding of the
            format string (not locale dependent).
          + Otherwise, conversion is as if by, for example, wcrtoc16()
            (if such a conversion function existed; locale dependent).
  * Otherwise, if the format string is char based:
      o If the literal encoding is a UTF encoding:
          + char8_t, char16_t, and char32_t based field arguments are
            converted to that (UTF) encoding (not locale dependent).
          + wchar_t based field arguments are converted as follows:
              # If the wide literal encoding is a UTF encoding,
                conversion is from that (UTF) encoding to the (UTF)
                encoding of the format string (not locale dependent).
              # Otherwise, conversion is as if by, for example, a char
                based wcrtoc8() (if such a conversion function existed;
                locale dependent).
      o Otherwise:
          + char8_t, char16_t, and char32_t based field arguments are
            converted as if by, for example, c16rtomb() (locale dependent).
          + wchar_t based field arguments are converted as follows:
              # If the wide literal encoding is a UTF encoding,
                conversion is as if by, for example, a wchar_t based
                c16rtomb() (if such a conversion function existed;
                locale dependent).
              # Otherwise, conversion is as if by wcrtomb() (locale
                dependent).
  * Otherwise (the format string is wchar_t based):
      o If the wide literal encoding is a UTF encoding:
          + char8_t, char16_t, and char32_t based field arguments are
            converted to that (UTF) encoding (not locale dependent).
          + char based field arguments are converted as follows:
              # If the literal encoding is a UTF encoding, conversion is
                from that (UTF) encoding to the (UTF) encoding of the
                format string (not locale dependent).
              # Otherwise, conversion is as if by, for example, a
                wchar_t based mbrtoc16() (if such a conversion function
                existed; locale dependent).
      o Otherwise:
          + char8_t, char16_t, and char32_t based field arguments are
            converted as if by, for example, c16rtowc() (if such a
            conversion function existed; locale dependent).
          + char based field arguments are converted as follows:
              # If the literal encoding is a UTF encoding, conversion is
                as if by, for example, a char based c8rtowc() (if such a
                conversion function existed; locale dependent).
              # Otherwise, conversion is as if by mbrtowc() (locale
                dependent).
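
As a rough illustration only (not proposed wording), the model above seems
to reduce to one rule: a conversion avoids the locale only when both the
format string and the field argument are known at compile-time to be UTF
encoded. The sketch below encodes that rule. All names in it (CharKind,
LiteralEncodings, Conversion, classify) are hypothetical, and treating
char-with-char and wchar_t-with-wchar_t as needing no conversion is an
assumption, since the model does not list those cases.

```cpp
#include <cassert>

// Hypothetical names throughout; this is a sketch, not a proposed API.

// Code unit type of a format string or field argument.
// Utf stands for any of char8_t, char16_t, or char32_t.
enum class CharKind { Char, WChar, Utf };

// Literal encodings selected at compile-time.
struct LiteralEncodings {
    bool narrow_is_utf;  // ordinary literal encoding is UTF-8
    bool wide_is_utf;    // wide literal encoding is UTF-16 or UTF-32
};

enum class Conversion { None, LocaleIndependent, LocaleDependent };

// True if text of this kind is known at compile-time to be UTF encoded.
constexpr bool is_utf(CharKind kind, LiteralEncodings enc) {
    switch (kind) {
        case CharKind::Utf:   return true;
        case CharKind::Char:  return enc.narrow_is_utf;
        case CharKind::WChar: return enc.wide_is_utf;
    }
    return false;
}

// Classify the conversion applied to a field argument.
constexpr Conversion classify(CharKind format, CharKind argument,
                              LiteralEncodings enc) {
    // Assumption: matching char or wchar_t types need no conversion.
    // (Utf with Utf may still differ in width, e.g. char8_t to char16_t.)
    if (format == argument && format != CharKind::Utf)
        return Conversion::None;
    return (is_utf(format, enc) && is_utf(argument, enc))
               ? Conversion::LocaleIndependent
               : Conversion::LocaleDependent;
}

constexpr LiteralEncodings utf8_narrow_only{true, false};
constexpr LiteralEncodings no_utf_literals{false, false};

// UTF format string, char argument, UTF-8 literal encoding.
static_assert(classify(CharKind::Utf, CharKind::Char, utf8_narrow_only)
              == Conversion::LocaleIndependent, "");
// char format string, UTF argument, non-UTF literal encoding.
static_assert(classify(CharKind::Char, CharKind::Utf, no_utf_literals)
              == Conversion::LocaleDependent, "");
```

If this reading is right, the nested cases above are just an enumeration
of that single rule, plus a choice of which (possibly nonexistent)
conversion function expresses each locale dependent case.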

Tom.

> Tom.
>
>> Our efforts will need to be balanced with any effort expended
>> to align C23 with changes made for C++20 and C++23:
>>
>> * WG14 N2231: char8_t: A type for UTF-8 characters and
>> strings <https://github.com/sg16-unicode/sg16/issues/5>
>> * WG14: Make char16_t/char32_t string literals be UTF-16/32
>> <https://github.com/sg16-unicode/sg16/issues/54>
>> * WG14: Improve support for Unicode characters in
>> identifiers <https://github.com/sg16-unicode/sg16/issues/56>
>> * WG14: numerical & universal character escapes in char &
>> string literals
>> <https://github.com/sg16-unicode/sg16/issues/63>
>> * WG14: Trimming whitespace before line splicing
>> <https://github.com/sg16-unicode/sg16/issues/64>
>>
>> Tom.
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
>

Received on 2021-03-18 16:35:43