On Thu, Mar 18, 2021 at 10:35 PM Tom Honermann <tom@honermann.net> wrote:
On 3/18/21 5:32 AM, Corentin Jabot via SG16 wrote:


On Wed, Mar 17, 2021 at 2:22 PM Tom Honermann <tom@honermann.net> wrote:
On 3/17/21 5:23 AM, Corentin Jabot wrote:


On Tue, Mar 16, 2021 at 3:59 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

SG16 will hold a telecon on Wednesday, March 24th at 19:30 UTC (timezone conversion).

For participants in North America, please note that daylight saving time went into effect this past weekend, so this telecon will start one hour later than our last telecon (Mexico doesn't observe DST until April 4th).

The agenda is:

For D2314R1 and D2297R1, discussion will be limited to new information that might help to break the stalemate regarding use of an abstract character set or UCS scalar values as the specification tool for describing translation.  If consensus is not reached, we'll poll forwarding D2314R1 with direction that EWG and/or CWG choose the wording mechanism.

Per P1000, papers targeting C++23 must be forwarded by EWG/LEWG to CWG/LWG by the February, 2022 meeting (Portland).  However, the deadline for initial papers proposing new language features is ~November, 2021.  Time is running short, and competition for time in EWG/LEWG will increase.

The following lists the current state of SG16 related papers and our C++23 effort to date.  This is presented as food for thought.  What story does this tell?  How will that story be received by the C++ community?  What should we do with our remaining time to either strengthen or change that story?  What can we realistically do to bring more direct benefits to the C++ community?  It may be interesting to review what we were thinking about during our March 13th, 2019 telecon.

These papers have been accepted for C++23:

  • P2029: Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals

These papers have been approved by EWG and are in the pipeline for CWG:

  • P1949: C++ Identifier Syntax using Unicode Standard Annex 31
  • P2201: Mixed string literal concatenation
  • P2223: Trimming whitespaces before line splicing

These papers have been approved by SG16 and are in the pipeline for EWG/LEWG:

  • P1885: Naming Text Encodings to Demystify Them
  • P2093: Formatted output
  • P2246: Character encoding of diagnostic text
  • P2316: Consistent character literal encoding

These papers are in the pipeline for EWG/LEWG, but require a revision to make progress:

  • P2071: Named universal character escapes

I would like us to make progress on that! AFAIK there isn't a lot of work remaining, right?

I need to review notes, but from what I remember, only minor updates are needed to the paper; doing that is on my plate and it is realistic that I could get to it soon.

Implementing it in a compiler would help to reduce some concerns.  I'm afraid I won't have time to do that for a while though.

 

These papers are currently active in SG16:

  • D2314R1: Character sets and encodings
  • D2297R1: Wording improvements for encodings and character sets

With that summary of what we have been doing above in mind, the following lists provide some options for what we could work on next.

These are existing papers available for SG16 to prioritize: (Some of these, such as P1629, are awaiting revisions).

  • P1628: Unicode character properties
As the author, I do not expect to do further work on this in the C++23 cycle.
That matches my expectations, thanks for confirming.
  • P1629: Standard Text Encoding
  • P1729: Text Parsing
  • P1859: Standard terminology for execution character set encodings
This is mostly superseded by D2314/D2297; we should make sure the directions are consistent.
 
  • P1953: Unicode Identifiers And Reflection
This is ending progress in SG-7

That isn't the effect I would expect this paper to have on SG-7. "pending" on the other hand... ;)

 
  • P2295: Correct UTF-8 handling during phase 1 of translation
Expect a revision of that soon

And finally, here are some ideas that have been discussed, but that we do not currently have papers covering:

  • UTF-8 as a portable source file encoding (the paper Tom started and has long intended to complete).
See also P2295
Yes, clearly related.
I had started a paper on this a while back.  Yet another unfinished paper.  I'd like to see this done, but it will have no meaningful impact on the C++ community, so we should consider that when prioritizing.

I agree, and providing this would be useful for the story we tell, but I suspect it won't impact actual adoption.


Speaking of which, could we possibly support std::format with UTF format strings in the C++23 cycle? I don't think it would be that much work.

Ah, yes, I meant to include that in the list above, but forgot.  I agree it shouldn't be a lot of work to specify.  The (possibly) hard part will be finding consensus on what conversions should be performed.

There are two distinct concerns:

  1. If UTF strings are allowed as format strings, what conversions are performed on char and wchar_t based field arguments?
    std::string s = ...;
    std::format(u"{}", s);
  2. If UTF strings are allowed as field arguments, what conversions are performed when the format string is char or wchar_t based?
    std::u16string s = ...;
    std::format("{}", s);
I think the first concern is much more tractable than the second.
We just need to convince ourselves that any string literal can be interpreted by the execution encoding :)


I am not sure that 2 _ever_ makes sense: we decided that std::format("") operates on bytes rather than text, so how would you represent text in bytes?
And the conversion is not lossless if we assume the format string is in the execution encoding.
I think "if you want UTF, use the std::format(u8"{}") overload" is a nice approach!
 

The answers to those questions may be dependent on:

  • The literal encoding selected at compile-time.
  • The locale dependent system encoding selected at run-time.
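
As a rough illustration of the run-time dependency, the sketch below converts char data to UTF-16 with std::mbrtoc16(), which interprets its input according to the LC_CTYPE category of the current C locale.  The helper name to_utf16_via_locale is hypothetical and not taken from any paper; the handling of a trailing surrogate and of embedded nulls reflects assumptions noted in the comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper (illustration only): converts `in` to UTF-16 using
// std::mbrtoc16(), so the result depends on the current C locale (LC_CTYPE).
inline std::u16string to_utf16_via_locale(std::string_view in) {
    std::u16string out;
    std::mbstate_t state{};
    const char* p = in.data();
    const char* const end = p + in.size();
    bool pending_low_surrogate = false;

    while (p < end || pending_low_surrogate) {
        char16_t c16{};
        const std::size_t rc =
            std::mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);

        // (size_t)-1: encoding error; (size_t)-2: incomplete sequence.
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2)) {
            throw std::runtime_error("invalid or truncated multibyte sequence");
        }

        out.push_back(c16);

        // (size_t)-3: the second code unit of a surrogate pair was stored and
        // no input was consumed.  A return of 0 means a null character was
        // converted; assume it occupied exactly one byte.
        if (rc != static_cast<std::size_t>(-3)) {
            p += (rc == 0 ? 1 : rc);
        }
        assert(p <= end);

        // After a high surrogate, one more call is needed to fetch the low
        // surrogate, even if all input bytes have already been consumed.
        pending_low_surrogate = (c16 >= 0xD800 && c16 <= 0xDBFF);
    }
    return out;
}
```

The same bytes can therefore yield different results, or an error, depending on which locale the program has selected with std::setlocale(), whereas the literal encoding is fixed when the program is compiled.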

I filed the following GitHub issue to track this:

The model I've been leaning towards is (brutally) detailed below.  The goal is to avoid locale dependent encoding where literal encoding choices (reasonably) rule out locale dependent run-time encodings.

  • If the format string is UTF-based, then:
    • char8_t, char16_t, and char32_t based field arguments are converted to the (UTF) encoding of the format string (not locale dependent).
    • char based field arguments are converted as follows:
      • If the literal encoding is UTF-8, conversion is from UTF-8 to the (UTF) encoding of the format string (not locale dependent).
      • Otherwise, conversion is as if by, for example, mbrtoc16() (locale dependent).
    • wchar_t based field arguments are converted as follows:
      • If the wide literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
      • Otherwise, conversion is as if by, for example, wcrtoc16() (if such a conversion function existed; locale dependent).
  • Otherwise, if the format string is char based:
    • If the literal encoding is a UTF encoding:
      • char8_t, char16_t, and char32_t based field arguments are converted to that (UTF) encoding (not locale dependent).
      • wchar_t based field arguments are converted as follows:
        • If the wide literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
        • Otherwise, conversion is as if by, for example, a char based wcrtoc8() (if such a conversion function existed; locale dependent).
    • Otherwise:
      • char8_t, char16_t, and char32_t based field arguments are converted as if by, for example, c16rtomb() (locale dependent).
      • wchar_t based field arguments are converted as follows:
        • If the wide literal encoding is a UTF encoding, conversion is as if by, for example, a wchar_t based c16rtomb() (if such a conversion function existed; locale dependent).
        • Otherwise, conversion is as if by wcrtomb() (locale dependent).
  • Otherwise (the format string is wchar_t based):
    • If the wide literal encoding is a UTF encoding:
      • char8_t, char16_t, and char32_t based field arguments are converted to that (UTF) encoding (not locale dependent).
      • char based field arguments are converted as follows:
        • If the literal encoding is a UTF encoding, conversion is from that (UTF) encoding to the (UTF) encoding of the format string (not locale dependent).
        • Otherwise, conversion is as if by, for example, a wchar_t based mbrtoc16() (if such a conversion function existed; locale dependent).
    • Otherwise:
      • char8_t, char16_t, and char32_t based field arguments are converted as if by, for example, c16rtowc() (if such a conversion function existed; locale dependent).
      • char based field arguments are converted as follows:
        • If the literal encoding is a UTF encoding, conversion is as if by, for example, a char based c8rtowc() (if such a conversion function existed; locale dependent).
        • Otherwise, conversion is as if by mbrtowc() (locale dependent).
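
One way to read the model above is that a conversion is locale independent exactly when both the format string and the field argument are known at compile time to use a UTF encoding.  The following is a minimal sketch of that reading; all names (Kind, Conversion, LiteralEncodings, known_utf, select_conversion) are hypothetical, and treating same-kind char or wchar_t arguments as needing no conversion is an assumption, since the list above does not cover that case.

```cpp
// Hypothetical names throughout; nothing here comes from a paper or the standard.
enum class Kind { Char, Wide, Utf };  // char, wchar_t, or char8_t/char16_t/char32_t
enum class Conversion { None, LocaleIndependent, LocaleDependent };

struct LiteralEncodings {
    bool ordinary_is_utf;  // the (ordinary) literal encoding is UTF-8
    bool wide_is_utf;      // the wide literal encoding is a UTF encoding
};

// True when text of kind `k` is known at compile time to be UTF encoded.
constexpr bool known_utf(Kind k, LiteralEncodings e) {
    switch (k) {
        case Kind::Utf:  return true;
        case Kind::Char: return e.ordinary_is_utf;
        case Kind::Wide: return e.wide_is_utf;
    }
    return false;
}

constexpr Conversion select_conversion(Kind format, Kind argument,
                                       LiteralEncodings e) {
    // Assumption: char-to-char and wchar_t-to-wchar_t need no conversion.
    if (format == argument && format != Kind::Utf) {
        return Conversion::None;
    }
    return known_utf(format, e) && known_utf(argument, e)
               ? Conversion::LocaleIndependent
               : Conversion::LocaleDependent;
}

// UTF format string, char argument, UTF-8 literal encoding.
static_assert(select_conversion(Kind::Utf, Kind::Char, {true, false}) ==
              Conversion::LocaleIndependent);
// char format string, non-UTF literal encoding, char16_t argument
// (the "as if by c16rtomb()" case).
static_assert(select_conversion(Kind::Char, Kind::Utf, {false, true}) ==
              Conversion::LocaleDependent);
// wchar_t format string, UTF wide literal encoding, char argument with a
// non-UTF literal encoding (the wchar_t based mbrtoc16()-like case).
static_assert(select_conversion(Kind::Wide, Kind::Char, {false, true}) ==
              Conversion::LocaleDependent);
```

The sketch only classifies each case as locale dependent or not; it does not perform any conversion, and the choice of conversion function for each locale dependent case is left as described in the list.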

Tom.