Re: [SG16] P2295R3 Support for UTF-8 as a portable source file encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 30 Apr 2021 08:15:18 +0200
On Fri, Apr 30, 2021, 06:26 Charlie Barto <Charles.Barto_at_[hidden]>
wrote:

> Yeah, that’s what I meant.
>
>
>
> My concern was with the “the scalar value of each source character shall
> be preserved” in the below
>
>
>
> “A UTF-8 file is a source file encoded with the UTF-8 encoding scheme
> defined in ISO/IEC 10646. An implementation shall support UTF-8 files. If
> the source file is determined to be a UTF-8 file, it shall represent a
> well-formed sequence of UTF-8 code units and the scalar value of each
> source character shall be preserved.”
>
>
>
> My concern is that if you write a string literal with Unicode characters
> in it and the compiler converts them to GB18030, that’s not “preserving the
> scalar value”. I don’t understand translation phases very well, so feel free
> to tell me that’s somehow handled later on.
>

There are two steps:

- Source -> Unicode scalar values, encoding unspecified (phase 1)
- Unicode -> literal encoding (GB18030 in your question) (phase 5/6)

There are efforts to make that clearer in the current wording, but this
behavior is not new.
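
As a rough illustration of those two steps (a sketch, not anything taken from P2295R3 or from a real compiler), the code below first decodes UTF-8 bytes into Unicode scalar values and then re-encodes them into a target encoding. The function names `decode_utf8` and `encode_utf16` are hypothetical, the decoder assumes well-formed input, and UTF-16 stands in for the literal encoding only because a GB18030 encoder would need large mapping tables.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Step 1 (phase 1 analogue): UTF-8 bytes -> Unicode scalar values.
// Assumption: the input is well-formed UTF-8; no validation is done here.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (int k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Step 2 (phase 5/6 analogue): scalar values -> "literal encoding".
// UTF-16 is used here purely as a stand-in for GB18030 or any other target.
std::u16string encode_utf16(const std::vector<char32_t>& scalars) {
    std::u16string out;
    for (char32_t cp : scalars) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The point of the split is that the first function never needs to know what the second one targets: the scalar values are the same whichever literal encoding is chosen later.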

The wording you are concerned about covers the first phase only. It is
intended to mean that when the compiler *reads* a source file, it should
not apply any transformation, such as normalization or substitution of
replacement characters.
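
To make "no normalization, no replacement characters" concrete, here is a minimal sketch of a strict reader. It is an assumption about how one could implement the requirement, not a description of any existing compiler; the name `decode_utf8_strict` is hypothetical. Ill-formed input is reported as a failure rather than patched with U+FFFD, and precomposed and decomposed forms come out as different scalar sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical strict decoder: returns false on any ill-formed sequence
// instead of substituting U+FFFD, and never normalizes the result.
bool decode_utf8_strict(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest value allowed for this length
        if (b0 < 0x80)                     { len = 1; cp = b0;        min = 0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;                 // stray continuation or invalid lead byte
        if (in.size() - i < len) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;        // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate, not a scalar value
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

For example, the bytes `C3 A9` decode to the single scalar value U+00E9, while `65 CC 81` decode to U+0065 followed by U+0301; the two results are left distinct even though they are canonically equivalent.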

We have similar wording in another paper for the conversion to literals.


The effect is that with a UTF-8 source file there is some guarantee that
the compiler will preserve the data through translation.


For any other source encoding, the mapping to Unicode is
implementation-defined.

Any other literal encoding conversion is implementation-defined.

In any case, the source file encoding, the encoding in the compiler's memory,
and the encoding of literals (of which there are 5) are separate things.
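
For reference, the five literal encodings correspond to the five kinds of string literal. The sketch below counts code units for the same character (U+4E2D, written as a universal-character-name so the example does not depend on the source file encoding). The helper `code_units` is hypothetical. Only the three UTF results are fixed; the ordinary and wide results depend on the implementation's literal encodings, and the example assumes those encodings can represent U+4E2D at all.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N - 1; }

// Fixed by the language: these three literal kinds are always UTF-8/16/32.
static_assert(code_units(u8"\u4E2D") == 3, "UTF-8 uses three code units");
static_assert(code_units(u"\u4E2D") == 1, "UTF-16 uses one code unit");
static_assert(code_units(U"\u4E2D") == 1, "UTF-32 uses one code unit");

// Implementation-defined: depends on the ordinary and wide literal encodings
// (for instance 3 if the ordinary literal encoding is UTF-8, 2 if it is GB18030).
constexpr std::size_t ordinary_units = code_units("\u4E2D");
constexpr std::size_t wide_units = code_units(L"\u4E2D");
```

None of these counts says anything about how the source file itself was encoded, which is the separation described above.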

Current wording has a tendency to use confusing terms :(



> *From:* Peter Brett <pbrett_at_[hidden]>
> *Sent:* Thursday, April 29, 2021 3:34 AM
> *To:* Charlie Barto <Charles.Barto_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>; sg16_at_[hidden]
> *Subject:* RE: [SG16] P2295R3 Support for UTF-8 as a portable source file
> encoding
>
>
>
> Hi Charlie,
>
>
>
> I’m going to assume that:
>
>
>
> - by ‘source character set’ you mean the encoding scheme of the source
> file
> - by ‘execution character set’ you mean the encoding scheme used for
> ordinary string literals in the compiled executable
>
>
>
> In that case, no – as I understand it this wording does not affect the
> conformance of an implementation where the literal encoding is GB18030.
> Please could you clarify what it was about the phase 1 changes that caused
> concern?
>
>
>
> Thanks!
>
>
>
> Peter
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Charlie Barto
> via SG16
> *Sent:* 29 April 2021 09:54
> *To:* sg16_at_[hidden]
> *Cc:* Charlie Barto <Charles.Barto_at_[hidden]>; Corentin <
> corentin.jabot_at_[hidden]>
> *Subject:* Re: [SG16] P2295R3 Support for UTF-8 as a portable source file
> encoding
>
>
>
> Does that first change to lex.phases make the case where the source character
> set is UTF-8 and the execution character set is some oddball encoding (like
> GB18030) ill-formed (*non-conforming*)?
>
>
>
> ------------------------------
>
> *From:* SG16 <sg16-bounces_at_[hidden]> on behalf of Corentin via
> SG16 <sg16_at_[hidden]>
> *Sent:* Thursday, April 29, 2021 12:34:35 AM
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>
> *Subject:* [SG16] P2295R3 Support for UTF-8 as a portable source file
> encoding
>
>
>
> Per request in yesterday's meeting,
>
> here is P2295R3 Support for UTF-8 as a portable source file encoding.
>
>
>
> I am looking forward to your feedback
>
>
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2295r3.pdf
>

Received on 2021-04-30 01:15:35