Re: [isocpp-core] P2295 Support for UTF-8 as a portable source file encoding

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Fri, 10 Jun 2022 11:05:21 -0400
I've merged the suggestions (add "physical", use the parenthetical for the
non-UTF-8 case, use plural form for designating, have wider-scope
implementation-defined wording for non-UTF-8 case that encompasses the
permission from the parenthetical):

An implementation shall support physical source files that are a sequence
of UTF-8 code units (UTF-8 source files). It may also support an
implementation-defined set of other kinds of physical source files, and, if
so, the kind of a physical source file is determined in an
implementation-defined manner, which includes a means of designating
physical source files as UTF-8 source files, independent of their content.
[Note: In other words, recognizing the U+FEFF Byte Order Mark is not
sufficient. --end note]
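
To illustrate the "independent of their content" requirement, here is a minimal sketch of how an implementation might determine the kind of a physical source file. Everything in it is hypothetical: the `SourceKind` enum, the `determine_source_kind` function, and the `designated_as_utf8` flag (standing in for something like a command-line option) are assumptions for illustration, not part of the proposed wording or of any real compiler.

```cpp
#include <string_view>

// Hypothetical classification of physical source files.
enum class SourceKind { Utf8, ImplementationDefinedOther };

// Hypothetical: `designated_as_utf8` models an out-of-band designation
// (for example, a compiler option). When it is set, the content is never
// inspected, so the designation is independent of the file's bytes.
SourceKind determine_source_kind(bool designated_as_utf8,
                                 std::string_view bytes) {
    if (designated_as_utf8) {
        return SourceKind::Utf8;
    }
    // Assumption: without a designation, this sketch falls back to
    // content sniffing. Recognizing the BOM is permitted as one input,
    // but per the note it cannot be the only means of selecting UTF-8.
    if (bytes.substr(0, 3) == "\xEF\xBB\xBF") {
        return SourceKind::Utf8;
    }
    return SourceKind::ImplementationDefinedOther;
}
```

With this shape, a file without a BOM is still treated as a UTF-8 source file whenever it is designated as one.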

If a physical source file is designated or otherwise determined to be a
UTF-8 source file, then it shall be a well-formed UTF-8 code unit sequence
and it is decoded to produce a sequence of UCS scalar values that
constitutes the sequence of elements of the translation character set. For
any other kind of physical source file supported by the implementation,
characters are mapped, in an implementation-defined manner, to a sequence
of translation character set elements (introducing new-line characters for
end-of-line indicators).
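
As a rough illustration of the UTF-8 branch of that paragraph, the sketch below checks that a code unit sequence is well-formed and decodes it to scalar values. The function name `decode_utf8_source` is hypothetical, and reporting ill-formed input as `std::nullopt` is an assumption of this sketch (an implementation would presumably diagnose it). The sketch gives no special treatment to U+FEFF or to line endings: the result is determined by the code unit sequence alone.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes a UTF-8 code unit sequence into Unicode
// scalar values, or returns std::nullopt if the sequence is ill-formed.
std::optional<std::vector<char32_t>> decode_utf8_source(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        char32_t min = 0;  // smallest value allowed for this length
        if (b0 < 0x80) {
            cp = b0; len = 1; min = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F; len = 2; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; len = 3; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07; len = 4; min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation or invalid lead byte
        }
        if (bytes.size() - i < len) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) {
                return std::nullopt;  // expected a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, values past U+10FFFF, and surrogates,
        // none of which are Unicode scalar values in well-formed UTF-8.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, `decode_utf8_source("\xC3\xA9")` yields the single scalar value U+00E9, while the overlong sequence `"\xC0\x80"` is rejected.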

On Fri, Jun 10, 2022 at 10:42 AM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Fri, Jun 10, 2022 at 4:02 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>> I'm concerned that this approach will be hard to understand for people who
>> have not followed the discussions, on top of preexisting obfuscations (the
>> translation character set indirection).
>> It's also very repetitive but maybe we can massage that a bit.
>> Lastly, I really don't like the "There are no end-of-line indicators
>> apart from the content of the UTF-8 code unit sequence", which is more
>> confusing than enlightening.
>>
>
> This is extremely relevant if you consider that "text" being a sequence of
> characters without structure is not the only way you can look at text.
>
>
>> It's also unfortunate that the utf-8-ness is tied to a medium rather than
>> the content, and that we can't agree that source code is text, or that any
>> textual data consumed by an implementation has an associated encoding.
>>
>
> What we do not seem to agree on is whether or not "text" can be taken as
> structured by lines and the like.
>
> I truly am trying to convey the intent of the paper through to places
> where certain assumptions about the nature of text files do not match the
> native ones. If the wording does not include hooks to point out that
> certain paradigms are not meant to extend into the world of portable, UTF-8
> source code, then we'll likely end up with "UTF-8 source code" that isn't
> portable. It would not be caused by "hostility" from any party, merely a
> failure of the wording to clarify the intent.
>

I guess the wording does cover that intent now. A UTF-8 source file is
defined purely to be the sequence of code units.

Received on 2022-06-10 15:05:50