On Mon, Jun 14, 2021 at 12:44 PM Peter Brett <email@example.com> wrote:
Thank you for all the helpful feedback!
In the wording, I am attempting to draw a distinction between “what is the encoding scheme associated with the file” and “what does the file actually contain.” As an analogy, a C++ source file is still a C++ source file even when it contains syntax errors that prevent it from compiling; for example, deleting the last ‘}’ in a C++ source file doesn’t stop it from being a C++ source file. As another analogy, the encoding scheme associated with a string literal is the literal encoding, but the string literal does not actually have to be valid with respect to that encoding scheme.
I think we need to explicitly state that there is a way for a user to tell the compiler that a source file is UTF-8. This is to make sure that implementations cannot have, “I’ll look at the file contents and guess,” as the only mechanism for determining the encoding. Several SG-16 participants have said that it is absolutely essential to have a way to tell the compiler, “No, I am totally convinced that I am giving you UTF-8 and I want you to produce an error if it isn’t.”
How about saying that then?
The encoding scheme of a source file is determined in an implementation-defined manner. An implementation shall provide a mechanism to determine the encoding of a source file that is independent of its content.
I’m going to tweak the wording to say that we ‘associate’ an encoding with the source file.
I’m then attempting to say that UTF-8 source files actually have to contain UTF-8, and also that there is absolutely no “mapping” involved; the contents of the source file are already ready for phase 2 (i.e. they are *already* in the translation character set).
Finally, I’ve left the wording w.r.t. “anything else” completely unchanged, so that it remains clear that implementations don’t have to change the EBCDIC/ISO-8859-1/Big5 path through phase 1 after this paper is applied.
I agree that this wording definitely contains more words than necessary and could eventually go on a diet, but I’m currently trying to be very clear rather than concise. I don’t mind using as much repetition and/or redundancy as necessary in order to be unambiguous.
Here’s a new proposed wording based on P2314, and I hope you think it is an improvement:
- An encoding scheme is associated with a physical source file in an implementation-defined manner. An implementation shall support the UTF-8 encoding scheme. An implementation shall define a mechanism for specifying that UTF-8 is the encoding scheme associated with a physical source file.
If a physical source file’s associated encoding scheme is UTF-8, then it shall be a well-formed sequence of translation character set elements encoded as UTF-8 code units. [ Note 1: The result of phase 1 is the exact sequence of UCS scalar values present in the file, with no substitutions, modifications or corrections. — end note]
If a physical source file’s associated encoding scheme is not UTF-8, then physical source file characters are mapped, in an implementation-defined manner, to the translation character set (introducing new-line characters for end-of-line indicators). The set of physical source file characters accepted is implementation-defined.
- If the first character is U+FEFF BYTE ORDER MARK, it is deleted. ...
I’m not sure we can cut this down without introducing ambiguities or removing important elements. If you’re still unhappy with this, then I guess we’re stuck. Maybe someone else can have a go.
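To make the UTF-8 path of the proposed wording a little more concrete, here is a rough sketch of what “well-formed, no mapping, BOM deleted” could look like in code. This is only an illustration under my own assumptions, not how any particular compiler implements phase 1: the names decode_utf8_strict and phase1_utf8 are hypothetical, and returning std::nullopt merely stands in for whatever diagnostic an implementation would issue for an ill-formed file.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the paper): decode UTF-8 code units into
// UCS scalar values. Ill-formed input is rejected outright; nothing is
// substituted, modified or corrected.
std::optional<std::vector<char32_t>> decode_utf8_strict(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        char32_t min = 0;     // smallest value allowed for this sequence length
        std::size_t len = 0;  // number of code units in this sequence
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                       // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;   // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                  // out of range
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical phase 1 for a file whose associated encoding scheme is UTF-8:
// strict decode, then delete a leading U+FEFF if present.
std::optional<std::vector<char32_t>> phase1_utf8(std::string_view bytes) {
    auto scalars = decode_utf8_strict(bytes);
    if (scalars && !scalars->empty() && scalars->front() == char32_t{0xFEFF}) {
        scalars->erase(scalars->begin());
    }
    return scalars;
}
```

The point of the sketch is that the decision to call phase1_utf8 is made before looking at the bytes (the encoding scheme is associated with the file by some external mechanism), and an ill-formed file is an error rather than a cue to guess a different encoding.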