On Tue, Jun 9, 2020 at 9:03 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Wed, 10 Jun 2020 at 01:39, Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, Jun 9, 2020 at 7:12 PM Steve Downey <sdowney@gmail.com> wrote:
While I understand what you are asking for, and I agree it doesn't seem unreasonable, I don't see how it works with the machinery we have today.
I am not saying that the C++ wording today works for this by the letter (except for heavy-handed interpretations of phase 1). I consider it to be a bug that it doesn't.

All characters outside the basic source character set are mapped to universal-character-names that are named by Unicode scalar values.
We'd need a mechanism to get back to the completely untranslated original source.

I think we have that mechanism already.
We have a mapping source -> universal-character-names (which for your interest is specified both by IBM and Unicode), and the universal-character-names -> execution mapping, which again is fully specified.
I think that is enough to do, if desired, a direct source -> execution conversion that is byte-preserving, as it is not observable whether the intermediate step was done or not.
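
A minimal sketch of the claim above, not anything taken from the thread itself. It assumes Python's standard `cp037` codec as a stand-in for the EBCDIC mapping (the thread names no specific code page), and `via_unicode` is a hypothetical helper name. It checks that going source -> Unicode scalar values -> execution encoding reproduces every byte, so a direct byte-preserving copy would be indistinguishable.

```python
# Assumption: code page 037 stands in for "EBCDIC"; the thread names no code page.
EBCDIC = "cp037"


def via_unicode(source_bytes: bytes) -> bytes:
    """Two-step conversion: source bytes -> Unicode scalar values -> execution bytes."""
    scalars = source_bytes.decode(EBCDIC)   # source -> Unicode scalar values
    return scalars.encode(EBCDIC)           # Unicode scalar values -> execution encoding


all_bytes = bytes(range(256))

# If every byte value survives the two-step path, a direct copy of the
# source bytes produces the same result as the two-step conversion.
print(via_unicode(all_bytes) == all_bytes)
```

This only shows that the bytes are preserved for this one code page; it says nothing about whether the intermediate Unicode values keep the meaning of the EBCDIC control characters, which is the concern raised in the reply.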

It is round-trippable but at the cost of one-way (during compilation) conversions that are not semantically preserving. Even these are justifiable, but I think they deserve to be called out. Which is to say that the paper should document that these concerns were considered and not simply dismiss the issue.

'\u0096' becoming '\x36': I suppose this could be justified for the case where the user application is expected to have its output subjected to automatic conversion, e.g., via SSH to a non-EBCDIC terminal.

For the much rarer case of u'<0x36>' (character literal that, in the physical source file, contains the EBCDIC control character) becoming u'\u0096': I suppose this could be justified for the case where the user source was originally non-EBCDIC, but subjected to conversion into EBCDIC.
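
The two cases above can be sketched as follows. This is an illustration under assumptions, not the behavior of any particular compiler: Python's standard `cp037` codec stands in for the EBCDIC mapping, and the constant names are made up for the example.

```python
# Assumption: code page 037 stands in for "EBCDIC"; names below are illustrative only.
EBCDIC = "cp037"

C1_U0096 = "\u0096"          # the C1 control character written as \u0096 in source
EBCDIC_0X36 = bytes([0x36])  # the EBCDIC control character at byte value 0x36

# Case 1: a UCN for the C1 control, encoded for an EBCDIC execution character set.
as_ebcdic = C1_U0096.encode(EBCDIC)
print(as_ebcdic.hex())

# Case 2: the raw EBCDIC byte in the physical source, encoded as a UTF-16 literal.
as_scalar = EBCDIC_0X36.decode(EBCDIC)
print(f"U+{ord(as_scalar):04X}")

# The mapping is a bijection on this pair, so the bytes round-trip even though
# the two control characters do not share a meaning.
print(as_scalar == C1_U0096 and as_ebcdic == EBCDIC_0X36)
```

The codec pairs the two values mechanically; whether that pairing is meaningful for a given program is exactly the out-of-band knowledge the message says the conversion depends on.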

I think this is similar to how raw string literals need some sort of mechanism.

On Tue, Jun 9, 2020, 18:32 Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, Jun 9, 2020 at 5:21 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Tue, 9 Jun 2020 at 23:06, Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot@gmail.com> wrote:

On Tue, 9 Jun 2020 at 22:17, Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:

On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney@gmail.com> wrote:
One thing I have realized while working on identifiers is that after conversion from whatever the sources are, lexing and parsing are symbolic. That is, 'a' doesn't have a value until it's rendered into a literal. That is, "The values of the members of the execution character sets and the sets of additional members are locale-specific." (http://eel.is/c++draft/lex.charset#3.sentence-5) really only comes into play when rendering the "execution character set" into characters or strings. The execution character set and the source character set exist in the same logical space right now, and the "source character set" isn't what is in source files today.

Yep, and they don't have to have a value either. Identifiers are not sorted, etc.
Everything in lex is symbolic anyway; the phases don't exist in practice.
However, the internal representation being isomorphic to Unicode, it would be possible to describe it in terms of Unicode with no observable behavior change.
I would like to allow characters not present in Unicode within character literals, string literals, comments, and header names. More abstractly, I would like to allow source -> encoding-used-for-output conversion.

Do you have an example of a use case you want to support?
I am still evaluating the round-trip mapping for EBCDIC.

I believe Unicode -> EBCDIC round-trips perfectly using the process described in https://www.unicode.org/reports/tr16/tr16-8.html
The tricky part is the control characters, which this TR maps to the C1 Unicode control characters.
I'm not questioning the ability to round-trip. I am questioning the ability to avoid conflating certain EBCDIC control characters with certain C1 control characters. For example, it seems clear to me that U+0096 START OF GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended usage, but the mapping of these to, respectively, Numeric Backspace and Graphic Escape does not retain semantic meaning. If such EBCDIC characters appear within a literal that should be encoded in a Unicode encoding, I find it potentially questionable if the string is considered well-formed. I have similar thoughts for the case where a UCN escape for such a C1 control character appears in a string that is to be encoded in EBCDIC.

In other words, I do not consider the mapping (which is useful if you track out-of-band whether the data was originally EBCDIC or not) to establish the presence of the EBCDIC control characters in Unicode. These opinions do not necessarily represent those of IBM.

-- HT