One thing I have realized while working on identifiers is that, after conversion from whatever the sources are, lexing and parsing are symbolic. That is, 'a' doesn't have a value until it's rendered into a literal. So the sentence "The values of the members of the execution character sets and the sets of additional members are locale-specific." (http://eel.is/c++draft/lex.charset#3.sentence-5) really only comes into play when rendering the "execution character set" into characters or strings. The execution character set and the source character set exist in the same logical space right now, and the "source character set" isn't what is in source files today.
Yep, and they don't have to have a value either. Identifiers are not sorted, etc.
Everything in lex is symbolic anyway; the phases don't exist in practice.
However, since the internal representation is isomorphic to Unicode, it would be possible to describe it in terms of Unicode with no observable behavior change.
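As a rough illustration of the "symbolic until rendered" point above, here is a minimal Python sketch. The function name `render_char_literal` is hypothetical, and Python's `ascii` and `cp037` codecs merely stand in for two possible execution character sets; nothing here is how an actual C++ implementation is structured.

```python
# Sketch: a lexed character stays symbolic; a code unit value only appears
# once the literal is rendered into a concrete execution encoding.
# `render_char_literal` is a made-up helper for illustration.

def render_char_literal(symbol: str, execution_encoding: str) -> int:
    """Return the code unit value of a single-character literal."""
    encoded = symbol.encode(execution_encoding)
    if len(encoded) != 1:
        raise ValueError(
            f"{symbol!r} is not a single code unit in {execution_encoding}"
        )
    return encoded[0]


# The same symbolic character, rendered for two execution character sets.
ascii_value = render_char_literal("a", "ascii")   # ASCII-based target
ebcdic_value = render_char_literal("a", "cp037")  # EBCDIC code page 037 target
print(hex(ascii_value), hex(ebcdic_value))        # prints "0x61 0x81"
```

Up to the call to `render_char_literal`, nothing depends on the numeric value of 'a'; only the final rendering step does.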
I would like to allow characters not present in Unicode within character literals, string literals, comments, and header names. More abstractly, I would like to allow source -> encoding-used-for-output conversion.
Do you have an example of a use case you want to support?
I am still evaluating the round-trip mapping for EBCDIC.
The tricky part is the control characters, which this TR maps to the C1 Unicode control characters.
I'm not questioning the ability to round-trip. I am questioning the ability to avoid conflating certain EBCDIC control characters with certain C1 control characters. For example, it seems clear to me that U+0096 START OF GUARDED AREA and U+0097 END OF GUARDED AREA are paired in the intended usage, but the mapping of these to, respectively, Numeric Backspace and Graphic Escape does not retain semantic meaning. If such EBCDIC characters appear within a literal that should be encoded in a Unicode encoding, I find it potentially questionable if the string is considered well-formed. I have similar thoughts for the case where a UCN escape for such a C1 control character appears in a string that is to be encoded in EBCDIC.
In other words, I do not consider the mapping (which is useful if you track out-of-band whether the data was originally EBCDIC or not) to establish the presence of the EBCDIC control characters in Unicode. These opinions do not necessarily represent those of IBM.
-- HT
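To make the U+0096/U+0097 example concrete, here is a small Python sketch. It uses Python's built-in `cp037` codec as a stand-in for the mapping under discussion; whether that codec agrees with the TR's table in every position is an assumption. The byte values 0x36 and 0x08 and the name tables are typed in by hand for illustration, and `describe` is a hypothetical helper.

```python
# EBCDIC (code page 037) bytes for the two controls mentioned above.
# The names are hand-written labels, not looked up from any library.
EBCDIC_CONTROLS = {0x36: "Numeric Backspace", 0x08: "Graphic Escape"}

# Names of the C1 controls those bytes land on after decoding.
C1_NAMES = {0x96: "START OF GUARDED AREA", 0x97: "END OF GUARDED AREA"}


def describe(byte: int) -> tuple[int, bool]:
    """Decode one EBCDIC byte; return (code point, whether it round-trips)."""
    ch = bytes([byte]).decode("cp037")
    round_trips = ch.encode("cp037") == bytes([byte])
    return ord(ch), round_trips


for byte, ebcdic_name in EBCDIC_CONTROLS.items():
    code_point, ok = describe(byte)
    print(
        f"0x{byte:02X} {ebcdic_name} -> U+{code_point:04X} "
        f"{C1_NAMES[code_point]} (round-trips: {ok})"
    )
```

The bytes survive the round trip, but the name on each side of the arrow describes a different control function, which is the conflation concern raised in the message above.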