Hey.
It occurred to me that while we have been discussing terminology, we have not clearly stated our
objectives.
Ultimately, the observed behavior should be identical in most respects to what implementations do today. In particular, we don't want to change the mapping for a well-formed program, to avoid silent changes, and we want to keep supporting the set of source encodings supported by existing implementations.
But (I think) we should:
- Tighten the specification to describe a semantic mapping rather than an implementation-defined one, while also making sure that mappings prescribed by vendors or by Unicode, such as UTF-EBCDIC, are not accidentally forbidden.
- Make the program ill-formed if the source file contains invalid code unit sequences or unmappable characters (the latter of which is not a concern for existing implementations).
- Specify that Unicode-encoded (and especially UTF-8-encoded) files preserve the sequence of code points in phase 1 (i.e. no normalization).
- Mandate support for UTF-8 files.
- Find a better wording mechanism for the handling of raw string literals, which most likely means handling universal-character-names that appear verbatim in the source file as escape sequences differently from what phase 1 calls extended characters.
- Neither actively prevent nor mandate that, if the source encoding and the execution or wide execution encoding are the same, the sequence of bytes in a string literal is preserved.
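
To make the points about invalid code unit sequences and about preserving code points a bit more concrete, here is a rough sketch of the behavior I have in mind for a UTF-8 source file. This is only an illustration under my own assumptions, not what any compiler does internally: `decode_utf8_strict` and `usage_example` are hypothetical names, and an empty result stands in for "the program is ill-formed".

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not taken from any implementation).
// Decodes UTF-8 strictly: any invalid code unit sequence yields std::nullopt,
// which models "the program is ill-formed".
// Code points are returned exactly as encoded; no normalization is applied.
std::optional<std::vector<char32_t>> decode_utf8_strict(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;   // code point being assembled
        char32_t min = 0;  // smallest value allowed for this sequence length
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond Unicode range
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage sketch: the decomposed and precomposed spellings of "é" stay distinct.
void usage_example() {
    const auto decomposed = decode_utf8_strict("e\xCC\x81");  // U+0065 U+0301
    const auto precomposed = decode_utf8_strict("\xC3\xA9");  // U+00E9
    assert(decomposed && decomposed->size() == 2);
    assert(precomposed && precomposed->size() == 1);
    assert(!decode_utf8_strict("\xC0\xAF"));  // overlong '/', rejected
}
```

The point of the sketch is that the two spellings of "é" come out as different code point sequences, and that a malformed file is rejected outright rather than being silently repaired or replaced.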
Did I miss anything?