Date: Mon, 15 Jun 2020 13:14:28 +0200
Hey.
It occurred to me that while we discuss terminology, we have not clearly
stated our objectives.
Ultimately the observed behavior should be identical in most respects to
what implementers do today. In particular, we don't want to change the
mapping for a well-formed program, to avoid silent changes, and we want to
keep supporting the set of source encodings supported by existing
implementations.
But (I think) we should:
- Tighten the specification to describe a semantic mapping rather than an
implementation-defined one, while also making sure that mappings
prescribed by vendors or by Unicode, such as UTF-EBCDIC, are
not accidentally forbidden.
- Make the program ill-formed if the source file contains invalid code unit
sequences or unmappable characters (the latter of which is not a concern for
existing implementations).
- Specify that Unicode-encoded (and especially UTF-8-encoded) files
preserve the sequence of code points in phase 1 (i.e. no normalization).
- Mandate support for UTF-8 files.
- Find a better wording mechanism for the handling of raw string literals,
which most likely means that we want to handle universal-character-names
that appear in the source file as verbatim escape sequences differently
from what phase 1 calls extended characters.
- Neither actively prevent nor mandate that, if the source encoding and the
execution or wide execution encoding are the same, the sequence of bytes in
string literals is preserved.
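
To make the raw string literal point concrete, here is a minimal sketch of
the behavior in question. The function names are hypothetical and used for
illustration only. It shows that a universal-character-name spelled out in
the source stays as six verbatim characters inside a raw string literal,
while the same spelling in an ordinary literal denotes a single character
whose byte length depends on the execution encoding.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: a raw string literal keeps the spelling \u00E9
// verbatim, i.e. a backslash followed by 'u', '0', '0', 'E', '9'.
constexpr std::string_view raw_spelling() { return R"(\u00E9)"; }

// Hypothetical helper: the same six characters written with an escaped
// backslash in an ordinary string literal.
constexpr std::string_view escaped_spelling() { return "\\u00E9"; }

// Hypothetical helper: in an ordinary literal, \u00E9 is a
// universal-character-name for U+00E9. Its length in bytes depends on the
// execution encoding (assumed here to be able to represent U+00E9).
constexpr std::string_view ucn_literal() { return "\u00E9"; }
```

With these, raw_spelling() compares equal to escaped_spelling(), while
ucn_literal() does not, since it does not start with a backslash.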
Did I miss anything?
Received on 2020-06-15 06:17:49