On Mon, 15 Jun 2020 at 18:17, Tom Honermann <tom@honermann.net> wrote:
On 6/15/20 12:11 PM, Corentin Jabot wrote:


On Mon, 15 Jun 2020 at 18:06, Tom Honermann <tom@honermann.net> wrote:
On 6/15/20 11:47 AM, Corentin Jabot wrote:


On Mon, 15 Jun 2020 at 17:30, Tom Honermann <tom@honermann.net> wrote:
On 6/15/20 4:41 AM, Corentin Jabot via SG16 wrote:


On Mon, 15 Jun 2020 at 09:00, Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 15/06/2020 00.06, Hubert Tong wrote:
> The presence of a UCN for a C1 (non-EBCDIC) control character in a supposedly-EBCDIC string is not immediately indicative of an error.
In this example, is the UCN intended to mean the conventionally mapped
EBCDIC control character, or something else?
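
For instance (IBM-1047-flavoured, with the conventional control mapping in which EBCDIC NL 0x15 corresponds to U+0085; treat the exact values as illustrative):

    // Is U+0085 (the C1 NEL control, spelled as a UCN) intended here to
    // denote the conventionally mapped EBCDIC NL control (0x15), or the
    // C1 control itself?
    const char s[] = "\u0085";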

Beyond EBCDIC control characters, do we know of any other situation
where the input-to-Unicode mapping is not semantics-preserving, or is lossy?
It would be good to keep a list in one of the upcoming papers, for
the permanent record.

There are three scenarios I can think of:
  • The control characters for EBCDIC, but also for other encodings that have more control characters than exist in ASCII; all of these map to C0/C1 in an application-specific manner
  • Some (~20) GB 18030 characters map to the Unicode private use area, which also doesn't "preserve semantics" (see the sketch after this list)
  • Some Big5 characters do not have a Unicode mapping at all (these are exclusively place and person names, and don't concern, for example, the Windows Big5 code pages)
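
For the GB 18030 bullet, one concrete pair (from the 2000-era mapping tables, if memory serves; treat the exact values as illustrative): the two-byte sequence 0xA8 0xBC was mapped to the private-use code point U+E7C7, and GB 18030-2005 later remapped it to U+1E3F:

    GB 18030 bytes 0xA8 0xBC -> U+E7C7   (2000 mapping; PUA, semantics by private agreement)
    GB 18030 bytes 0xA8 0xBC -> U+1E3F   (2005 mapping)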

There are also the characters that have duplicate code point assignments in Shift-JIS such that one of them won't round trip through Unicode.  It sounds like GB 18030 has one such case as well.
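
A toy model of the duplicate-assignment case (the byte values and the code point are placeholders, not actual CP932 table entries):

    #include <cassert>
    #include <map>

    int main() {
        // Two distinct Shift-JIS code points conventionally mapped to
        // the same Unicode scalar value (placeholder values).
        std::map<int, char32_t> to_unicode{{0xED40, U'\u9ED2'},
                                           {0xFA5C, U'\u9ED2'}};
        // Going the other way, the encoder has to pick one of them.
        std::map<char32_t, int> from_unicode{{U'\u9ED2', 0xED40}};

        int s2 = 0xFA5C;
        assert(to_unicode[s2] == to_unicode[0xED40]);  // same U
        assert(from_unicode[to_unicode[s2]] != s2);    // s2 doesn't round-trip
    }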

Semantics-preserving is different from round-trippable.
Absolutely; we know of cases where semantics are preserved and round-tripping is not, and of different cases where round-tripping is preserved but semantics are not.

Consider a source character S1, its internal representation U, and two possible representations of that character in the execution encoding, C1 and C2.
The two following mappings are both valid, and preserve the semantics in phases 1 and 5:

S1 -> U -> C1
S1 -> U -> C2

It isn't observable from within the program which mapping was chosen, and therefore an implementation could choose to prefer
the mapping that happens to have the same byte value as in the source.
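
Concretely, with Shift-JIS-flavoured placeholder values (matching the toy model above, not actual table entries):

    S1 = 0xFA 0x5C -> U+9ED2 -> C1 = 0xFA 0x5C   (same bytes as the source)
    S1 = 0xFA 0x5C -> U+9ED2 -> C2 = 0xED 0x40   (the other conforming choice)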

The Shift-JIS case is one where characters S1 and S2 both map to the same U, and therefore to the same C:

S1 -> U -> C
S2 -> U -> C

That difference is (expected to be) observable in raw string literals.
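
A sketch of why (assuming a Shift-JIS source and execution encoding; <S1> and <S2> are placeholders for the two duplicate source spellings of the same kanji, which I won't try to reproduce here):

    // Ordinary literals: both spellings pass through the same U and come
    // out as the same execution-encoding bytes.
    const char a[] = "<S1>";     // encodes C
    const char b[] = "<S2>";     // encodes the same C; a and b compare equal

    // Raw literals: the phase 1-2 transformations are reverted, so the
    // original source spelling survives to phase 5.
    const char c[] = R"(<S1>)";  // expected to keep S1's bytes
    const char d[] = R"(<S2>)";  // expected to keep S2's bytes; c and d differ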

That behavior should, imo, neither be prescribed nor prevented.

While the _wording_ loses information about the source encoding after phase 1, that doesn't mean an implementation has to pretend it doesn't
have perfect information when considering this scenario (though prescribing that choice would severely reduce implementation freedom and wouldn't match existing practice, both of which we should avoid).

I believe we agree here.  The problem is that the wording prevents discussing the scenario in formal terms.


What would be the benefit of discussing that in the wording?

Avoiding long email threads and confusion about what the heck the standard is specifying :)

Concretely, you would want a note saying that when multiple mappings are possible, it is implementation-defined which one is chosen?

Tom.