Date: Sat, 30 Jan 2021 20:26:11 +0100
On 30/01/2021 20.16, Corentin via SG16 wrote:
basic execution character set shall be represented in each locale-specific encoding.
>
>
> I think we want to say (to match existing practice) that the execution environment has an encoding / character set that is either the same as or a superset of the execution character set (same values, but possibly with extra members).
> It is unclear that "locale-specific" currently says that.
>
> I don't think the encoding interpretation of the above (which I think was the intended interpretation) actually matches existing practice (except perhaps for the "C" locale). That different locales present in runtime environments may encode characters within the basic execution character set differently is a practical reality (web search for "PPCS variant characters").
>
>
> Unfortunately, when that's the case (and I agree that's the case more often than we'd like, another good example is shift-jis/win-1251), string literals cannot be interpreted properly by "locale specific" runtime functions.
> Such a runtime function expects an encoding that differs from that of the string literal, so it cannot interpret the literal correctly, which can lead to mojibake, etc.
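To make the mismatch described above concrete, here is a minimal sketch. It assumes a literal whose bytes are UTF-8 and a consumer that reads bytes as ISO-8859-1; that pairing is chosen only because it is easy to reproduce without installed locales, and it is not the shift-jis/win-1251 case mentioned in the quote. The helper name latin1_to_utf8 is hypothetical and merely stands in for a locale-specific runtime function.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a runtime function that believes its input is
// ISO-8859-1: every byte is taken as one code point (U+0000..U+00FF) and
// re-encoded as UTF-8 so the result can be inspected.
std::string latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                  // ASCII passes through
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
        }
    }
    return out;
}

// Usage: the two UTF-8 code units of U+00E9 are misread as two separate
// characters, U+00C3 and U+00A9, which is the classic mojibake pattern.
void mojibake_example() {
    const std::string literal_bytes = "\xC3\xA9";  // U+00E9 encoded as UTF-8
    assert(latin1_to_utf8(literal_bytes) == "\xC3\x83\xC2\xA9");
}
```

Characters that both encodings represent identically (here, plain ASCII) survive unchanged, which is why such mismatches often go unnoticed until non-ASCII text appears.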
From a core language perspective, we have a compile-time encoding for literals
(i.e. mapping of character sequences inside literals to code unit sequences).
The actual execution environment of the program (possibly conveyed via locale)
might not be compatible with that. For the core language, I think we should
simply replace "execution character set" with "literal encoding" (narrow and wide),
because we never actually care about character sets, just about encoding,
i.e. a sequence of code units with which to initialize a string literal object.
Maybe locale-dependent library functions just need to get a divorce from that.
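As a rough illustration of the "sequence of code units" view, the sketch below lists the code units with which a literal initializes its array, without consulting any locale. The function name code_units is made up for this example. The values for ordinary and wide literals depend on the implementation's literal encoding, so the usage check only pins down an ASCII-based case and a u8 literal, whose encoding is fixed as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper: returns the code units of a string literal,
// excluding the terminating null. Works for char, wchar_t, char8_t, etc.
template <typename CharT, std::size_t N>
std::vector<unsigned long> code_units(const CharT (&literal)[N]) {
    std::vector<unsigned long> units;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        // Go through the unsigned counterpart so that values above 0x7F
        // are not sign-extended when plain char is signed.
        units.push_back(static_cast<unsigned long>(
            static_cast<std::make_unsigned_t<CharT>>(literal[i])));
    }
    return units;
}

// Usage: the compiler has already chosen the code units; nothing here
// depends on the locale active at run time.
void code_units_example() {
    // Assumes an ASCII-compatible ordinary literal encoding.
    assert((code_units("A") == std::vector<unsigned long>{0x41}));
    // u8 literals are always UTF-8: U+00E9 becomes the two units C3 A9.
    assert((code_units(u8"\u00E9") == std::vector<unsigned long>{0xC3, 0xA9}));
}
```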
Jens
Received on 2021-01-30 13:26:16