On Sat, Sep 18, 2021 at 3:36 AM Corentin <corentin.jabot@gmail.com> wrote:


On Fri, Sep 17, 2021 at 11:24 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Thu, Sep 16, 2021 at 5:04 PM Corentin <corentin.jabot@gmail.com> wrote:


On Thu, Sep 16, 2021 at 10:23 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot@gmail.com> wrote:


Given the many, many encodings, a lot of which are separated by only one or two code points, how in that context do we define variants?

I think you are raising a new question about the design: If a system has more than one implementation of the same encoding, should it return the same registered character set to represent more than one of those encodings?

Well, two implementations of the same encoding would be the same, and for registered character sets, aliases would cover that use case.

Sorry (my fault for not being clear): I meant the question not for the "same encoding" but for encodings separated by a small percentage of differences. For example, a near match for a registered character set where one is modified for the euro sign.

No need to apologize!
Whether an implementation would choose to advertise a specific encoding as another one will depend a lot on current practice on that specific platform.
Most of the "European" encodings that were modified for the euro have been restandardized and republished under a different name (8859-15 versus 8859-1, for example).

I think I can work with that; thanks.
 

P1885 does not exist in a vacuum. And the existing wording does place a requirement between the narrow and wide execution encodings. I am somewhat convinced that P1885 is not the place to address the wchar_t problems re: UCS-2 versus UTF-16, but I will point out that P1885 theoretically exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not perfectly valid. Previously, only the UTF-16 case was clearly misaligned with the standard; with P1885, the UCS-2 case is also misaligned.

Sure, although I am not aware of platforms for which UCS-2 is currently assumed.

AIX is considered to use UCS-2 for the 2-byte version of `wchar_t` (Clang and GCC may use UTF-16 in place of UCS-2 when encoding literals). The same goes for z/OS. It is one of those implementations where `wcstombs` for some locales gives you CESU-8 encoded output when surrogate pairs are encountered (but `mbrtoc32` will not interpret the surrogate pairs, so the narrow encoding is not CESU-8).
 
 
 
  • Return an encoding that is different from that used by mbstowcs, for example (the wide environment is, rather, an environment that you would expect wprintf to consume without creating mojibake)
This particular intent has additional limitations: The understanding of locales with the same name is not consistent in practice on various platforms between 32-bit and 64-bit processes.

Do you have specific examples in mind? I am not aware of platforms where wchar_t would be 64 bits. Or perhaps the size of wchar_t is not your concern; can you clarify?

The size of wchar_t for the 31/32-bit ABI (which has more history) and for the 64-bit ABI is different on z/OS and on AIX. For both platforms, wchar_t is 2 bytes for a 31/32-bit process and 4 bytes for a 64-bit process.
 
User expectations of something completely novel are rather hard to guess at. Should the narrow and wide EBCDIC versions of the same character set be called the same charset? For cases where there are no multibyte characters, most indications are "yes". For cases where there are multibyte characters, it seems to be more up in the air. If the answer is "no", then I imagine we end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the endianness is always big endian).

Despite the poor naming, IANA specifically registers encodings, hence the class `text_encoding`.
In your scenario, narrow and wide EBCDIC would have different sequences of code units and different code unit types, and as such would ideally be considered different encodings.

For EBCDIC without multibyte characters in the narrow encoding, the sequences of code units are the same (if ignoring the code unit type difference and correcting for endianness when read from a file). The guidance that they are ideally considered different encodings is reasonable and I would suggest that the paper's wide EBCDIC example show an unregistered "x-EBCDIC-US-4byte" as a potential answer. I suppose being implicitly "native endian" makes some sense.
 
I will admit it has been difficult for me to find any information about wide EBCDIC, so I don't know if and how it is currently referred to by IBM implementations, whether there are multiple character sets/encodings defined as some flavor of wide EBCDIC, etc.

As far as I know, these encodings just exist and are not talked about or really named as something separate from the coded character set.
 


The placement of the "implementation-defined" in the currently proposed wording for environment() is hard for me to read this way. Also, the wording for literal() does not say "implementation-defined".

The literal encoding is already implementation-defined, and we do not need as much implementation freedom here.
I am happy to reword that sentence if you think it would be clearer.

I think that wording is fine with the addition below.
 
  
Minimal wording (for the general/synopsis section):
How a text_encoding object is determined to be representative of a character encoding implemented in the translation or execution environment is implementation-defined.

I would be happy to add that

Thanks.
 
 
I would again emphasize that some of the issues stem from the novelty of trying to name wide encodings, for which there has not been sufficient need for established existing practice to emerge. Do you have a list of existing APIs that provide names for wide encodings out of locale information?

On Windows, this is a documented property of the platform: UTF-16.
On some POSIX platforms (Linux, macOS), this will always be UTF-32.
On others (like FreeBSD), it is not documented beyond being "implementation-defined", and may be some wide Shift JIS or fixed-width EUC. There is currently no API to infer what these wide encodings are.

Possible implementation strategies for FreeBSD include:
  • Maintaining a mapping of narrow -> wide encodings, as these platforms are specified to have one.
User-defined locales on AIX can mess with that. The wide encoding (when not UCS-2 or UTF-32) is realized by replacement implementations of mbstowcs, wcstombs, etc.
  • Modifying the libc to expose the name of the wide encoding in a way similar to nl_langinfo(CODESET), for use by the C++ library. I hope this will be the long-term outcome.
  • Returning id::unknown, which is what I expect these platforms to do initially.
I suspect the answer for EBCDIC platforms might be very similar?

Pretty much the mapping case.

I'm not sure that any of the strategies above other than "unknown" avoids inventing new names for wide encodings. It seems the title of the paper is more applicable than I had appreciated until now. :)