Date: Mon, 22 Jul 2024 23:58:08 +0300
How does C++26 exposing the IANA names and aliases help interoperability? The lesson from the work that resulted in the WHATWG Encoding Standard is that using the IANA semantics (either what decoding procedure the labels mean or how the labels are matched) results in worse compatibility with existing content than promoting labels to mean the superset encodings of the IANA encodings and matching labels without ignoring hyphens and underscores. (Opera figured out in the Presto days that ignoring hyphens and underscores in matching wasn’t Web-compatible.)
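To make the matching difference concrete, here is a minimal sketch of the two styles of label comparison. The function names (ascii_lower, loose_match, strict_match) are made up for illustration and are not part of any standard or library. The loose variant is a simplification that only drops hyphens and underscores and folds ASCII case; the strict variant strips leading and trailing ASCII whitespace and folds ASCII case, which is how I understand the Encoding Standard compares a label before looking it up in its table.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: fold ASCII letters to lower case, leave other bytes alone.
std::string ascii_lower(std::string_view s) {
    std::string out;
    out.reserve(s.size());
    for (char c : s) {
        out.push_back((c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c);
    }
    return out;
}

// Loose matching (simplified assumption): ASCII case-insensitive,
// with '-' and '_' ignored entirely.
std::string loose_key(std::string_view label) {
    std::string out;
    for (char c : ascii_lower(label)) {
        if (c != '-' && c != '_') {
            out.push_back(c);
        }
    }
    return out;
}

bool loose_match(std::string_view a, std::string_view b) {
    return loose_key(a) == loose_key(b);
}

// Strip leading and trailing ASCII whitespace (TAB, LF, FF, CR, SPACE).
std::string_view strip_ascii_whitespace(std::string_view s) {
    constexpr std::string_view ws = "\t\n\f\r ";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string_view::npos) {
        return {};
    }
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Strict matching: whitespace-trimmed, ASCII case-insensitive, but hyphens
// and underscores are significant.
bool strict_match(std::string_view a, std::string_view b) {
    return ascii_lower(strip_ascii_whitespace(a)) ==
           ascii_lower(strip_ascii_whitespace(b));
}
```

With these definitions, loose_match("ISO-8859-1", "iso_8859_1") is true while strict_match on the same pair is false. Under strict matching a spelling variant is only recognized if the label table lists it explicitly.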
As for having an API for querying the IANA identity of the execution encoding, how does that benefit interop? Surely the most interoperable approach is to use UTF-8 unconditionally and on Windows to use the property in exe metadata to make the terminal treat the output stream from the exe as UTF-8 even when the chcp-reported terminal state is something else. By the time C++26 ships, all still-supported general-availability branches of Windows will support https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page .
On Mon, Jul 22, 2024, at 1:53 PM, Corentin Jabot wrote:
> This is exactly why names exist and are important.
> IANA gives you a set of registered names that can be used for interoperability (although, as Henri pointed out, the interpretation of these names can in some cases vary by platform, and mapping from names to an actual mapping requires care).
> User-provided names, along with aliases, let users and platforms support non-registered, non-portable encodings.
>
>
> On Sun, Jul 21, 2024 at 10:21 PM Henri Sivonen via SG16 <sg16_at_[hidden]> wrote:
>> What’s the canonical string name for Windows code page 949? What about 950? The pattern-following answers windows-949 and windows-950 would mean not using IANA naming as registered. (java.nio uses x-windows-949 and x-windows-950. See https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . That page shows a problem with IANA names: Java started using x-windows-874, but then windows-874 was registered, but by then changing the canonical name would have been an API break.)
>>
>> On Sun, Jul 21, 2024, at 10:14 PM, Tiago Freire wrote:
>>> I'm not a fan of mapping encodings to numbers.
>>> I don't see the point of throwing every single encoding and the kitchen sink at it; most of them should be obsolete anyhow, and I'm pretty sure there will still be something a user might want to do that would be left out.
>>> But these encodings have names, why not compile time strings as identifiers? Probably those names would be there regardless, why not have them pull double duty?
>>>
>>>
>>> *From:* SG16 <sg16-bounces_at_[hidden]> on behalf of Thiago Macieira via SG16 <sg16_at_[hidden]>
>>> *Sent:* Sunday, July 21, 2024 8:45:15 PM
>>> *To:* SG16 <sg16_at_[hidden]>; Henri Sivonen <hsivonen_at_[hidden]>
>>> *Cc:* Thiago Macieira <thiago_at_[hidden]>
>>> *Subject:* Re: [isocpp-sg16] Use cases for user construction of text_encoding by name
>>>
>>> On Sunday 21 July 2024 11:39:54 GMT-7 Henri Sivonen wrote:
>>> > > Is there such a 1:1 mapping?
>>> >
>>> > I believe not: Windows code pages 950 (Traditional Chinese) and 949 (Korean)
>>> > don't appear to have IANA registrations. They differ from Big5 and EUC-KR
>>> > in a way analogous to how windows-1252 differs from ISO-8859-1, how
>>> > windows-31j differs from Shift_JIS, and how GBK differs from GB2312.
>>>
>>> How about some other identifier roster that has such a 1:1 mapping? We don't have
>>> to use the IANA registrations; we can specify one that works.
>>>
>>> --
>>> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
>>> Principal Engineer - Intel DCAI Platform & System Engineering
>>>
>>>
>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>> --
>> Henri Sivonen
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
-- Henri Sivonen
Received on 2024-07-22 20:58:56