ISOCPP sg16 List: [isocpp-sg16] Use cases for user construction of text

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sun, 21 Jul 2024 14:53:06 +0300

Via LWN, I noticed that https://isocpp.org/files/papers/P1885R8.pdf is on track for C++26.

The paper says: "To support other use cases such as interoperability with other libraries or internet protocols, text_encoding can be constructed by users"

When is such construction by the user the right answer to a use case faced by a C++ programmer?

Consider the following use cases:

1. The program downloads a textual resource from the Web. What should the program use to decode the character encoding aspect of the resource?

Correct answer: An implementation of the WHATWG Encoding Standard.

2. The program reads email from mbox files. What should the program use to decode the character encoding aspect of the emails?

Likely correct answer: An implementation of the WHATWG Encoding Standard augmented with recognition of java.io names for JavaMail compatibility and augmented with an implementation of UTF-7.

3. The program reads legacy Excel files. What should the program use to decode the textual aspects given a Windows code page number embedded in the file?

Correct answer: MultiByteToWideChar from Kernel32.dll on Windows, or another implementation that recognizes the Windows code page numbers that can occur as the Windows system code page and that implements close enough decoding semantics.

To what programming question is the right answer "Construct C++ standard-library text_encoding by name"? Is that situation more common than the three situations above?

Suppose in cases 1 and 2 above there's charset=EUC-KR. Let's suppose in case 3 the code page number is 949. How do you represent this using the text_encoding API?
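
For concreteness, here is a minimal sketch of what user construction by name looks like. It assumes a standard library that ships <text_encoding>, detected through the __cpp_lib_text_encoding feature-test macro. iana_mib_for_label is a hypothetical helper written for this illustration, not something from P1885. Without library support the helper simply returns std::nullopt.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#if __has_include(<version>)
#  include <version>        // defines __cpp_lib_text_encoding when available
#endif
#ifdef __cpp_lib_text_encoding
#  include <text_encoding>
#endif

// Hypothetical helper: returns the IANA MIBenum that std::text_encoding
// associates with `label`, or std::nullopt when the library lacks
// <text_encoding> or the label is too long to be passed to the constructor.
inline std::optional<long> iana_mib_for_label(std::string_view label) {
    // Constructor precondition: the name contains no embedded NUL.
    assert(label.find('\0') == std::string_view::npos);
#ifdef __cpp_lib_text_encoding
    // Constructor precondition: the name is at most max_name_length long.
    if (label.size() > std::text_encoding::max_name_length) {
        return std::nullopt;
    }
    const std::text_encoding enc(label);
    // Names that match no registered name or alias yield id::other.
    return static_cast<long>(enc.mib());
#else
    (void)label;
    return std::nullopt;
#endif
}
```

With library support, iana_mib_for_label("EUC-KR") should yield 38, the MIBenum of the IANA EUC-KR registration. That value only identifies the registry entry. It does not say whether RFC 1557, WHATWG EUC-KR, or code page 949 decoding semantics are meant, and the sketch offers no answer for the code page number 949.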

The IANA charset registry maps EUC-KR to https://www.rfc-editor.org/rfc/rfc1557.html . That RFC does not describe the interoperable implementation requirements for the above three use cases. What's called for is https://encoding.spec.whatwg.org/#euc-kr-decoder or an exact implementation of Windows code page 949. The two differ in that WHATWG EUC-KR lacks the EUDC range of code page 949, a difference that is typically not relevant in practice.

https://isocpp.org/files/papers/P1885R8.pdf compares the WHATWG and IANA encoding lists on the level of name similarity instead of comparing actual encoding definitions.

Notably, the closest IANA encoding to WHATWG Shift_JIS is IANA windows-31j (not IANA Shift_JIS) and, AFAICT, the closest IANA encoding to WHATWG Big5 is IANA Big5-HKSCS (not IANA Big5). In the case of IANA GBK, the WHATWG decoder is GB18030 (and the encoder is GBK). (Big5 and EUC-JP also have asymmetric encoders in the WHATWG Encoding Standard.)

AFAICT, IANA doesn't have any encoding that would be to IANA EUC-KR what IANA windows-31j is to IANA Shift_JIS. That is, an IANA registration corresponding to Windows code page 949 is missing.
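
The name mismatches described in the two paragraphs above can be written down as a small lookup table. This is only a sketch that restates those claims. ClosestIana, kClosest, find_row, and usage_example are made-up names, and the table is not taken from any specification.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// One row per WHATWG encoding discussed above (illustrative only).
struct ClosestIana {
    std::string_view whatwg_name;             // name in the WHATWG Encoding Standard
    std::optional<std::string_view> closest;  // closest IANA registration, if any
};

inline constexpr std::array<ClosestIana, 3> kClosest{{
    {"Shift_JIS", "windows-31j"},  // not IANA Shift_JIS
    {"Big5", "Big5-HKSCS"},        // not IANA Big5 (AFAICT)
    {"EUC-KR", std::nullopt},      // no IANA registration for code page 949
}};

// Returns the table row for an exact, case-sensitive WHATWG name,
// or nullptr when the name is not in this small table.
inline const ClosestIana* find_row(std::string_view whatwg_name) {
    for (const auto& row : kClosest) {
        if (row.whatwg_name == whatwg_name) {
            return &row;
        }
    }
    return nullptr;
}

// Usage: the same-named IANA entry is not the closest match for Shift_JIS,
// and EUC-KR has no close IANA counterpart at all.
inline void usage_example() {
    const ClosestIana* sjis = find_row("Shift_JIS");
    assert(sjis != nullptr && sjis->closest.has_value());
    const ClosestIana* euckr = find_row("EUC-KR");
    assert(euckr != nullptr && !euckr->closest.has_value());
}
```

The point of the empty EUC-KR entry is that a name-based mapping has nothing to point at, even though the encoding is in everyday use.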

IANA wouldn't change the existing registration to match practice and also wouldn't register new names that Microsoft was unlikely to implement: https://data.iana.org/archive/ietf-charsets/msg01698.html .

Apart from user construction, what is std::text_encoding::environment() expected to return on a Windows system that was installed with Korean as the language (i.e. the Windows legacy code page is 949)?

-- 
Henri Sivonen (in personal capacity; no work context implied)

Received on 2024-07-21 11:54:07