Re: QoI for escaped formatting of non-Unicode-encoding strings: deployment overhead versus ideal behaviour

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 16 Jul 2023 11:10:07 -0400
On Sun, Jul 16, 2023 at 10:29 AM Victor Zverovich <
victor.zverovich_at_[hidden]> wrote:

> Failures and unsupported encodings can always be handled by falling back
> on overescaping.
>

Thanks. That is a useful clarification of expectations (I did understand
that, as far as conformance goes, this is allowed).


>
> > Additionally, I am not sure that the policy chosen for unassigned
> > codepoints should be the same between Unicode and non-Unicode encodings.
>
> Could you elaborate?
>

My statement arose because I was thinking that Unicode can assign
codepoints in the future (and other encodings, not so much); however, I
failed to account for pre-existing extensions for various non-Unicode
encodings having differences in the assigned area of the codespace (and for
user-defined cases).

My understanding of the status quo is that the intent is for unassigned
codepoints to be passed through unescaped.


>
> Cheers,
> Victor
>
> On Wed, Jul 12, 2023 at 4:38 PM Hubert Tong via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Hi SG 16:
>>
>> When escaping strings (an operation likely done at runtime), some
>> information about the literal encoding (a property of the compilation
>> environment) is needed.
>>
>> For "ideal behaviour", it seems to me that the ability to
>> hardcode/capture at compile time/deploy with the runtime is needed for the
>> following:
>> 1. Understanding of the encoding scheme (e.g., valid initial code units,
>> valid continuation code units, etc.)
>> 2. The set of characters considered separators or non-printable characters
>>
>> It seems to me that (1) is going to need some database of encodings
>> already.
>>
>> Additionally, I am not sure that the policy chosen for unassigned
>> codepoints should be the same between Unicode and non-Unicode encodings.
>>
>> Is my analysis reasonable? What are people's thoughts on POSIX locale,
>> ICU, or iconv dependencies from C++ standard libraries as the way to
>> support non-Unicode encodings? Since the specified formatting operation
>> "cannot fail", what is the story when the underlying runtime environment
>> lacks support for the literal encoding (violation of implementation-defined
>> limits due to invalid runtime environment setup)?
>>
>> The alternative (for non-Unicode encodings) seems to be "handle code
>> units that match the encoding of a member of the basic character set,
>> numeric escape everything else".
>>
>> -- HT
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
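
For concreteness, the alternative quoted above ("handle code units that
match the encoding of a member of the basic character set, numeric escape
everything else") could be sketched roughly as follows. This is only an
illustration of the overescaping fallback, not a proposed implementation:
the names is_basic_printable and escape_fallback are hypothetical, the
\x{...} spelling is assumed to follow the escaped-string format used for
invalid code units, and every code unit is treated independently, so
multi-byte structure in the literal encoding is ignored.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: true if `c` equals the ordinary-literal encoding of a
// graphic member of the C++23 basic character set (or space). The double
// quote and backslash are left out because the caller escapes them itself.
inline bool is_basic_printable(char c) {
    constexpr std::string_view basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        " _{}[]#()<>%:;.?*+-/^&|~!=,'";
    return basic.find(c) != std::string_view::npos;
}

// Hypothetical fallback: needs no encoding database at runtime.
// Basic-character-set code units pass through (with the usual simple
// escapes); every other code unit is written as a numeric escape.
inline std::string escape_fallback(std::string_view s) {
    std::string out = "\"";
    for (char c : s) {
        switch (c) {
        case '\t': out += "\\t";  break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        default:
            if (is_basic_printable(c)) {
                out += c;
            } else {
                // Overescape: emit the code unit value in hex.
                char buf[8];
                std::snprintf(buf, sizeof buf, "\\x{%x}",
                              static_cast<unsigned>(
                                  static_cast<unsigned char>(c)));
                out += buf;
            }
        }
    }
    out += '"';
    return out;
}
```

With this sketch, escape_fallback("a\tb") produces "a\tb" with the tab
spelled as a backslash-t escape, and a lone code unit 0xE9 comes out as
\x{e9} whatever the literal encoding happens to be. The cost is that
printable characters outside the basic character set are escaped as well.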

Received on 2023-07-16 15:10:41