Date: Sun, 16 Jul 2023 07:29:06 -0700
Hi Hubert and other Unicoders,
The original intent of the escaping proposal (or rather the escaping part
of the range formatting proposal) was to handle the Unicode case correctly
and leave the behavior for other encodings implementation-defined (QoI). I
think it's reasonable to expect implementations to provide the "ideal"
behavior only for a subset of encodings, with the UTF-N variants being the
required bare minimum. A dependency on ICU or a similar facility for that
subset is probably overkill because escaping requires a fairly small amount
of information, even for the Unicode case. Failures and unsupported
encodings can always be handled by falling back to overescaping.
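
As a rough illustration of the overescaping fallback, here is a minimal
sketch. It is not taken from the proposal. The name `overescape` is
hypothetical, and the code assumes an ASCII-compatible literal encoding in
which bytes in the printable ASCII range never occur as trail bytes of a
multibyte character. Every other code unit gets a numeric escape in a
`\x{...}` form chosen here to resemble the hexadecimal escapes used for
escaped strings.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not proposal text):
// keeps printable ASCII code units, uses short escapes for a few common
// characters, and numerically escapes every other code unit.
std::string overescape(std::string_view s) {
    constexpr char hex[] = "0123456789abcdef";
    std::string out = "\"";
    for (unsigned char c : s) {
        switch (c) {
        case '\t': out += "\\t"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        default:
            if (c >= 0x20 && c < 0x7f) {
                // Printable ASCII: emit unchanged.
                out += static_cast<char>(c);
            } else {
                // Anything else: hex escape without leading zeros.
                out += "\\x{";
                if (c >= 0x10) out += hex[c >> 4];
                out += hex[c & 0xf];
                out += '}';
            }
        }
    }
    out += '"';
    return out;
}
```

Nothing here needs an encoding database: a single byte 0x80, for example,
comes out as `"\x{80}"` whatever the literal encoding is, at the cost of
escaping characters that an encoding-aware implementation could print.
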
> Additionally, I am not sure that the policy chosen for unassigned
> codepoints should be the same between Unicode and non-Unicode encodings.

Could you elaborate?
Cheers,
Victor
On Wed, Jul 12, 2023 at 4:38 PM Hubert Tong via SG16 <sg16_at_[hidden]>
wrote:
> Hi SG 16:
>
> When escaping strings (an operation likely done at runtime), some
> information about the literal encoding (a property of the compilation
> environment) is needed.
>
> For "ideal behaviour", it seems to me that the ability to hardcode/capture
> at compile time/deploy with the runtime is needed for the following:
> 1. Understanding of the encoding scheme (e.g., valid initial code units,
> valid continuation code units, etc.)
> 2. The set of characters considered separators or non-printable characters
>
> It seems to me that (1) is going to need some database of encodings
> already.
>
> Additionally, I am not sure that the policy chosen for unassigned
> codepoints should be the same between Unicode and non-Unicode encodings.
>
> Is my analysis reasonable? What are people's thoughts on POSIX locale,
> ICU, or iconv dependencies from C++ standard libraries as the way to
> support non-Unicode encodings? Since the specified formatting operation
> "cannot fail", what is the story when the underlying runtime environment
> lacks support for the literal encoding (violation of implementation-defined
> limits due to invalid runtime environment setup)?
>
> The alternative (for non-Unicode encodings) seems to be "handle code units
> that match the encoding of a member of the basic character set, numeric
> escape everything else".
>
> -- HT
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2023-07-16 14:29:18