sg16

Re: [isocpp-lib-ext] std::environment

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 5 Jan 2023 16:10:04 -0500
On 1/4/23 10:05 PM, Thiago Macieira via SG16 wrote:
> On Monday, 2 January 2023 16:09:24 -03 Tom Honermann via SG16 wrote:
>> Thank you for those links. Reading the documentation for them reinforced
>> my belief that programmers have a need to access raw environment
>> variable values (as in qgetenv()
>> <https://doc.qt.io/qt-6/qtglobal.html#qgetenv>), but without requiring
>> conversion to char or byte-based storage (e.g., raw wchar_t access on
>> Windows) and to access values as text (as in qEnvironmentVariable()
>> <https://doc.qt.io/qt-6/qtglobal.html#qEnvironmentVariable> via
>> conversion to the associated encodings of char, wchar_t, char8_t,
>> char16_t, and char32_t with the understanding that such conversion will
>> be lossy in some cases).
> That's true, but extremely rare. The vast majority of environment variables
> contain file names and other filesystem-adjacent names, which is why I added
> qEnvironmentVariable in the first place: to prevent people from forgetting to
> apply the proper conversion from 8-bit to file-name encoding. Yes, it's
> possible to store random binary data in environment variables, but I don't
> know anyone who does that any more than in file names.
>
> The next category of uses for environment variables, at least inside Qt itself
> and libraries, are simple counter/boolean values.
>
> Anyway, we have a solution already:
>
>> The design used for std::filesystem::path will
>> suffice for both purposes with a minor tweak; we should provide separate
>> interfaces for access to the raw data vs access as text so that the
>> latter can provide valid encoding guarantees. For example, given an
>> environment variable FOO with the value "a\xFF\xFFz" (four bytes long
>> containing the values 'a', 0xFF, 0xFF, 'z') on a POSIX system using
>> UTF-8 for the execution encoding and UTF-32 for the wide execution
>> encoding, access of the value via the following member functions would
>> yield results with the indicated type and value (where encoding
>> conversion is from UTF-8 (the execution encoding) and follows Unicode
>> PR-121 <http://unicode.org/review/pr-121.html> policy 1 for substitution
>> of ill-formed code unit sequences; U+FFFD is the Unicode replacement
>> character).
>>
>> * raw() -> std::span<char> (std::span<wchar_t> on Windows) where the
>> spanned range is "a\xFF\xFFz".
>> * string() -> std::string containing "a\uFFFDz" (UTF-8).
>> * wstring() -> std::wstring containing L"a\uFFFDz" (UTF-32).
>> * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
>> * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
>> * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).
> Agreed for Unix systems. But on Windows, the original string as obtained from
> GetEnvironmentVariableW is u"a\u00ff\u00ffz". It's not ill-formed at all, so
> there should not be any U+FFFD there.
> But it may have unpaired surrogates in its UTF-16 form, so it couldn't be
> encoded in UTF-8 or UTF-32 properly (WTF-8 would work).

Agreed; that is why I stated POSIX specifically, but then I confused
things by mentioning Windows later.
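
To make the POSIX case quoted above concrete, here is a minimal sketch of
the conversion I have in mind for u32string(): decode the raw bytes as
UTF-8 and replace each entire ill-formed subsequence with a single U+FFFD
(PR-121 policy 1). The names decode_one and to_utf32_policy1 are
hypothetical and not part of any proposal, and treating a run of
undecodable bytes as one ill-formed subsequence is my assumption about how
policy 1 would be applied here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: tries to decode one well-formed UTF-8 sequence at the
// start of `s`. Returns the number of bytes consumed, or 0 if ill-formed.
std::size_t decode_one(std::string_view s, char32_t& cp) {
    if (s.empty()) return 0;
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char b0 = byte(0);
    if (b0 < 0x80) { cp = b0; return 1; }

    std::size_t len = 0;
    char32_t value = 0;
    char32_t min = 0;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; value = b0 & 0x07; min = 0x10000; }
    else return 0;  // stray continuation byte or invalid lead byte

    if (s.size() < len) return 0;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((byte(i) & 0xC0) != 0x80) return 0;  // not a continuation byte
        value = (value << 6) | (byte(i) & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (value < min || value > 0x10FFFF) return 0;
    if (value >= 0xD800 && value <= 0xDFFF) return 0;
    cp = value;
    return len;
}

// Hypothetical sketch of the text accessor: every run of bytes that cannot
// be decoded is replaced by exactly one U+FFFD (PR-121 policy 1).
std::u32string to_utf32_policy1(std::string_view raw) {
    std::u32string out;
    bool in_bad_run = false;
    for (std::size_t i = 0; i < raw.size();) {
        char32_t cp = 0;
        if (const std::size_t n = decode_one(raw.substr(i), cp)) {
            out.push_back(cp);
            i += n;
            in_bad_run = false;
        } else {
            if (!in_bad_run) out.push_back(U'\uFFFD');
            in_bad_run = true;
            ++i;  // skip one byte and try again
        }
    }
    return out;
}
```

Passing the four raw bytes of FOO ('a', 0xFF, 0xFF, 'z') to
to_utf32_policy1() produces the three code points 'a', U+FFFD, 'z',
which matches the u32string() result listed above. raw() would simply
hand back the bytes untouched.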

A Windows-specific example follows. Given an environment variable
(retrieved with GetEnvironmentVariableW(), so wchar_t based) containing
a reversed surrogate code point sequence, L"a\xDC00\xD800z", the
resulting values might be as below. For funsies, I'm assigning
Windows-1252 as the associated encoding for char and UCS-2 as the
associated encoding for wchar_t. (Note that if the surrogate code points
were in the correct order, substantially the same results would be
produced: though the surrogate code points are valid code points
in UCS-2, they don't have associated characters, so only the raw() and
wstring() results would differ.)

  * raw() -> std::span<wchar_t> where the spanned range is
    L"a\xDC00\xD800z".
  * string() -> std::string containing "a?z" (Windows-1252).
  * wstring() -> std::wstring containing L"a\xDC00\xD800z" (UCS-2).
  * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
  * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
  * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).
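
As a rough illustration of the u8string() line above, the following sketch
converts UCS-2 code units to UTF-8 and collapses each run of surrogate
code units (which have no associated characters in UCS-2) into a single
U+FFFD. The function names are hypothetical, char16_t stands in for the
Windows wchar_t so that the sketch is portable, and the result is held in
a std::string of UTF-8 bytes rather than a std::u8string so that it also
builds as C++17.

```cpp
#include <string>
#include <string_view>

// True for code units in the surrogate range U+D800..U+DFFF.
constexpr bool is_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDFFF; }

// Hypothetical helper: appends the UTF-8 encoding of a BMP code point.
// The caller guarantees that `cp` is not a surrogate.
void append_utf8_bmp(std::string& out, char16_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical sketch of the text accessor: the input is interpreted as
// UCS-2, so surrogates are never paired up. Each run of surrogate code
// units is replaced by exactly one U+FFFD (PR-121 policy 1).
std::string ucs2_to_utf8_policy1(std::u16string_view raw) {
    std::string out;
    bool in_bad_run = false;
    for (const char16_t u : raw) {
        if (is_surrogate(u)) {
            if (!in_bad_run) append_utf8_bmp(out, 0xFFFD);
            in_bad_run = true;
        } else {
            append_utf8_bmp(out, u);
            in_bad_run = false;
        }
    }
    return out;
}
```

Because the source is interpreted as UCS-2, swapping the two surrogates
into the correct order gives the same UTF-8 output from this sketch. Only
raw() and wstring(), which return the code units unchanged, would show
the difference.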

Tom.

Received on 2023-01-05 21:10:05