On Monday, 2 January 2023 16:09:24 -03 Tom Honermann via SG16 wrote:

Thank you for those links. Reading the documentation for them reinforced my belief that programmers have a need to access raw environment variable values (as in qgetenv() <https://doc.qt.io/qt-6/qtglobal.html#qgetenv>), but without requiring conversion to char or byte-based storage (e.g., raw wchar_t access on Windows) and to access values as text (as in qEnvironmentVariable() <https://doc.qt.io/qt-6/qtglobal.html#qEnvironmentVariable>) via conversion to the associated encodings of char, wchar_t, char8_t, char16_t, and char32_t (with the understanding that such conversion will be lossy in some cases).

That's true, but extremely rare. The vast majority of environment variables contain file names and other filesystem-adjacent names, which is why I added qEnvironmentVariable in the first place: to prevent people from forgetting to apply the proper conversion from 8-bit to file-name encoding. Yes, it's possible to store random binary data in environment variables, but I don't know anyone who does that any more than in file names. The next category of uses for environment variables, at least inside Qt itself and libraries, is simple counter/boolean values.

Anyway, we have a solution already:

The design used for std::filesystem::path will suffice for both purposes with a minor tweak; we should provide separate interfaces for access to the raw data vs access as text so that the latter can provide valid encoding guarantees.
For example, given an environment variable FOO with the value "a\xFF\xFFz" (four bytes long, containing the values 'a', 0xFF, 0xFF, 'z') on a POSIX system using UTF-8 for the execution encoding and UTF-32 for the wide execution encoding, access of the value via the following member functions would yield results with the indicated type and value. Encoding conversion is from UTF-8 (the execution encoding) and follows Unicode PR-121 <http://unicode.org/review/pr-121.html> policy 1 for substitution of ill-formed code unit sequences; U+FFFD is the Unicode replacement character.

 * raw() -> std::span<char> (std::span<wchar_t> on Windows) where the spanned range is "a\xFF\xFFz".
 * string() -> std::string containing "a\uFFFDz" (UTF-8).
 * wstring() -> std::wstring containing L"a\uFFFDz" (UTF-32).
 * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
 * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
 * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).

Agreed for Unix systems. But on Windows, the original string as obtained from GetEnvironmentVariableW is u"a\u00ff\u00ffz". It's not ill-formed at all, so there should not be any U+FFFD there.
But it may have unpaired surrogates in its UTF-16 form, so it couldn't be encoded in UTF-8 or UTF-32 properly (WTF-8 would work).
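
To make the unpaired-surrogate point concrete, here is a minimal C++17 sketch of converting a UTF-16 value to 8-bit storage under two policies: substituting U+FFFD for each lone surrogate (lossy, but well-formed UTF-8), or encoding the lone surrogate as-is (WTF-8, which round-trips). The names append_utf8, utf16_to_utf8, and surrogate_policy are hypothetical, chosen for illustration only; they are not part of Qt or of any proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: append the generalized UTF-8 encoding of cp.
// Surrogate code points are encoded like any other 3-byte value,
// which is what WTF-8 permits and strict UTF-8 forbids.
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical policy switch for lone (unpaired) surrogates.
enum class surrogate_policy { replace, wtf8 };

inline std::string utf16_to_utf8(std::u16string_view in, surrogate_policy policy) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if ((high || low) && policy == surrogate_policy::replace) {
            // Lone surrogate: substitute the replacement character.
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this sketch, u"a\u00ff\u00ffz" converts without any substitution under either policy, since it contains no surrogates. A value such as u"a\xDC00\xD800z" yields two U+FFFD characters under surrogate_policy::replace, while surrogate_policy::wtf8 preserves both code units as the byte sequences ED B0 80 and ED A0 80.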
Agreed; that is why I stated POSIX specifically, but then I confused things by mentioning Windows later.
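
For the POSIX case, the text conversion described earlier in the thread could look roughly like the following C++17 sketch. It decodes a raw byte value as UTF-8 and collapses each run of ill-formed code units into a single U+FFFD, which is my reading of PR-121 policy 1. The names decode_one and decode_utf8_lossy are hypothetical; the sketch only illustrates what a u32string()-style accessor might compute, not how any actual interface is specified.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: try to decode one well-formed UTF-8 sequence at in[i].
// Returns the sequence length (1 to 4) and stores the code point in cp,
// or returns 0 if the bytes at in[i] do not start a well-formed sequence.
inline std::size_t decode_one(std::string_view in, std::size_t i, char32_t& cp) {
    const auto byte = [&](std::size_t k) { return static_cast<unsigned char>(in[k]); };
    const unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t min_cp = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; min_cp = 0x80; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; min_cp = 0x800; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; min_cp = 0x10000; cp = b0 & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte (0xF8..0xFF)
    if (i + len > in.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = byte(i + k);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Decode UTF-8, emitting one U+FFFD per run of ill-formed code units.
inline std::u32string decode_utf8_lossy(std::string_view in) {
    std::u32string out;
    bool in_bad_run = false;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        if (const std::size_t n = decode_one(in, i, cp)) {
            out += cp;
            in_bad_run = false;
            i += n;
        } else {
            if (!in_bad_run) out += U'\uFFFD';  // first bad unit of the run
            in_bad_run = true;
            ++i;  // skip one code unit and retry
        }
    }
    return out;
}
```

Given the FOO value "a\xFF\xFFz", decode_utf8_lossy produces U"a\uFFFDz", matching the u32string() result listed in the example. The raw bytes stay available separately, which is the reason for keeping raw access and text access as distinct interfaces.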
A Windows-specific example follows. Given an environment variable
(retrieved with GetEnvironmentVariableW(), so wchar_t based) containing a
reversed surrogate code point sequence, L"a\xDC00\xD800z", the resulting
values might be as below. For funsies, I'm assigning Windows-1252 as the
associated encoding for char and UCS-2 as the associated encoding for
wchar_t. (Note that if the surrogate code points were in the correct
order, substantially the same results would be produced since, though the
surrogate code points are valid code points in UCS-2, they don't have
associated characters; only the raw() and wstring() results would