On 1/4/23 10:05 PM, Thiago Macieira via SG16 wrote:

On Monday, 2 January 2023 16:09:24 -03 Tom Honermann via SG16 wrote:

Thank you for those links. Reading the documentation for them reinforced
my belief that programmers have a need to access raw environment
variable values (as in qgetenv()
<https://doc.qt.io/qt-6/qtglobal.html#qgetenv>), but without requiring
conversion to char or byte-based storage (e.g., raw wchar_t access on
Windows) and to access values as text (as in qEnvironmentVariable()
<https://doc.qt.io/qt-6/qtglobal.html#qEnvironmentVariable>) via
conversion to the associated encodings of char, wchar_t, char8_t,
char16_t, and char32_t, with the understanding that such conversion will
be lossy in some cases.

That's true, but extremely rare. The vast majority of environment variables 
contain file names and other filesystem-adjacent names, which is why I added 
qEnvironmentVariable in the first place: to prevent people from forgetting to 
apply the proper conversion from 8-bit to file-name encoding. Yes, it's 
possible to store random binary data in environment variables, but I don't 
know anyone who does that any more than in file names.

The next category of uses for environment variables, at least inside Qt itself 
and libraries, is simple counter/boolean values.

Anyway, we have a solution already:

The design used for std::filesystem::path will
suffice for both purposes with a minor tweak; we should provide separate
interfaces for access to the raw data vs access as text so that the
latter can provide valid encoding guarantees. For example, given an
environment variable FOO with the value "a\xFF\xFFz" (four bytes long
containing the values 'a', 0xFF, 0xFF, 'z') on a POSIX system using
UTF-8 for the execution encoding and UTF-32 for the wide execution
encoding, access of the value via the following member functions would
yield results with the indicated type and value (where encoding
conversion is from UTF-8 (the execution encoding) and follows Unicode
PR-121 <http://unicode.org/review/pr-121.html> policy 1 for substitution
of ill-formed code unit sequences; U+FFFD is the Unicode replacement
character).

  * raw() -> std::span<char> (std::span<wchar_t> on Windows) where the
    spanned range is "a\xFF\xFFz".
  * string() -> std::string containing "a\uFFFDz" (UTF-8).
  * wstring() -> std::wstring containing L"a\uFFFDz" (UTF-32).
  * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
  * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
  * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).
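
The substitution behaviour in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: decode_utf8_lossy is a hypothetical helper, not a proposed interface, and it collapses each run of ill-formed bytes into a single U+FFFD so that it reproduces the "a\uFFFDz" result shown for u32string().

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of any proposal): decodes UTF-8 into UTF-32.
// Assumption: each run of ill-formed bytes becomes a single U+FFFD.
std::u32string decode_utf8_lossy(std::string_view in) {
    std::u32string out;
    bool in_bad_run = false;  // true while we are inside an ill-formed run
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t min = 0;     // smallest code point allowed for this length
        std::size_t len = 0;  // 0 means "invalid lead byte"
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, out-of-range values, and surrogates.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
            i += len;
            in_bad_run = false;
        } else {
            if (!in_bad_run) out.push_back(U'\uFFFD');
            in_bad_run = true;
            ++i;  // resynchronise one byte at a time
        }
    }
    return out;
}
```

With the FOO value from the example, decode_utf8_lossy("a\xFF\xFFz") yields three code points: 'a', U+FFFD, 'z'. A real implementation would also have to choose how finely to split ill-formed runs; the sketch takes the coarsest option.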

Agreed for Unix systems. But on Windows, the original string as obtained from 
GetEnvironmentVariableW is u"a\u00ff\u00ffz". It's not ill-formed at all, so 
there should not be any U+FFFD there.

But it may have unpaired surrogates in its UTF-16 form, so it couldn't be 
encoded in UTF-8 or UTF-32 properly (WTF-8 would work).
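
For illustration, the unpaired-surrogate condition can be checked as follows. This is a minimal sketch: has_unpaired_surrogate is a hypothetical name, and char16_t stands in for the wchar_t data that GetEnvironmentVariableW would return, so that the code is not tied to Windows.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: reports whether a UTF-16 code unit sequence contains
// a surrogate that is not part of a well-formed high/low pair.
bool has_unpaired_surrogate(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        bool low  = u >= 0xDC00 && u <= 0xDFFF;
        if (high) {
            bool paired = i + 1 < s.size()
                       && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
            if (!paired) return true;  // high surrogate without a low one
            ++i;                       // skip the low half of the pair
        } else if (low) {
            return true;               // low surrogate with no preceding high
        }
    }
    return false;
}
```

A value for which this returns true has no well-formed UTF-8 or UTF-32 representation, so a text accessor would have to substitute or fail, while a raw accessor could still return the code units unchanged.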

Agreed; that is why I stated POSIX specifically, but then I confused things by mentioning Windows later.

A Windows-specific example follows. Given an environment variable (retrieved with GetEnvironmentVariableW(), so wchar_t based) containing a reversed surrogate code point sequence, L"a\xDC00\xD800z", the resulting values might be as below. For funsies, I'm assigning Windows-1252 as the associated encoding for char and UCS-2 as the associated encoding for wchar_t. (Note that if the surrogate code points were in the correct order, substantially the same results would be produced: though the surrogate code points are valid code points in UCS-2, they don't have associated characters, so only the raw() and wstring() results would differ.)

  * raw() -> std::span<wchar_t> where the spanned range is
    L"a\xDC00\xD800z".
  * string() -> std::string containing "a?z" (Windows-1252).
  * wstring() -> std::wstring containing L"a\xDC00\xD800z" (UCS-2).
  * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
  * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
  * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).
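
A rough sketch of the u32string() case above, under the same assumptions as the example: the source is treated as UCS-2, surrogate code points have no associated character, and a run of them is replaced by a single U+FFFD. The function name ucs2_to_utf32_lossy is made up for illustration, and char16_t stands in for Windows wchar_t so the code builds anywhere.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: converts UCS-2 code units to UTF-32.
// Assumption: surrogate code points carry no character in UCS-2, so each
// run of them is replaced by one U+FFFD, regardless of their order.
std::u32string ucs2_to_utf32_lossy(std::u16string_view in) {
    std::u32string out;
    bool in_run = false;  // true while inside a run of surrogates
    for (char16_t u : in) {
        bool surrogate = u >= 0xD800 && u <= 0xDFFF;
        if (!surrogate) {
            out.push_back(static_cast<char32_t>(u));
            in_run = false;
        } else if (!in_run) {
            out.push_back(U'\uFFFD');
            in_run = true;
        }
    }
    return out;
}
```

Both the reversed sequence (0xDC00, 0xD800) and the correctly ordered one (0xD800, 0xDC00) come out as 'a', U+FFFD, 'z' here, which matches the remark that the ordering makes no difference to the converted results when wchar_t is associated with UCS-2.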

Tom.