Date: Thu, 05 Jan 2023 00:05:00 -0300
On Monday, 2 January 2023 16:09:24 -03 Tom Honermann via SG16 wrote:
> Thank you for those links. Reading the documentation for them reinforced
> my belief that programmers have a need to access raw environment
> variable values (as in qgetenv()
> <https://doc.qt.io/qt-6/qtglobal.html#qgetenv>), but without requiring
> conversion to char or byte-based storage (e.g., raw wchar_t access on
> Windows) and to access values as text (as in qEnvironmentVariable()
> <https://doc.qt.io/qt-6/qtglobal.html#qEnvironmentVariable> via
> conversion to the associated encodings of char, wchar_t, char8_t,
> char16_t, and char32_t with the understanding that such conversion will
> be lossy in some cases).
That's true, but extremely rare. The vast majority of environment variables
contain file names and other filesystem-adjacent names, which is why I added
qEnvironmentVariable in the first place: to prevent people from forgetting to
apply the proper conversion from 8-bit to file-name encoding. Yes, it's
possible to store random binary data in environment variables, but I don't
know anyone who does that any more than in file names.
The next category of uses for environment variables, at least inside Qt itself
and libraries, is simple counter/boolean values.
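For the counter/boolean case, a minimal sketch in plain C++ might look like
the following. This is not Qt's implementation; the helper names
parse_env_int and env_int are made up for illustration, and treating unset,
empty, or non-numeric values as "no value" is an assumption of the sketch.

```cpp
#include <charconv>
#include <cstdlib>
#include <cstring>
#include <optional>
#include <system_error>

// Hypothetical helper: parse a raw environment value as a base-10 int.
// Returns std::nullopt for null, empty, non-numeric, or out-of-range input.
std::optional<int> parse_env_int(const char *value)
{
    if (!value || !*value)
        return std::nullopt;
    int result = 0;
    const char *end = value + std::strlen(value);
    auto [ptr, ec] = std::from_chars(value, end, result);
    if (ec != std::errc() || ptr != end)
        return std::nullopt;    // trailing garbage or overflow
    return result;
}

// Hypothetical helper: look the variable up and parse it.
std::optional<int> env_int(const char *name)
{
    return parse_env_int(std::getenv(name));
}
```

A boolean flag could then be read as env_int("FOO").value_or(0) != 0.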
Anyway, we have a solution already:
> The design used for std::filesystem::path will
> suffice for both purposes with a minor tweak; we should provide separate
> interfaces for access to the raw data vs access as text so that the
> latter can provide valid encoding guarantees. For example, given an
> environment variable FOO with the value "a\xFF\xFFz" (four bytes long
> containing the values 'a', 0xFF, 0xFF, 'z') on a POSIX system using
> UTF-8 for the execution encoding and UTF-32 for the wide execution
> encoding, access of the value via the following member functions would
> yield results with the indicated type and value (where encoding
> conversion is from UTF-8 (the execution encoding) and follows Unicode
> PR-121 <http://unicode.org/review/pr-121.html> policy 1 for substitution
> of ill-formed code unit sequences; U+FFFD is the Unicode replacement
> character).
>
> * raw() -> std::span<char> (std::span<wchar_t> on Windows) where the
> spanned range is "a\xFF\xFFz".
> * string() -> std::string containing "a\uFFFDz" (UTF-8).
> * wstring() -> std::wstring containing L"a\uFFFDz" (UTF-32).
> * u8string() -> std::u8string containing u8"a\uFFFDz" (UTF-8).
> * u16string() -> std::u16string containing u"a\uFFFDz" (UTF-16).
> * u32string() -> std::u32string containing U"a\uFFFDz" (UTF-32).
Agreed for Unix systems. But on Windows, the original string as obtained from
GetEnvironmentVariableW is u"a\u00ff\u00ffz". It's not ill-formed at all, so
there should not be any U+FFFD there.
But it may have unpaired surrogates in its UTF-16 form, so it couldn't be
encoded in UTF-8 or UTF-32 properly (WTF-8 would work).
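To make the Windows case concrete, here is a minimal sketch of checking a
UTF-16 value for unpaired surrogates and encoding it as WTF-8, that is,
generalized UTF-8 in which a lone surrogate gets a 3-byte sequence. The
helper names are hypothetical and purely illustrative, not a proposed API.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helpers, for illustration only.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// True if the UTF-16 input contains a surrogate that is not part of a pair.
bool has_unpaired_surrogate(std::u16string_view in)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_high_surrogate(in[i])) {
            if (i + 1 < in.size() && is_low_surrogate(in[i + 1]))
                ++i;            // well-formed pair, skip the low half
            else
                return true;    // high surrogate without a low one
        } else if (is_low_surrogate(in[i])) {
            return true;        // low surrogate without a preceding high one
        }
    }
    return false;
}

// Encode possibly ill-formed UTF-16 as WTF-8: valid pairs become 4-byte
// sequences; lone surrogates become 3-byte sequences, which strict UTF-8
// forbids.
std::string to_wtf8(std::u16string_view in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (is_high_surrogate(in[i]) && i + 1 < in.size()
                && is_low_surrogate(in[i + 1])) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, u"a\u00ff\u00ffz" has no unpaired surrogates and encodes to
the six bytes 'a', 0xC3, 0xBF, 0xC3, 0xBF, 'z', with no U+FFFD involved. A
value containing a lone 0xD800 still round-trips through WTF-8, but the
result is not valid UTF-8.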
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Software Architect - Intel DCAI Cloud Engineering
Received on 2023-01-05 03:05:59