Date: Wed, 3 Jan 2024 00:17:33 -0500
During the 2023-11-29 SG16 meeting, Mateusz requested a list of items to
work on or consider before SG16 resumes review of D3045R0 (Quantities
and units library)
<https://mpusz.github.io/wg21-papers/papers/3045R0_quantities_and_units_library.html>
during the planned 2024-01-24 SG16 meeting.
The following are my personal (non-chair) thoughts regarding changes
that I think are needed. Some of these were mentioned during the prior
review and are reflected in the 2023-11-29 SG16 meeting summary
<https://github.com/sg16-unicode/sg16-meetings#november-29th-2023>.
The encoding of ordinary string literals is implementation-defined
([lex.charset]p8 <http://eel.is/c++draft/lex.charset#8> and p9
<http://eel.is/c++draft/lex.charset#9>). The only characters that are
guaranteed by the standard to have representation in the /ordinary
literal encoding/ are those included in the /basic literal character
set/ ([lex.charset]p2 <http://eel.is/c++draft/lex.charset#2> and p7
<http://eel.is/c++draft/lex.charset#7>). This excludes some of the
characters that appear in the paper in string literals passed as
template arguments to named_unit, base_dimension, and prefixed_unit:
'°', 'Θ', 'Ω', 'µ', and 'Δ'. In practice, implementations support use of
encodings like Windows-1252, ISO-8859-1, and EBCDIC as the ordinary
literal encoding (by default in some cases) and these encodings are not
able to represent these characters. Note that use of a
/universal-character-name/ (e.g., "\uXXXX"; [lex.charset]p4
<http://eel.is/c++draft/lex.charset#7>) does not work around encoding
limitations. Implementors are not in a position to force a change of
encoding on their users. That leaves us with the following options:
1. Specify the library as conditionally-supported and only available
when the ordinary literal encoding is a UTF encoding.
2. Change the library to use UTF-8/16/32 string literals and
corresponding char/N/_t types instead of ordinary string literals.
I would be strongly against the first option because it would 1) prevent
use of the library on some platforms, and 2) add adoption friction for
existing code (by requiring migration to UTF-8 as the ordinary literal
encoding) that is otherwise unnecessary.
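As a rough illustration of why the second option sidesteps the problem: the code units of a UTF-8/16/32 literal are fixed by the encoding prefix and do not depend on the ordinary literal encoding. The sketch below uses 'Ω' (U+03A9); the names utf8_unit, ohm_utf16, and ohm_utf32 are made up for this example and are not from the paper.

```cpp
#include <cassert>
#include <cstddef>

// u8"..." has type const char[] before C++20 and const char8_t[] from
// C++20 on; casting through unsigned char keeps this portable to both.
constexpr unsigned utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u03A9"[i]);
}

// UTF-16 and UTF-32 literals for the same character.
constexpr char16_t ohm_utf16[] = u"\u03A9";
constexpr char32_t ohm_utf32[] = U"\u03A9";

// These hold on every implementation, whatever the ordinary literal
// encoding is (UTF-8, Windows-1252, EBCDIC, ...).
static_assert(sizeof(u8"\u03A9") == 3, "two UTF-8 code units plus the terminator");
static_assert(utf8_unit(0) == 0xCE, "first UTF-8 code unit of U+03A9");
static_assert(utf8_unit(1) == 0xA9, "second UTF-8 code unit of U+03A9");
static_assert(ohm_utf16[0] == 0x03A9, "single UTF-16 code unit");
static_assert(ohm_utf32[0] == 0x03A9, "single UTF-32 code unit");
```

No comparable guarantee exists for the ordinary literal "\u03A9", whose code units (if the character is representable at all) depend on the ordinary literal encoding.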
Given that it has been acknowledged that some users of the library will
be unable to take advantage of non-ASCII characters, I think the
proposed design is right to allow a preferred symbol (one that
potentially uses characters outside the basic literal character set) to
be associated with a given unit with a fallback mechanism in place for
when the preferred symbol can't be used. The only change needed is to
switch the kind of literal used to specify the preferred symbol to one
of the UTF encodings (I don't think it matters which one; allowing the
use of any of them would be ideal) and to retain the ability to specify
an explicit transliteration for use with char and wchar_t-based
interfaces when the preferred symbol uses characters outside the basic
literal character set.
Since the /wide literal encoding/ is also not guaranteed to have
representation for characters outside the basic literal character set, I
think the library should require an explicit transliteration for use
with both char and wchar_t to ensure portability. Perhaps
implementations can relax that requirement when it can be statically
guaranteed that the preferred symbol (specified in a UTF encoding) can
be losslessly transcoded to the associated literal encoding. On the other
hand, perhaps it should be required that fallback symbols be restricted
to characters from the basic literal character set.
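A minimal sketch of what such a pairing might look like, assuming a preferred symbol given as a UTF-32 literal and a mandatory fallback checked at compile time against the basic literal character set. The names is_basic_char, is_basic_text, and symbol_text are hypothetical and not the paper's API, and the character list is an approximation of [lex.charset]p2 and p7 (it includes '@', '$', and '`', which were only added to the basic character set recently).

```cpp
#include <cassert>
#include <string_view>

// Approximate membership test for the basic literal character set.
// Ordinary character literals are used (not numeric values) so the test
// does not assume an ASCII-based ordinary literal encoding.
constexpr bool is_basic_char(char c) {
    constexpr std::string_view basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        " \t\v\f\n\a\b\r"
        "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
    return basic.find(c) != std::string_view::npos;
}

constexpr bool is_basic_text(std::string_view s) {
    for (char c : s) {
        if (!is_basic_char(c)) return false;
    }
    return true;
}

// Hypothetical symbol holder: preferred text in a UTF encoding plus an
// explicit transliteration for char and wchar_t-based interfaces.
struct symbol_text {
    std::u32string_view preferred;
    std::string_view fallback;
};

inline constexpr symbol_text ohm_symbol{U"\u03A9", "ohm"};
inline constexpr symbol_text micro_symbol{U"\u00B5", "u"};

// Enforce the restriction on fallback symbols at compile time.
static_assert(is_basic_text(ohm_symbol.fallback), "fallback must be basic");
static_assert(is_basic_text(micro_symbol.fallback), "fallback must be basic");
```

Whether the check lives in a constructor, a concept, or wording alone is a design question for the paper; the sketch only shows that the restriction can be verified statically.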
There are two situations in which the fallback symbols will be used:
1. When the preferred symbols can't be used (because they lack
representation in the required encoding).
2. When the preferred symbols could be used (because they are
represented in the required encoding) but are not desired (e.g.,
when ASCII-only output is desired).
I think the proposal (and eventual wording) should be clear about these
two situations and when each applies. The proposed units-text-encoding
term for std::format allows for the second case but interacts with the
first in a way that I think needs to be clarified. I think the behavior
that we want is for the preferred symbol to be used by default when the
associated literal encoding is a UTF encoding and for the fallback
symbol to be the default otherwise. Perhaps it should also be an error
to use "U" for that term when the preferred symbol can't be used (due to
lack of representation in the required encoding). Actually, I don't
think the "U" option is useful if the default is to use the preferred
symbol when possible. I think this can be simplified to just a single
option to opt-in to the fallback symbol; perhaps 'b' for "basic literal
character set".
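The default described above, together with a single opt-in to the fallback symbol, can be sketched as a small selection function. Everything here is an assumption made for illustration: symbol_kind and select_symbol are invented names, 'b' is only the option letter floated above, and how an implementation determines that the associated literal encoding is a UTF encoding is left as an input (C++26's std::text_encoding::literal() might be one way to obtain it).

```cpp
#include <cassert>

enum class symbol_kind { preferred, fallback };

// literal_encoding_is_utf: whether the literal encoding associated with
//   the output character type is a UTF encoding (supplied by the caller).
// option: the hypothetical format option, 'b' to request the fallback
//   symbol, or '\0' when no option was given.
constexpr symbol_kind select_symbol(bool literal_encoding_is_utf, char option) {
    if (option == 'b') {
        // Situation 2: the preferred symbol may be usable but is not desired.
        return symbol_kind::fallback;
    }
    // Situation 1: without a UTF literal encoding the preferred symbol
    // cannot be assumed to have representation, so fall back by default.
    return literal_encoding_is_utf ? symbol_kind::preferred
                                   : symbol_kind::fallback;
}

static_assert(select_symbol(true, '\0') == symbol_kind::preferred, "");
static_assert(select_symbol(true, 'b') == symbol_kind::fallback, "");
static_assert(select_symbol(false, '\0') == symbol_kind::fallback, "");
```

With this shape there is no state in which a "U"-style option changes the outcome, which is the reason a separate option for it looks redundant.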
The paper distinguishes between Unicode and ASCII, but that isn't the
right distinction given the existence of EBCDIC-based platforms and
implementations. I think the paper should be amended to distinguish
between the basic character sets and Unicode as I've tried to do above.
ASCII-only scenarios can still be used as examples and for motivation
though.
Finally, it would be helpful to have an official "P" paper to reference
at least a week before the 2024-01-24 SG16 meeting so that discussion
and minutes can refer to stable section numbers and names.
Tom.
Received on 2024-01-03 05:17:40