Date: Wed, 6 Oct 2021 18:34:46 +0200
On Wed, Oct 6, 2021 at 5:24 PM Tom Honermann <tom_at_[hidden]> wrote:
> On 10/6/21 11:05 AM, Corentin Jabot wrote:
>
>
>
> On Wed, Oct 6, 2021 at 4:53 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 06/10/2021 16.42, Corentin Jabot wrote:
>> >
>> >
>> > On Wed, Oct 6, 2021 at 4:02 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> >
>>
>> > I'm trying to understand how the IANA table, the specific values in
>> that table,
>> > the encodings those values represent, the use of "encoding form"
>> vs. "encoding
>> > scheme", and the use of integers (not octets) to initialize
>> wchar_t's all fit
>> > together. So far, there is friction that we need to resolve, in my
>> view.
>> >
>> >
>> > There is wording that Hubert asks for that says that how these things
>> relate is implementation defined.
>>
>> And I think that's not helpful for portable code.
>>
>> > A non-hostile implementation would return a registered encoding that
>> has a code unit size of CHAR_BIT for narrow functions, and a registered
>> encoding that has a code unit size of sizeof(wchar_t) for wide functions
>> (if it exists). The byte order of wide string literals is platform specific
>> and P1885 has no bearing on that. P1885 also does not affect how wchar_t
>> represents values.
>> > IANA does not specify a byte order in the general case (merely that
>> there is one), so we are not running afoul of anything.
>> > And "encoding form" vs. "encoding scheme" is Unicode specific.
>>
>> The question of "encoding form" vs. "encoding scheme" arises for any
>> wchar_t encoding in the context of the IANA table, but there appear
>> to be very few encodings specified as integers as opposed to
>> sequences of bytes.
>>
>
> More like 0
>
>>
>> I'm curious how wchar_t is treated in a non-Unicode situation.
>> Even something like Big5 https://en.wikipedia.org/wiki/Big5
>> appears to be byte-based, not integer-based:
>>
>> First byte ("lead byte") 0x81 to 0xfe (or 0xa1 to 0xf9 for
>> non-user-defined characters)
>> Second byte 0x40 to 0x7e, 0xa1 to 0xfe
>
>
>> So, it seems to be a multibyte encoding, not a wide one.
>>
>
> Sure, because it predates Unicode terminology. But the concept is the same.
> A code unit is still 2 bytes; these things cannot be split further.
> There is no character in Big5 that is encoded as a single byte.
>
> A UTF-16 code unit is also 2 bytes.
>
> I disagree with that, at least in general. A UTF-16 code unit fits in a
> single byte when CHAR_BIT is >= 16.
>
Sure? Octet.
> wchar_t is suitable to represent any encoding that represents a character
> in N bytes (or a sequence of N bytes), for N = sizeof(wchar_t)/CHAR_BIT
>
> Once we lift the restriction in [basic.fundamental]p8
> <http://eel.is/c++draft/basic.fundamental#8>, yes.
>
>
>
>>
>> > The C++ specification and implementations produce and have expectations
>> about strings.
>> > If the strings produced or the expectations match the description of a
>> given existing known encoding, then this encoding is suitable to label the
>> strings and expectations of the C++ program, otherwise it isn't.
>> > I'm really struggling to see where the contention is here.
>>
>> The contention is that [lex.string] initializes wchar_t's with
>> (potentially large) integer values (which I understand to be
>> "encoding forms" in Unicode parlance), but the RFC accompanying
>> the IANA table says the encodings described there are octet-based
>> encodings, which I understand to be "encoding schemes" in
>> Unicode parlance.
>>
>
> Does the wording suggested by Hubert (of specifying we are talking about
> object representation) address your concern?
> We are talking about initialized strings, not what they have been
> initialized with.
>
> I think the distinction between object representation and sequence of
> string elements remains a point of contention. Resolving this will be a
> goal of our meeting today.
>
Please keep in mind that iconv and other interfaces, like QTextDecoder,
always convert between sequences of bytes. If that is a use case we think
is important, then caring about the value is not enough, and we want to
discourage 0 padding.
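
To make the byte-sequence point concrete, here is a minimal sketch, not
proposed wording or API. The helper name object_bytes is made up for
illustration. It copies the object representation of a wide string into a
buffer of bytes, which is roughly what a byte-oriented converter is handed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of P1885 or any library): returns the
// object representation of the first n elements of a wide string as bytes.
std::vector<unsigned char> object_bytes(const wchar_t* s, std::size_t n) {
    assert(s != nullptr);
    std::vector<unsigned char> out(n * sizeof(wchar_t));
    if (!out.empty()) {
        // Copies the bytes exactly as stored, so the order of the bytes
        // within each element reflects the platform's byte order.
        std::memcpy(out.data(), s, out.size());
    }
    return out;
}
```

For L"A" the element value is 0x41 everywhere, but with a 4-byte wchar_t
the bytes would be 41 00 00 00 on a little-endian platform and 00 00 00 41
on a big-endian one. A label that only describes values does not tell a
byte-oriented interface which of the two it is looking at.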
> Tom.
>
Received on 2021-10-06 11:35:28