Re: [SG16] P1885 polling

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 23 Sep 2021 13:04:33 +0200
On Thu, Sep 23, 2021 at 12:17 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 23/09/2021 11.55, Corentin wrote:
> >
> >
> > On Thu, Sep 23, 2021 at 8:21 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 23/09/2021 07.21, Corentin via SG16 wrote:
> > > I would like to know if you have sustained objections such that
> you do not want to see this paper polled, because that's currently not
> clear to me.
> >
> > At this time (thanks Hubert for the digging), I think the normative
> wording
> > is sufficiently unclear in its intent that I'm strongly opposed to
> forwarding
> > this paper to LWG.
> >
> > Maybe some notes or "Recommended practice" sections would help
> convey what
> > we want implementations to do, if we can't describe that normatively
> with
> > sufficient precision.
> >
> >
> > Recommended practice: Implementations should return a value that
> represents an
> > encoding whose code unit size matches the size of a single wchar_t.
>
> But, apparently there are very few of those.
>
> If we want to stick to the IANA table, would it be a better direction to
> say that "EBCDIC-US" can be used as both a narrow and wide encoding, with
> the understanding that the wide encoding is (trivially) established from
> the (specified) narrow encoding by taking the (unsigned) numerical values
> of the narrow encoding and using those as the values of the wide encoding?
> (We could even say so in the normative wording.)
>

It's two different encodings, they should have different names.
Given a blob of memory, and an encoding name, if you can't reasonably
interpret that blob to a sequence of characters, the whole endeavor proves
meaningless.
Given the string "aaa", the encoded forms "\x61\x61\x61" and
"\x00\x61\x00\x61\x00\x61" are clearly not the same thing to me.
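
A minimal sketch of that difference, assuming 16-bit big-endian code units
for the wide form purely for illustration. The helper names are hypothetical
and are not part of P1885 or any library:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: the bytes of the narrow form, one byte per code unit.
std::vector<std::uint8_t> narrow_bytes(const std::string& narrow) {
    return std::vector<std::uint8_t>(narrow.begin(), narrow.end());
}

// Hypothetical helper: the same code unit values stored as 16-bit
// big-endian code units (an assumption made for this sketch only).
std::vector<std::uint8_t> widen_to_16bit_be(const std::string& narrow) {
    std::vector<std::uint8_t> out;
    out.reserve(narrow.size() * 2);
    for (unsigned char c : narrow) {
        out.push_back(0x00);  // high byte of the 16-bit code unit
        out.push_back(c);     // low byte carries the narrow value
    }
    return out;
}
```

The two byte sequences produced for "\x61\x61\x61" differ, so a consumer that
is handed only a blob and a single shared name cannot tell which layout it
is looking at.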

This is important given that for example iconv has no wide interface, and
would have no way to distinguish the two cases if the names are identical.
That the industry decided to have few 16+ bit encodings should not
encourage us to make more out of thin air.

However, if we want the standard to say that

"if the value returned by wide_environment() matches an encoding E whose
associated code unit type is char, then it represents a similar encoding
with code units of type wchar_t encoding the same sequence of numeric values
as E"

I guess we can?
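
A rough sketch of what that sentence would mean in practice, assuming the
wide form simply reuses the unsigned numeric value of each narrow code unit.
The function name derive_wide is hypothetical and does not appear in the
paper:

```cpp
#include <string>

// Hypothetical illustration: build the wchar_t sequence that carries the
// same numeric values as the code units of the narrow encoding E.
std::wstring derive_wide(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        // Go through unsigned char first so values above 0x7F are not
        // sign-extended when char is a signed type.
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return wide;
}
```

Under that reading, the result has one wchar_t per narrow code unit, with no
transcoding step in between.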

> All the query functions are named so that narrow/wide is differentiated;
> the mib id numbers would not represent that differentiation. That seems
> an ok trade-off to me.
>

> > > If so, I would like to know what direction you would like this
> paper to take.
> > >
> > > * We already made the wording as wide as possible, because it was
> always the intent of this paper to be on a best effort basis (I do not
> think a perfect solution can be found). I do believe the wording matches
> the intent sufficiently, please let me know if you think that's still not
> the case.
> >
> > See my separate e-mail. I can't divine the intent from the wording
> right now.
> >
> > > * We can remove wide methods. I'd argue that, at the very least,
> it's still useful for users to distinguish the few known and well-paved
> scenarios from everything else such that, for example, if a user expects
> utf-32 on posix they can check for that. Returning something like
> "x-ISO8859-1" is also useful on introspection, even if by definition this
> is very much non-portable.
> >
> > I'd guess that Hubert has situations where some wide-EBCDIC encoding
> is used.
> > Also, it feels asymmetric to talk about just narrow encodings, but
> not wide
> > ones.
> >
> >
> > Agreed.
> > But I'd rather... find a way to move forward?
> >
> >
> > > * We can stop pursuing this paper.
> >
> > * We can divorce ourselves from the obviously broken IANA registry
> > (possibly just rely on their "character set definitions", but not
> claim
> > those are actual encoding designations)
> >
> >
> > "Obviously broken" is a rather big claim in the absence of suitable
> alternatives.
> > I believe the use of the IANA registry is motivated by the paper and
> previous polls.
>
> I think the polls are not sufficiently precise to argue for the case
> that what the IANA table describes (by implication) as a narrow encoding
> cannot be re-used to designate a wide encoding trivially derived from the
> narrow encoding.
>

We could also "trivially" apply a Caesar cipher to utf-8, but it would not
be utf-8!

>
> > I did however modify the paper to use more correct terminology and added
> a note to explain that our terminology differs, which will hopefully avoid
> confusion
>
> Good.
>
> Jens
>
>

Received on 2021-09-23 06:04:46