sg16: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 13 Aug 2019 23:56:46 +0200

On Tue, Aug 13, 2019, 11:35 PM <keld_at_[hidden]> wrote:

> On Tue, Aug 13, 2019 at 10:49:09PM +0200, Corentin wrote:
> > On Tue, Aug 13, 2019, 10:34 PM <keld_at_[hidden]> wrote:
> >
> > > For most programs there is no default execution character set nor
> > > default execution encoding. A binary program is designed to run with
> > > the run-time execution character set of the locale it runs with. So the
> > > same binary program can run with a Japanese encoding or a Danish
> > > encoding or an Arabic encoding.
> > > There is no knowledge at compilation time of what encoding will be used
> > > at run time.
> > >
> >
> > The standard assumes there is one. It has to. You cannot not have an
> > encoding.
> > (Of course it is broken but it's a very old assumption).
>
> that encoding is then probably the same as the compile time encoding.
>

You have no control over that; no one does. The compiler will select an
encoding for literals. Then these literals might be interpreted by, e.g.,
iostream using an encoding derived from the system locale ( :s). Either they
are the same or they are not, in which case...

The compiler cannot know what the encoding at execution will be, and the
information about which encoding the compiler chose is not stored.

>
> > Also there is no such thing as a Danish encoding or a Japanese encoding.
> > There is a Danish locale and an encoding attached to that locale (utf8,
> > iso 8859). (The standard doesn't always make the distinction - it should.)
>
> well, there are Danish encodings and Japanese encodings - multiple
> encodings suitable for Danish or Japanese, and the specific encoding is,
> as you wrote, attached to the locale
>

Let's say you have a neighbor called bjørn.
Is "I am going to see bjørn" not English?
Restricting a language to a limited character set does not match
reality. Ergo the idea that a given character set is suitable for a locale
is a bit bogus. Encodings are attached to a character set. And non-Unicode
systems tend to conflate everything. Doesn't make it sensible!

> > But yeah, all of that precludes people from having non-ASCII in their
> > source, as ASCII is currently the only thing that will work portably.
>
> well, we worked hard for C++ to have portable source code with non-ASCII
> characters, and I believe we succeeded
>

I guess you never used Windows?

>
> >
> > This is not inherent to C++ which is one reason other languages converged
> > to utf8 as the default/only encoding.
> > (The primary reason being the Unicode character set is actually useful to
> > store text)
>
>
> we did come up with solutions that were non-Unicode - Unicode is not
> always useful. I cannot read Chinese nor Arabic, but I can use symbolic
> characters in a portable way and ensure they are correct and portable,
> e.g. authors' names. And we made it happen for many SC22 programming
> languages, via work in SC22/WG20
>

Arabic and Chinese speakers alone total over 1.6 billion people.

> keld
> >
> >
> > > keld
> > >
> > > On Tue, Aug 13, 2019 at 04:10:29PM -0400, Steve Downey wrote:
> > > > Getting back to the original question. I think execution character
> > > > set and execution encoding would refer to the encoding specified by
> > > > the default locale, the "C" locale. We do not change the execution
> > > > encoding via calls to setlocale(), we change the global default
> > > > locale to a new locale.
> > > >
> > > > Any name is going to be confusing. I think it's better to just get an
> > > > explicit definition to go together with the term. Something like
> > > > that the execution encoding is the same as the default character set
> > > > associated with the default "C" locale, and that it is IF NDR if the
> > > > actual default character set is different than the presumed
> > > > translation from source encoding to execution encoding, or if
> > > > translation units with different execution encodings are linked
> > > > together. IF NDR because I don't see how it could always be detected
> > > > but it can quickly turn into ODR violations where the same named
> > > > object has different definitions.
> > > >
> > > > On Tue, Aug 13, 2019 at 1:22 PM Corentin <corentin.jabot_at_[hidden]> wrote:
> > > >
> > > > >
> > > > >
> > > > > > On Tue, Aug 13, 2019, 7:08 PM Thiago Macieira <thiago_at_[hidden]> wrote:
> > > > >
> > > > >> On Tuesday, 13 August 2019 09:55:07 PDT Corentin wrote:
> > > > > >> > (if anyone is thinking about that, I don't recommend it. You're
> > > > > >> > going to run into size limits: ICC at 512kB and MSVC at 256kB.
> > > > > >> > Use something like xxd -i to generate a brace-delimited array
> > > > > >> > instead)
> > > > > >> >
> > > > > >> > Afaik that works if you use \x to escape every byte otherwise
> > > > > >> > some implementation will mess with your data. Nothing is
> > > > > >> > guaranteed to be passthrough otherwise
> > > > > >>
> > > > > >> That would be ideal, but the problem I had was the unavailability
> > > > > >> of proper tools to convert the input into a form that the C++
> > > > > >> compiler could consume. I was trying to do with a simple
> > > > > >> concatenation of a header, data, and footer.
> > > > > >>
> > > > > >> The end result is a shell script, a Perl script and a PowerShell
> > > > > >> script:
> > > > >> https://codereview.qt-project.org/c/qt/qtbase/+/263548
> > > > >
> > > > >
> > > > > > Interesting! std::embed could be useful there (we are going a bit
> > > > > > off script). Some kind of raw bytes literals or an implementation
> > > > > > that would optimize parsing arrays of literals such that it is as
> > > > > > efficient at compile time as strings would also be nice.
> > > > >
> > > > >>
> > > > >> --
> > > > >> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > > > >> Software Architect - Intel System Software Products
> > > > >>
> > > > >>
> > > > >>
> > > > >> _______________________________________________
> > > > > SG16 Unicode mailing list
> > > > > Unicode_at_[hidden]
> > > > > http://www.open-std.org/mailman/listinfo/unicode
> > > > >
> > >
> > >
> > >
>

Received on 2019-08-13 23:57:00