sg16: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: keld_at <keld_at_[hidden]>
Date: Wed, 14 Aug 2019 00:29:29 +0200

On Tue, Aug 13, 2019 at 11:56:46PM +0200, Corentin wrote:
> On Tue, Aug 13, 2019, 11:35 PM <keld_at_[hidden]> wrote:
>
> > On Tue, Aug 13, 2019 at 10:49:09PM +0200, Corentin wrote:
> > > On Tue, Aug 13, 2019, 10:34 PM <keld_at_[hidden]> wrote:
> > >
> > > > For most programs there is no default execution character set nor
> > default
> > > > execution encoding. A binary program is designed to run with the run
> > time
> > > > execution character set of the locale it runs with. So the same binary
> > > > program can run with a Japanese encoding or a Danish encoding or Arabic
> > > > encoding.
> > > > There is no knowledge at compilation time what encoding will be used at
> > > > run time
> > > >
> > >
> > > The standard assumes there is one. It has to. You cannot not have an
> > > encoding.
> > > (Of course it is broken but it's a very old assumption).
> >
> > that encoding is then probably the same as the compile time encoding.
> >
>
>
> You have no control over that; no one does. The compiler will select an
> encoding for literals. Then these literals might be interpreted by e.g.
> iostream using an encoding derived from the system locale ( :s). Either they
> are the same or they are not, in which case in.
>
> The compiler cannot know what the encoding of the execution will be, and the
> information about which encoding the compiler chose is not stored.

you are probably right - at least some compilers would behave like this.
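
To make the point concrete, here is a minimal sketch (not taken from any particular compiler's documentation). The function names bytes_as_hex and char_count_in_locale are made up for illustration. The bytes of a narrow literal are fixed once the program is compiled, while the number of characters those bytes decode to is only decided by the locale selected at run time.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper: hex dump of the bytes stored for a narrow string.
// These bytes are whatever the compiler emitted; the locale cannot change them.
std::string bytes_as_hex(const std::string& s) {
    std::string out;
    char buf[3];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: how many characters the same bytes decode to under
// the named locale. Returns -1 if the locale is not available on this system
// or the bytes are not valid in that locale's encoding.
long char_count_in_locale(const std::string& s, const char* locale_name) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) return -1;

    std::mbstate_t state{};
    const char* p = s.data();
    const char* end = p + s.size();
    long count = 0;
    while (p < end) {
        wchar_t wc;
        const std::size_t r =
            std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (r == static_cast<std::size_t>(-1) ||
            r == static_cast<std::size_t>(-2)) {
            count = -1;  // invalid or incomplete sequence in this locale
            break;
        }
        p += (r == 0 ? 1 : r);  // r == 0 means an embedded null byte
        ++count;
    }
    std::setlocale(LC_CTYPE, saved.c_str());  // restore the previous locale
    return count;
}
```

For example, bytes_as_hex("\xC3\xB8") always reports the same two bytes, whereas char_count_in_locale("\xC3\xB8", name) may differ between a UTF-8 locale and an ISO 8859 locale, depending on which locale names happen to be installed.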

> >
> > > Also there is no such thing as a Danish encoding or a Japanese encoding.
> > > There is a Danish locale and an encoding attached to that locale (utf8,
> > iso
> > > 8859). The standard doesn't always make the distinction - it should.
> >
> > well, there are Danish encodings and Japanese encodings - multiple encodings
> > suitable
> > for Danish or Japanese, and the specific encoding is, as you wrote,
> > attached to the locale
> >
>
> Let's say you have a neighbor called bjørn.
> Is "I am going to see bjørn" not English?
> Restricting a language to a limited character set is not matching the
> reality. Ergo the idea that a given character set is suitable for a locale
> is a bit bogus. Encodings are attached to a character set. And non-Unicode
> systems tend to conflate everything. Doesn't make it sensible !

well, restricting the code to a limited character set like iso-8859-15
can remove some problems with having strange characters, and remove security issues
like greek or cyrillic letters in identifiers. iso-8859-1 has served me well for many years.
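
A rough sketch of such a restriction, assuming the text has already been decoded to code points. The name fits_latin1 is invented here. It checks against iso-8859-1, because that character set coincides with the first 256 Unicode code points. iso-8859-15 differs from it in a few positions and would need a small lookup table instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical check: true if every code point is representable in
// ISO 8859-1, i.e. lies in the range U+0000..U+00FF. Greek and Cyrillic
// letters lie above that range, so they are rejected.
bool fits_latin1(const std::u32string& text) {
    for (char32_t cp : text) {
        if (cp > 0xFF) return false;
    }
    return true;
}
```

With this, fits_latin1(U"bj\u00F8rn") holds, while a string containing a Greek alpha (U+03B1) or a Cyrillic letter is rejected.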
>
>
> > > But yeah, all of that precludes people from having non-ASCII in their source,
> > > as this is currently the only thing that will work portably.
> >
> > well we worked hard for c++ to have portable source code with non-ascii
> > characters,
> > and I believe we succeeded
> >
>
> I guess you never used windows?

I have not done much programming on Windows systems. I sometimes made a living administering them.
What are the problems with respect to this?
> >
> > >
> > > This is not inherent to C++ which is one reason other languages converged
> > > to utf8 as the default/only encoding.
> > > (The primary reason being the Unicode character set is actually useful to
> > > store text)
> >
> >
> > we did come up with solutions that were non-unicode - unicode is not
> > always useful,
> > I cannot read chinese nor arabic, but I can use symbolic characters in a
> > portable way and ensure
> > they are correct and portable, eg. author's names. And we made it happen
> > for many SC22 programming
> > languages, via work in SC22/WG20
> >
>
> Arabic and Chinese alone total over 1.6 billion people.

yes, at least. What I am saying is that I can maintain code with characters that I cannot read - nor write -
or even display, with the mechanisms that we have for portable i18n code in the C++ standard.
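
As an illustration only, and assuming the mechanism in question is universal-character-names: the source file below stays pure ASCII, yet it names characters the maintainer may not be able to type or display. The function names are made up for the example.

```cpp
#include <cassert>
#include <string>

// \u00F8 is the universal-character-name for LATIN SMALL LETTER O WITH STROKE.
// The u"" prefix makes this a UTF-16 literal, so the stored value does not
// depend on the source encoding or on the run-time locale.
inline std::u16string neighbour_name() {
    return u"bj\u00F8rn";
}

// \u03B1 is GREEK SMALL LETTER ALPHA, written without any Greek text in the
// source file itself.
inline std::u32string greek_alpha() {
    return U"\u03B1";
}
```

The escapes are checked by the compiler like any other token, so a maintainer can verify them against a code chart without a font or keyboard for the script.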

keld
>
> > keld
> > >
> > >
> > > > keld
> > > >
> > > > On Tue, Aug 13, 2019 at 04:10:29PM -0400, Steve Downey wrote:
> > > > > Getting back to the original question. I think execution character
> > set
> > > > and
> > > > > execution encoding would refer to the encoding specified by the
> > default
> > > > > locale, the "C" locale. We do not change the execution encoding via
> > calls
> > > > > to setlocale(), we change the global default locale to a new locale.
> > > > >
> > > > > Any name is going to be confusing. I think it's better to just get an
> > > > > explicit definition to go together with the term. Something like
> > that the
> > > > > execution encoding is the same as the default character set
> > associated
> > > > with
> > > > > the default "C" locale, and that it is IF NDR if the actual default
> > > > > character set is different than the presumed translation from source
> > > > > encoding to execution encoding, or if translation units with
> > different
> > > > > execution encodings are linked together. IF NDR because I don't see
> > how
> > > > it
> > > > > could always be detected but it can quickly turn into ODR violations
> > > > where
> > > > > the same named object has different definitions.
> > > > >
> > > > > On Tue, Aug 13, 2019 at 1:22 PM Corentin <corentin.jabot_at_[hidden]>
> > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019, 7:08 PM Thiago Macieira <thiago_at_[hidden]
> > >
> > > > wrote:
> > > > > >
> > > > > >> On Tuesday, 13 August 2019 09:55:07 PDT Corentin wrote:
> > > > > >> > (if anyone is thinking about that, I don't recommend it. You're
> > > > going
> > > > > >> to run
> > > > > >> > into size limits: ICC at 512kB and MSVC at 256kB. Use something
> > like
> > > > > >> xxd -i
> > > > > >> > to generate a brace-delimited array instead)
> > > > > >> >
> > > > > >> > Afaik that works if you use \x to escape every byte otherwise
> > some
> > > > > >> > implementation will mess with your data. Nothing is guaranteed
> > to be
> > > > > >> > passthrough otherwise
> > > > > >>
> > > > > >> That would be ideal, but the problem I had was the unavailability
> > of
> > > > > >> proper
> > > > > >> tools to convert the input into a form that the C++ compiler could
> > > > > >> consume. I
> > > > > >> was trying to do with a simple concatenation of a header, data,
> > and
> > > > > >> footer.
> > > > > >>
> > > > > >> The end result is a shell script, a Perl script and a powershell
> > > > script:
> > > > > >> https://codereview.qt-project.org/c/qt/qtbase/+/263548
> > > > > >
> > > > > >
> > > > > > Interesting ! std::embed could be useful there (we are going a bit
> > off
> > > > > > script). Some kind of raw bytes literals or an implementation that
> > > > would
> > > > > > optimize parsing arrays of literals such that it is as efficient at
> > > > compile
> > > > > > time as strings would also be nice.
> > > > > >
> > > > > >>
> > > > > >> --
> > > > > >> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > > > > >> Software Architect - Intel System Software Products
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> _______________________________________________
> > > > > > SG16 Unicode mailing list
> > > > > > Unicode_at_[hidden]
> > > > > > http://www.open-std.org/mailman/listinfo/unicode
> > > > > >
> > > >
> > > >
> > > >
> >

Received on 2019-08-14 00:29:29