sg16: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: keld_at <keld_at_[hidden]>
Date: Tue, 13 Aug 2019 23:35:59 +0200

On Tue, Aug 13, 2019 at 10:49:09PM +0200, Corentin wrote:
> On Tue, Aug 13, 2019, 10:34 PM <keld_at_[hidden]> wrote:
>
> > For most programs there is no default execution character set nor default
> > execution encoding. A binary program is designed to run with the run time
> > execution character set of the locale it runs with. So the same binary
> > program can run with a Japanese encoding or a Danish encoding or Arabic
> > encoding.
> > There is no knowledge at compilation time what encoding will be used at
> > run time
> >
>
> The standard assumes there is one. It has to. You cannot not have an
> encoding.
> (Of course it is broken but it's a very old assumption).

That encoding is then probably the same as the compile-time encoding.

> Also there is no such thing as a Danish encoding or a Japanese encoding.
> There is a Danish locale and an encoding attached to that locale (utf8, iso
> 8859). The standard doesn't always make the distinction (it should).

Well, there are Danish encodings and Japanese encodings: multiple encodings suitable
for Danish or Japanese, and the specific encoding is, as you wrote,
attached to the locale.
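
A minimal sketch of how a program can ask which encoding the current locale carries at run time. It assumes a POSIX system (nl_langinfo and <langinfo.h> are POSIX, not ISO C++), and codeset_of is a hypothetical helper name used only for illustration.

```cpp
// Sketch: report the codeset attached to a locale at run time.
// Assumes POSIX; <langinfo.h> is not part of ISO C++.
#include <clocale>
#include <string>
#include <langinfo.h>

// Hypothetical helper: switch LC_CTYPE to `locale_name` and return the
// name of its codeset, or an empty string if the locale is unavailable.
// Note that this changes the global locale as a side effect.
std::string codeset_of(const char* locale_name) {
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return {};
    }
    return nl_langinfo(CODESET);
}
```

Calling codeset_of("") selects the locale from the environment, so the same binary may report a different codeset in different sessions. The exact spelling of the codeset names is platform-specific.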

> But yeah, all of that precludes people from having non-ASCII in their source, as
> ASCII is currently the only thing that will work portably.

Well, we worked hard for C++ to have portable source code with non-ASCII characters,
and I believe we succeeded.
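
A small sketch of one way this can look in practice, assuming the mechanism meant here is universal-character-names (\uXXXX), which have been in C++ since C++98. The name in the string and the function author_name are hypothetical, chosen only for illustration.

```cpp
// Sketch: non-ASCII characters written as universal-character-names,
// so the source file itself stays plain ASCII.
#include <string>

// Hypothetical example name. \u00f8 is LATIN SMALL LETTER O WITH STROKE
// and \u00e6 is LATIN SMALL LETTER AE.
inline std::u32string author_name() {
    return U"S\u00f8ren K\u00e6r";
}

// A char32_t literal holds the code point itself, so this check does not
// depend on the source file encoding or on the run-time locale.
static_assert(U'\u00f8' == 0x00F8, "UCN denotes the ISO 10646 code point");
```

The sketch uses char32_t literals on purpose: a narrow string literal containing the same universal-character-names would be converted to the execution character set, which is the part under discussion in this thread.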

>
> This is not inherent to C++ which is one reason other languages converged
> to utf8 as the default/only encoding.
> (The primary reason being the Unicode character set is actually useful to
> store text)

We did come up with solutions that were non-Unicode. Unicode is not always useful:
I cannot read Chinese or Arabic, but I can use symbolic characters in a portable way and ensure
they are correct, e.g. in authors' names. And we made it happen for many SC22 programming
languages, via work in SC22/WG20.

keld
>
>
> > keld
> >
> > On Tue, Aug 13, 2019 at 04:10:29PM -0400, Steve Downey wrote:
> > > Getting back to the original question. I think execution character set and
> > > execution encoding would refer to the encoding specified by the default
> > > locale, the "C" locale. We do not change the execution encoding via calls
> > > to setlocale(), we change the global default locale to a new locale.
> > >
> > > Any name is going to be confusing. I think it's better to just get an
> > > explicit definition to go together with the term. Something like that the
> > > execution encoding is the same as the default character set associated with
> > > the default "C" locale, and that it is IFNDR if the actual default
> > > character set is different than the presumed translation from source
> > > encoding to execution encoding, or if translation units with different
> > > execution encodings are linked together. IFNDR because I don't see how it
> > > could always be detected but it can quickly turn into ODR violations where
> > > the same named object has different definitions.
> > >
> > > On Tue, Aug 13, 2019 at 1:22 PM Corentin <corentin.jabot_at_[hidden]> wrote:
> > >
> > > >
> > > >
> > > > On Tue, Aug 13, 2019, 7:08 PM Thiago Macieira <thiago_at_[hidden]> wrote:
> > > >
> > > >> On Tuesday, 13 August 2019 09:55:07 PDT Corentin wrote:
> > > > >> > (if anyone is thinking about that, I don't recommend it. You're going to run
> > > > >> > into size limits: ICC at 512kB and MSVC at 256kB. Use something like xxd -i
> > > > >> > to generate a brace-delimited array instead)
> > > >> >
> > > >> > Afaik that works if you use \x to escape every byte otherwise some
> > > >> > implementation will mess with your data. Nothing is guaranteed to be
> > > >> > passthrough otherwise
> > > >>
> > > > >> That would be ideal, but the problem I had was the unavailability of proper
> > > > >> tools to convert the input into a form that the C++ compiler could consume. I
> > > > >> was trying to do it with a simple concatenation of a header, data, and footer.
> > > > >>
> > > > >> The end result is a shell script, a Perl script and a PowerShell script:
> > > >> https://codereview.qt-project.org/c/qt/qtbase/+/263548
> > > >
> > > >
> > > > > Interesting! std::embed could be useful there (we are going a bit off
> > > > > script). Some kind of raw bytes literals or an implementation that would
> > > > > optimize parsing arrays of literals such that it is as efficient at compile
> > > > > time as strings would also be nice.
> > > >
> > > >>
> > > >> --
> > > >> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > > >> Software Architect - Intel System Software Products
> > > >>
> > > >>
> > > >>
> > > >> _______________________________________________
> > > > SG16 Unicode mailing list
> > > > Unicode_at_[hidden]
> > > > http://www.open-std.org/mailman/listinfo/unicode
> > > >
> >

Received on 2019-08-13 23:35:59