On Tue, Aug 13, 2019, 11:35 PM <keld@keldix.com> wrote:
On Tue, Aug 13, 2019 at 10:49:09PM +0200, Corentin wrote:
> On Tue, Aug 13, 2019, 10:34 PM <keld@keldix.com> wrote:
>
> > For most programs there is no default execution character set nor default
> > execution encoding. A binary program is designed to run with the run time
> > execution character set of the locale it runs with. So the same binary
> > program can run with a Japanese encoding or a Danish encoding or Arabic
> > encoding.
> > There is no knowledge at compilation time what encoding will be used at
> > run time
> >
>
> The standard assumes there is one. It has to. You cannot not have an
> encoding.
> (Of course it is broken but it's a very old assumption).

that encoding is then probably the same as the compile time encoding.


You have no control over that; no one does. The compiler will select an encoding for literals. Then these literals might be interpreted by, e.g., iostream using an encoding derived from the system locale (:s). Either they are the same or they are not, in which case the text is misinterpreted.

The compiler cannot know what the encoding at execution will be, and the information about which encoding the compiler chose is not stored.
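
To make the mismatch concrete, here is a minimal sketch (not from the thread). It assumes the compiler encoded the literal "ø" as UTF-8 (bytes C3 B8, written below with \x escapes so the example does not depend on the source encoding) and that whoever reads those bytes at run time assumes ISO 8859-1. The helper name latin1_to_utf8 is hypothetical.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: treat each input byte as an ISO 8859-1 code point
// (U+0000..U+00FF) and re-encode it as UTF-8. This mimics a consumer that
// wrongly assumes Latin-1 when the bytes were actually UTF-8.
std::string latin1_to_utf8(std::string_view bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                 // ASCII: one byte
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F))); // continuation byte
        }
    }
    return out;
}
```

Feeding it "\xC3\xB8" (UTF-8 for "ø") yields the bytes C3 83 C2 B8, which a UTF-8 terminal shows as "Ã¸". Nothing in the binary records which encoding the literal was in, so nothing can detect the mistake.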

> Also there is no such thing as a Danish encoding or a Japanese encoding.
> There is a Danish locale and an encoding attached to that locale (utf8, iso
> 8859). The standard doesn't always make the distinction (it should).

well, there are Danish encodings and Japanese encodings - multiple encodings suitable
for Danish or Japanese, and the specific encoding is, as you wrote,
attached to the locale

Let's say you have a neighbor called bjørn.
Is "I am going to see bjørn" not English?
Restricting a language to a limited character set does not match reality. Ergo the idea that a given character set is suitable for a locale is a bit bogus. Encodings are attached to a character set. And non-Unicode systems tend to conflate everything. That doesn't make it sensible!
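
As an illustration of "encodings are attached to a character set" (a sketch, not something from the thread): the character set assigns ø the code point U+00F8, and each encoding then maps that code point to bytes, or fails to. The function names encode_utf8 and encode_latin1 are hypothetical.

```cpp
#include <optional>
#include <string>

// Encode one Unicode code point as UTF-8 (no surrogate/range checks, for brevity).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// ISO 8859-1 covers only U+0000..U+00FF; anything else is unrepresentable.
std::optional<std::string> encode_latin1(char32_t cp) {
    if (cp > 0xFF) return std::nullopt;
    return std::string(1, static_cast<char>(cp));
}
```

U+00F8 comes out as the single byte F8 in ISO 8859-1 and as C3 B8 in UTF-8, while a Chinese character such as U+4E2D has a UTF-8 form but no ISO 8859-1 form at all.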



> But yeah, all of that precludes people from having non-ASCII in their source, as
> ASCII is currently the only thing that will work portably.

well, we worked hard for C++ to have portable source code with non-ASCII characters,
and I believe we succeeded

I guess you never used Windows?
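
One way to sidestep the source-encoding question, sketched here as an illustration rather than as anything proposed in the thread, is to keep the source pure ASCII: spell the character as a universal-character-name and use a u8 literal, so the bytes are UTF-8 whatever literal encoding the compiler would otherwise pick. literal_bytes and neighbor_utf8 are hypothetical names; the template accepts both char arrays (C++17) and char8_t arrays (C++20).

```cpp
#include <cstddef>
#include <string>

// Copy the bytes of a narrow or u8 string literal, dropping the terminator.
template <class Ch, std::size_t N>
std::string literal_bytes(const Ch (&lit)[N]) {
    static_assert(sizeof(Ch) == 1, "expects a narrow or u8 string literal");
    return std::string(reinterpret_cast<const char*>(lit), N - 1);
}

// "bjørn" written with ASCII-only source text; \u00F8 is the code point of ø.
inline std::string neighbor_utf8() {
    return literal_bytes(u8"bj\u00F8rn");
}
```

The result is always the six bytes 62 6A C3 B8 72 6E. Whether the run-time environment then interprets them as UTF-8 is a separate matter.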

>
> This is not inherent to C++ which is one reason other languages converged
> to utf8 as the default/only encoding.
> (The primary reason being the Unicode character set is actually useful to
> store text)


we did come up with solutions that were non-Unicode - Unicode is not always useful,
I cannot read Chinese nor Arabic, but I can use symbolic characters in a portable way and ensure
they are correct and portable, e.g. authors' names. And we made it happen for many SC22 programming
languages, via work in SC22/WG20

Arabic and Chinese alone total over 1.6 billion people.


keld
>
>
> > keld
> >
> > On Tue, Aug 13, 2019 at 04:10:29PM -0400, Steve Downey wrote:
> > > Getting back to the original question. I think execution character set and
> > > execution encoding would refer to the encoding specified by the default
> > > locale, the "C" locale. We do not change the execution encoding via calls
> > > to setlocale(), we change the global default locale to a new locale.
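
A minimal sketch of the distinction drawn above, using only the standard std::setlocale: a program starts in the "C" locale, and a later call replaces the global locale without touching the bytes the compiler already put into literals. The wrapper names current_c_locale and adopt_environment_locale are hypothetical.

```cpp
#include <clocale>
#include <string>

// Name of the current global C locale; passing nullptr only queries it.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Replace the global locale with the one named by the environment
// ("" means "use the environment"). Returns false if that was not possible,
// in which case the global locale is left unchanged.
bool adopt_environment_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

Calling current_c_locale() at the start of main gives "C". After adopt_environment_locale() the name depends entirely on the machine the binary happens to run on.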
> > >
> > > Any name is going to be confusing. I think it's better to just get an
> > > explicit definition to go together with the term. Something like that the
> > > execution encoding is the same as the default character set associated with
> > > the default "C" locale, and that it is IF NDR if the actual default
> > > character set is different than the presumed translation from source
> > > encoding to execution encoding, or if translation units with different
> > > execution encodings are linked together. IF NDR because I don't see how it
> > > could always be detected but it can quickly turn into ODR violations where
> > > the same named object has different definitions.
> > >
> > > On Tue, Aug 13, 2019 at 1:22 PM Corentin <corentin.jabot@gmail.com> wrote:
> > >
> > > >
> > > >
> > > > On Tue, Aug 13, 2019, 7:08 PM Thiago Macieira <thiago@macieira.org> wrote:
> > > >
> > > >> On Tuesday, 13 August 2019 09:55:07 PDT Corentin wrote:
> > > >> > (if anyone is thinking about that, I don't recommend it. You're going
> > > >> > to run into size limits: ICC at 512kB and MSVC at 256kB. Use something
> > > >> > like xxd -i to generate a brace-delimited array instead)
> > > >> >
> > > >> > Afaik that works if you use \x to escape every byte; otherwise some
> > > >> > implementations will mess with your data. Nothing is guaranteed to be
> > > >> > passthrough otherwise
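
A sketch of the "escape every byte with \x" approach mentioned above, assuming the data is already in memory; the helper name to_hex_escaped_literal is hypothetical and this is not the script from the linked Qt change.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: render arbitrary bytes as a C++ string literal in
// which every byte is written as a two-digit \x escape, so the generated
// source is pure ASCII and no byte is subject to encoding conversion.
std::string to_hex_escaped_literal(std::string_view data) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out = "\"";
    for (unsigned char b : data) {
        out += "\\x";
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    out += '"';
    return out;
}
```

Because every byte is escaped, each \x escape is followed either by another backslash or by the closing quote, so the compiler never absorbs a following character into the escape. The string-literal size limits mentioned above still apply to the generated text.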
> > > >>
> > > >> That would be ideal, but the problem I had was the unavailability of
> > > >> proper tools to convert the input into a form that the C++ compiler
> > > >> could consume. I was trying to do it with a simple concatenation of a
> > > >> header, data, and footer.
> > > >>
> > > >> The end result is a shell script, a Perl script and a PowerShell script:
> > > >>         https://codereview.qt-project.org/c/qt/qtbase/+/263548
> > > >
> > > >
> > > > Interesting! std::embed could be useful there (we are going a bit off
> > > > script). Some kind of raw bytes literals or an implementation that would
> > > > optimize parsing arrays of literals such that it is as efficient at
> > > > compile time as strings would also be nice.
> > > >
> > > >>
> > > >> --
> > > >> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > > >>    Software Architect - Intel System Software Products
> > > >>
> > > >>
> > > >>
> > > >> _______________________________________________
> > > > SG16 Unicode mailing list
> > > > Unicode@isocpp.open-std.org
> > > > http://www.open-std.org/mailman/listinfo/unicode
> > > >
> >