Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 13 Aug 2019 16:48:59 -0400
Unfortunately, the standard currently says otherwise. [lex.phases] 5 says,
for example:
Each basic source character set member in a character literal or a string
literal, as well as each escape sequence and *universal-character-name
<http://eel.is/c++draft/lex#nt:universal-character-name>* in a character
literal or a non-raw string literal, is converted to the corresponding
member of the execution character set ([lex.ccon]
<http://eel.is/c++draft/lex#ccon>, [lex.string]
<http://eel.is/c++draft/lex#string>); if there is no corresponding member,
it is converted to an implementation-defined member other than the null
(wide) character.
http://eel.is/c++draft/lex#phases-1.5
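
To illustrate the phase 5 conversion quoted above, here is a minimal sketch. The helper name hex_bytes is hypothetical (it is not from the standard or from this thread). It dumps the bytes the compiler stored for a string literal. For a u8 literal the bytes are fixed by UTF-8; for an ordinary narrow literal they depend on whatever execution character set the compiler was configured with.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render the bytes of a string literal as hex.
// CharT is deduced, so this accepts ordinary literals (char) and
// u8 literals (char in C++17, char8_t in C++20).
template <typename CharT, std::size_t N>
std::string hex_bytes(const CharT (&lit)[N]) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    // i + 1 < N skips the terminating null character.
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const unsigned char b = static_cast<unsigned char>(lit[i]);
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

hex_bytes(u8"\u00E9") gives "C3 A9" on every conforming implementation, while the result of hex_bytes("\u00E9") is whatever the execution character set maps U+00E9 to, or an implementation-defined substitute if it has no such member.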

I agree, though, that avoiding the execution character set and current
locale is very common. I do a lot of work on international text, and
everything has to be dealt with explicitly, although Unicode with UTF-8 is
by far the easiest way.

Implicit is bad.
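
As a rough sketch of what "explicit" can look like for the classification functions this thread is about: the <locale> overload of std::isalpha takes the locale as an argument, so the result does not depend on the current global locale. The wrapper names below are hypothetical.

```cpp
#include <locale>

// Classify using a locale the caller names explicitly.
inline bool is_alpha_in(char c, const std::locale& loc) {
    return std::isalpha(c, loc);
}

// Hypothetical convenience wrapper: always uses the classic "C" locale,
// so later calls to setlocale() or std::locale::global() do not change
// the answer.
inline bool is_alpha_classic(char c) {
    return std::isalpha(c, std::locale::classic());
}
```

By contrast, the one-argument isalpha from <cctype> consults the current C locale implicitly, which is exactly the hidden dependency I would rather avoid.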

On Tue, Aug 13, 2019 at 4:34 PM <keld_at_[hidden]> wrote:

> For most programs there is no default execution character set nor default
> execution encoding. A binary program is designed to run with the run time
> execution character set of the locale it runs with. So the same binary
> program can run with a Japanese encoding or a Danish encoding or an Arabic
> encoding.
> There is no knowledge at compilation time what encoding will be used at
> run time.
>
>
> keld
>
> On Tue, Aug 13, 2019 at 04:10:29PM -0400, Steve Downey wrote:
> > Getting back to the original question. I think execution character set and
> > execution encoding would refer to the encoding specified by the default
> > locale, the "C" locale. We do not change the execution encoding via calls
> > to setlocale(), we change the global default locale to a new locale.
> >
> > Any name is going to be confusing. I think it's better to just get an
> > explicit definition to go together with the term. Something like: the
> > execution encoding is the same as the default character set associated with
> > the default "C" locale, and it is IFNDR (ill-formed, no diagnostic required)
> > if the actual default character set is different from the presumed
> > translation from source encoding to execution encoding, or if translation
> > units with different execution encodings are linked together. IFNDR because
> > I don't see how it could always be detected, but it can quickly turn into
> > ODR violations where the same named object has different definitions.
> >
> > On Tue, Aug 13, 2019 at 1:22 PM Corentin <corentin.jabot_at_[hidden]> wrote:
> >
> > >
> > >
> > > On Tue, Aug 13, 2019, 7:08 PM Thiago Macieira <thiago_at_[hidden]> wrote:
> > >
> > >> On Tuesday, 13 August 2019 09:55:07 PDT Corentin wrote:
> > >> > (if anyone is thinking about that, I don't recommend it. You're going
> > >> > to run into size limits: ICC at 512kB and MSVC at 256kB. Use something
> > >> > like xxd -i to generate a brace-delimited array instead)
> > >> >
> > >> > Afaik that works if you use \x to escape every byte; otherwise some
> > >> > implementation will mess with your data. Nothing is guaranteed to be
> > >> > passthrough otherwise
> > >>
> > >> That would be ideal, but the problem I had was the unavailability of
> > >> proper tools to convert the input into a form that the C++ compiler
> > >> could consume. I was trying to do it with a simple concatenation of a
> > >> header, data, and footer.
> > >>
> > >> The end result is a shell script, a Perl script and a PowerShell script:
> > >> https://codereview.qt-project.org/c/qt/qtbase/+/263548
> > >
> > >
> > > Interesting! std::embed could be useful there (we are going a bit off
> > > script). Some kind of raw bytes literals, or an implementation that would
> > > optimize parsing arrays of literals such that it is as efficient at
> > > compile time as strings, would also be nice.
> > >
> > >>
> > >> --
> > >> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> > >> Software Architect - Intel System Software Products
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > > SG16 Unicode mailing list
> > > Unicode_at_[hidden]
> > > http://www.open-std.org/mailman/listinfo/unicode
> > >

Received on 2019-08-13 22:49:12