Re: [SG16-Unicode] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 14 Aug 2019 13:07:13 +0200
On Wed, Aug 14, 2019, 12:39 PM Niall Douglas <s_sourceforge_at_[hidden]> wrote:

> Removed CC to Core, as per Tom's request.
> > I agree with you that reinterpreting all existing code overnight as
> > UTF-8 would hinder the adoption of future C++ versions enough that we
> > should probably avoid doing that, but maybe a slight encouragement to
> > use UTF-8 would be beneficial to everyone.
> I don't personally think it's a big ask for people to convert their
> source files into UTF-8 when they flip the compiler language standard
> version into C++ 23, *if they don't tell the compiler to interpret the
> source code in a different way*. As I mentioned in a previous post, even
> very complex multi-encoded legacy codebases can be upgraded via Python.
> Just invest the effort, upgrade your code, clear the tech debt. Same as
> everyone must do with every C++ standard version upgrade.
> Far more importantly, if the committee can assume unicode-clean source
> code going forth, that makes far more tractable lots of other problems
> such as how char string literals ought to be interpreted.
> Right now there is conflation in this discussion between two types of
> char string:

I don't think people (at least in SG16) are confused. The standard does
conflate everything. I think that's why Tom asked about the names of these
things to begin with.

> 1. char strings which come from the runtime environment e.g. from
> argv[], which can be ANY arbitrary encoding, including arbitrary bits.
> 2. char strings which come from the compile time environment with
> compiler-imposed expectations of encoding e.g. from __FILE__
> 3. char strings which come from the compile time environment with
> arbitrary encoding and bits e.g. escaped characters inside string literals.

2 and 3 will have the same encoding. (Which will utterly fail when we try
to introduce Unicode identifiers and reflection).
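
To make the three categories in Niall's list concrete, here is a minimal C++ sketch. The function names (runtime_strings, compiler_supplied_name, escaped_literal) are hypothetical and exist only for illustration; nothing here is taken from a proposal.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Kind 1: bytes handed over by the runtime environment (e.g. argv[]).
// Nothing guarantees an encoding, so they are copied as opaque bytes.
inline std::vector<std::string> runtime_strings(int argc, char** argv) {
    return std::vector<std::string>(argv, argv + argc);
}

// Kind 2: a string produced by the compiler itself. Its content and
// encoding are whatever the implementation chooses for __FILE__.
inline const char* compiler_supplied_name() {
    return __FILE__;
}

// Kind 3: a literal whose bytes are spelled out with escapes. The bytes
// 0xFF 0xFE are not well-formed UTF-8, yet the literal is still a valid
// narrow string literal.
inline const char* escaped_literal() {
    return "\xFF\xFE";
}

// Small self-check, assuming a fake argv that carries a Latin-1 byte.
inline void demo_three_kinds() {
    char arg0[] = "prog";
    char arg1[] = "caf\xE9";  // 0xE9 is Latin-1 e-acute, invalid alone in UTF-8
    char* fake_argv[] = {arg0, arg1};
    const auto args = runtime_strings(2, fake_argv);
    assert(args.size() == 2);
    assert(static_cast<unsigned char>(args[1][3]) == 0xE9);
    assert(std::strlen(escaped_literal()) == 2);
}
```

All three functions yield plain char data, which is the conflation under discussion: the type says nothing about where the bytes came from or how they are encoded.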

> This conflation is not helping the discussion get anywhere useful
> quickly. For example, one obvious solution to the above is that string
> literals gain a type of char8_maybe_t if they don't contain anything
> UTF-8 unsafe, and char8_maybe_t can implicitly convert to char8_t or to
> char.

Maybe we already have enough literal types.
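
For reference, the "doesn't contain anything UTF-8 unsafe" condition in the char8_maybe_t idea amounts to a well-formedness check on the literal's bytes. The following is only a sketch of such a check, assuming C++17; is_valid_utf8 is a hypothetical helper name, and a compiler would perform the equivalent test internally rather than through library code.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong
// encodings, surrogate code points, and values above U+10FFFF.
constexpr bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // continuation byte or invalid lead byte

        if (i + len > s.size()) return false;  // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        constexpr unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;               // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}

// Small self-check of the helper.
inline void demo_is_valid_utf8() {
    assert(is_valid_utf8("plain ASCII"));
    assert(is_valid_utf8("\xC3\xA9"));   // U+00E9 encoded as UTF-8
    assert(!is_valid_utf8("\xE9"));      // the same letter as a Latin-1 byte
}
```

A literal made only of ASCII characters passes trivially, which is why such a literal could in principle convert to either char or char8_t without changing its bytes.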

> Various people have objected to my proposal on strawman grounds e.g. "my
> code would break". Firstly, if that is the case, your code is probably
> *already* broken, and "just happens" to work on your particular
> toolchain version. It won't be portable, in any case.

Agreed. But when I say these kinds of things, people make funny faces and
get annoyed at having the brokenness of their code or the standard pointed
out. So this option seems out, especially on Windows, where the system
encoding is not UTF-8.

> Secondly, as Tom suggested, some sort of #pragma to indicate encoding is
> probably unavoidable in the long run in any case, because the
> preprocessor also needs to know encoding. Anybody who has wrestled with
> files #including files of differing encoding, but insufficiently
> different that the compiler can't auto-detect the disparate encoding,
> will know what I mean. Far worse happens again when macros with content
> from one encoding are expanded into files with different encoding.

I don't see how the preprocessor factors into that; the mapping to the
internal encoding is done before preprocessing.

Also, a pragma doesn't help you when mixing EBCDIC and ASCII supersets.

> The current situation of letting everybody do what they want is a mess.

Strongly agree.

> That's what standardisation is for: imposition of order upon chaos.
> Just make the entire lot UTF-8! And let individual files opt-out if they
> want, or whole TUs if the user asks the compiler to do so, with the
> standard making it very clear that anything other than UTF-8 =
> implementation defined behaviour for C++ 23 onwards.

That is the pragmatic long-term solution, but not the pragmatic short-term
one. WG21 favors the latter, it seems.

I would support such a thing. All other languages went there and it works
great for them. Python will, for example, assume UTF-8 in the absence of

> Niall
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-08-14 13:07:27