Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 14 Aug 2019 14:43:00 +0200
On Wed, Aug 14, 2019, 2:31 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 8/14/19 2:49 AM, Corentin Jabot via Core wrote:
>
>
>
> On Wed, Aug 14, 2019, 4:46 AM Tony V E <tvaneerd_at_[hidden]> wrote:
>
>>
>>
>> On Tue, Aug 13, 2019 at 8:57 AM Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Tue, 13 Aug 2019 at 14:52, Ville Voutilainen <
>>> ville.voutilainen_at_[hidden]> wrote:
>>>
>>>> On Tue, 13 Aug 2019 at 15:35, Corentin Jabot via Core
>>>> <core_at_[hidden]> wrote:
>>>> >
>>>> >
>>>> > Chiming in with my favorite solution:
>>>> > Forbid u8/u16/u32 literals in non-Unicode-encoded files
>>>>
>>>> But presumably not the ones that look like u8"\U1234" ?
>>>>
>>>
>>> Yes, there is no reason to disallow that, as it can't be misinterpreted
>>> by either the compiler or people (and quite a lot of code would needlessly
>>> break)
>>>
>>>
>> I find your lack of faith in people's ability to misinterpret something
>> disturbing.
>> :-)
>>
>
> 😁 (Challenging your mail client)
>
>
> \Uxxxx is unambiguous.
>
> u8"é" is ambiguous. Both people and the compiler may interpret that in a
> variety of ways. Notably if I have utf-8 in that file, which I wrote on
> Linux, but then the msvc compiler thinks it's windows 1252...
> Mojibake.
>
> There is no ambiguity there, just bog standard mojibake due to incorrect
> source file encoding assumptions. "é" has exactly the same set of
> "problems" as L"é", u8"é", u"é", and U"é".
>

Yes. People make assumptions, compilers make assumptions, and voilà:
mojibake. The issue is assuming that all parties involved share the same
intent and the same assumptions. Preventing wrong assumptions reduces the
amount of mojibake.
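
To make the failure mode concrete, here is a rough sketch of what happens
when the UTF-8 bytes of "é" are read as if they were Windows-1252 and then
encoded as UTF-8 again. The function name misread_as_cp1252 is made up for
this illustration, and the sketch only covers ASCII and bytes 0xA0-0xFF
(where Windows-1252 agrees with Latin-1); it is not how any particular
compiler implements its conversion.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: treat each input byte as a Windows-1252 character
// and re-encode it as UTF-8, the way a tool with the wrong source-encoding
// assumption effectively would.
// Assumption: only ASCII and bytes 0xA0-0xFF are handled. In that range
// Windows-1252 matches Latin-1, so the code point equals the byte value.
// Bytes 0x80-0x9F would need a lookup table, which this sketch omits.
std::string misread_as_cp1252(const std::string& utf8_bytes) {
    std::string out;
    for (unsigned char b : utf8_bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);  // ASCII passes through unchanged
        } else {
            assert(b >= 0xA0 && "0x80-0x9F is not covered by this sketch");
            // Two-byte UTF-8 encoding of the code point U+00A0..U+00FF.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Feeding it "\xC3\xA9" (the UTF-8 encoding of é) gives back the four bytes
C3 83 C2 A9, which display as "Ã©".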

>
>
> People also seem to be confused
>
>
> https://stackoverflow.com/questions/23471935/how-are-u8-literals-supposed-to-work
>
> Yes, that is a typical example of someone learning that source file
> encoding and execution encoding can be independently controlled. Note that
> the example even illustrates the individual being confused about handling
> of u8 literals and *then* becoming confused about handling of ordinary
> literals after learning about gcc's -finput-charset option (but
> apparently having not yet learned about gcc's -fexec-charset option).
>
Yes. I would make the bold claim (I don't have data) that most people are
confused about strings, even more so in the context of C++. The current
model makes it difficult to do the right thing and easy to create mojibake.
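
For contrast, the escape form discussed earlier in the thread does not
depend on how the source file is decoded. A small sketch (the helper names
code_units and e_acute_utf8 are mine, purely for illustration) that collects
the bytes of a u8 literal written with a universal-character-name; the
template is only there so it compiles whether u8 literals are arrays of
char (C++17) or char8_t (C++20):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy the code units of a string literal, without
// the terminating null, into a vector of unsigned char.
template <typename CharT, std::size_t N>
std::vector<unsigned char> code_units(const CharT (&literal)[N]) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<unsigned char>(literal[i]));
    }
    return out;
}

// U+00E9 (é) spelled as a universal-character-name: the source file
// contains only ASCII here, so there is nothing to misdecode.
inline std::vector<unsigned char> e_acute_utf8() {
    return code_units(u8"\u00E9");
}
```

The result is the two bytes C3 A9 regardless of the encoding the compiler
assumes for the file.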


> Tom.
>
>
>
>> --
>> Be seeing you,
>> Tony
>>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2019/08/7049.php
>
>
>

Received on 2019-08-14 14:43:14