Re: [isocpp-sg16] Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 8 May 2024 19:04:44 -0400
On 5/8/24 6:16 PM, Tiago Freire wrote:
>
> > UTF-8 solves problems with mojibake. It does not solve problems with
> > translations. Let's go back to a variation of an example I gave
> > earlier that uses a hypothetical message catalog similar to GNU
> > gettext() to provide translations of strings in UTF-8 in char8_t.
>
> > std::cout << u8msg("In the month of ") << std::chrono::August << "\n";
>
> No, UTF-8 doesn’t solve the problem of mojibake. It’s an encoding like
> any other, and you get mojibake because you expect one encoding when
> the data is actually in another.
>
You're right, of course; I didn't state what I meant well. What I meant
was that UTF encodings remove the need for multiple encodings (or a
shift-state encoding) in order to represent certain combinations of
characters.
>
> Take your example of std::cout: even if you enforce a transcoding from
> UTF-8, terminals that can change their encoding at runtime are a
> thing, and sometimes you cannot even predict what encoding is going to
> be on the other end of that IPC. That is foregone; there is no solving
> that.
>
> That is why I think this model:
>
> > input encoding -> (program uses intermediate UTF-8 throughout) ->
> > output encoding
>
> is misguided: that middle step doesn’t actually solve anything, it
> just introduces an extra middleman where more things can go wrong.
>
> Whatever rule you want to define for that middleman could just as
> well have been applied to your input, and it would work just the same.
>
> The production of that last “output encoding” is what you ultimately
> care about, and it is not an interpretation problem, it’s a
> transcoding problem.
>
> If you can do A->B->C you can do A->C, but being able to do A->C
> doesn’t imply the ability to do A->B->C, so why even bother?
>
Use of a UTF encoding as the intermediate encoding enables
transformation and operations on all inputs without having to track an
associated encoding throughout the program.
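
As a rough illustration of that point, here is a minimal sketch that decodes
at the input boundary and then works on UTF-8 only. The function names
(append_utf8, latin1_to_utf8, greek_to_utf8) are hypothetical, and the
ISO-8859-7 decoder deliberately handles only ASCII and the lowercase Greek
letters; it is not a complete transcoder.

```cpp
#include <string>
#include <string_view>

// Append one code point as UTF-8 (sketch: no surrogate or range validation).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF.
std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    for (unsigned char b : in) append_utf8(out, b);
    return out;
}

// Partial ISO-8859-7 decoder (assumption: only ASCII and 0xE1..0xF9,
// the lowercase Greek letters, are handled; anything else becomes U+FFFD).
std::string greek_to_utf8(std::string_view in) {
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80)                     append_utf8(out, b);
        else if (b >= 0xE1 && b <= 0xF9)  append_utf8(out, 0x03B1 + (b - 0xE1));
        else                              append_utf8(out, 0xFFFD);
    }
    return out;
}
```

Once both inputs are UTF-8, something like
latin1_to_utf8("caf\xE9 ") + greek_to_utf8("\xE1") yields one string holding
both characters, with no per-string encoding tag carried along; neither of
the two source encodings could represent that combination on its own.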
>
> In any case I don’t think conversations around “locale” are productive
> or even on point. Formatting and encoding are not the same thing, and
> these should not be confused.
>
Historically, locale and encoding have been inseparable, and they continue
to be intertwined on almost all operating systems (I think macOS might be
the only exception? Perhaps one or more of the BSDs?). This discussion
arose because iostreams consults a locale by default and produces text in
the locale encoding.

Tom.

> Peter makes a good point, there are many competing code pages that can
> exist simultaneously in one system. But my conclusion is different.
>
> Those must be supported, and you shouldn’t break them. The solution
> isn’t “let’s just focus on unspecified stuff and make it all UTF-8”;
> the solution is “the user knows what they want to do, so let the user
> decide”. That means less of “a specific encoding” (i.e. UTF-8), not
> more of it.
>

Received on 2024-05-08 23:04:51