On 5/8/24 6:16 PM, Tiago Freire wrote:

 

> UTF-8 solves problems with mojibake. It does not solve problems with translations. Let's go back to a variation of an example I gave earlier that uses a hypothetical message catalog similar to GNU gettext() to provide translations of strings in UTF-8 in char8_t.

> std::cout << u8msg("In the month of ") << std::chrono::August << "\n";

 

No, UTF-8 doesn’t solve the problem of mojibake. It’s an encoding like any other, and you get mojibake because you expect text to be in one encoding when it is actually in another.

You're right of course; I didn't state what I meant well. What I meant was that UTF encodings solve the problem of requiring multiple encodings (or a shift state encoding) in order to be able to represent certain combinations of characters.

Take your example of std::cout: even if you force a transcoding from UTF-8 to occur, terminals that can change their encoding at runtime are a thing, and sometimes you cannot even predict what encoding is going to be on the other end of that IPC. That is a foregone conclusion; there is no solving that.

 

That is why I think this model:

> input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding

 

is misguided: that middle step doesn’t actually solve anything; it just introduces an extra middleman where more things can go wrong.

Whatever rule you want to define for that middleman could just as well have been applied to your input, and it would work just the same.

The production of that last “output encoding” is what you ultimately care about, and it is not an interpretation problem, it’s a transcoding problem.

If you can do A->B->C you can do A->C, but being able to do A->C doesn’t imply the ability to do A->B->C, so why even bother?

Use of a UTF encoding as the intermediate encoding enables transformation and operations on all inputs without having to track an associated encoding throughout the program.
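A rough sketch of what I mean, not a proposal: the function names below (append_utf8, latin1_to_utf8, utf16_to_utf8, count_code_points) are made up for illustration, the UTF-16 decoder assumes well-formed input, and the output transcoding step is left out. The point is only that once both inputs have been converted, the same operation applies to either one without knowing where it came from.

```cpp
#include <cstddef>
#include <string>

// Append one Unicode scalar value to a UTF-8 encoded string.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// ISO-8859-1 input: each byte is the code point with the same value.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) append_utf8(out, b);
    return out;
}

// UTF-16 input. Assumption: surrogate pairs are well formed (no validation).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // Combine a high and low surrogate into one scalar value.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        }
        append_utf8(out, cp);
    }
    return out;
}

// An operation written once against the intermediate encoding:
// count code points by skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With this, count_code_points(latin1_to_utf8(a)) and count_code_points(utf16_to_utf8(b)) go through the same code path; without the intermediate step, every such operation would need the source encoding passed alongside the bytes.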

 

In any case I don’t think conversations around “locale” are productive or even on point. Formatting and encoding are not the same thing, and these should not be confused.

Historically, locale and encoding have been inseparable and continue to be intertwined on almost all operating systems (I think macOS might be the only exception? Perhaps one or more of the BSDs?). The reason for this discussion is that iostreams consults a locale by default and produces text in the locale encoding.
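To illustrate the iostreams half of that, here is a small sketch showing that a stream's output depends on whatever locale is imbued in it. GroupedDigits and format_with are hypothetical names invented for this example; a hand-written std::numpunct facet is used instead of a named system locale so that the result does not depend on which locales are installed.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes with '.'.
struct GroupedDigits : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

// Format a value through a stream imbued with the given locale.
std::string format_with(const std::locale& loc, long value) {
    std::ostringstream out;
    out.imbue(loc);   // operator<< consults this locale's facets
    out << value;
    return out.str();
}
```

Passing std::locale::classic() yields plain digits, while std::locale(std::locale::classic(), new GroupedDigits) yields grouped digits from the same insertion. The characters a facet returns are plain char values, so whatever encoding the locale implies ends up in the output as well.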

Tom.

 

Peter makes a good point: there are many competing code pages that can exist simultaneously in one system. But my conclusion is different.

Those must be supported, and you shouldn’t break them. The solution isn’t “let’s just focus on unspecified stuff and make it all UTF-8”; the solution is “the user knows what they want to do, so let the user decide”. It calls for less of “a specific encoding” (i.e. UTF-8), not more of it.