Date: Wed, 8 May 2024 22:16:10 +0000
> UTF-8 solves problems with mojibake. It does not solve problems with translations. Let's go back to a variation of an example I gave earlier that uses a hypothetical message catalog similar to GNU gettext() to provide translations of strings in UTF-8 in char8_t.
> std::cout << u8msg("In the month of ") << std::chrono::August << "\n";
No, UTF-8 doesn’t solve the problem of mojibake. It’s an encoding like any other, and the reason you get mojibake is that you expect the data to be in one encoding when it is actually in another.
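To make that concrete, here is a minimal sketch of how mojibake comes about. The helper name latin1_to_utf8 is mine and purely illustrative; it assumes the consumer believes its input is ISO-8859-1 (Latin-1) and re-encodes it as UTF-8. The function itself is a correct Latin-1 to UTF-8 transcoder; the garbage only appears when the assumption about the input is wrong.

```cpp
#include <string>

// Hypothetical helper: treats every input byte as one ISO-8859-1 code point
// (U+0000..U+00FF) and re-encodes it as UTF-8.
std::string latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII range: identical in both encodings.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

If you hand it the two bytes 0xC3 0xA9 (which is "é" in UTF-8), it sees two Latin-1 characters, "Ã" and "©", and faithfully produces the UTF-8 for "Ã©". Nothing in the transcoder is broken; the expectation about the input was.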
Take your example of std::cout. Even if you force a transcoding from UTF-8 to occur, terminals that can change their encoding at runtime are a thing, and sometimes you cannot even predict what encoding is going to be on the other end of that IPC. That is a lost cause; there is no solving it.
That is why I think this model:
> input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding
is misguided. That middle step doesn’t actually solve anything; it just introduces an extra middleman where more things can go wrong.
Whatever rule you want to define for that middleman could just as well have been applied to your input, and it would work just the same.
The production of that last “output encoding” is what you ultimately care about, and it is not an interpretation problem; it’s a transcoding problem.
If you can do A->B->C you can do A->C, but being able to do A->C doesn’t imply the ability to do A->B->C, so why even bother?
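The first half of that claim is just function composition. A rough sketch, where the Transcoder alias and the compose helper are names I made up for illustration and not part of any proposal:

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical alias: any byte-string to byte-string conversion.
using Transcoder = std::function<std::string(const std::string&)>;

// Given A->B and B->C, build A->C by chaining the two conversions.
Transcoder compose(Transcoder a_to_b, Transcoder b_to_c) {
    return [first = std::move(a_to_b), second = std::move(b_to_c)](
               const std::string& input) {
        return second(first(input));
    };
}
```

There is no inverse of this: holding only an A->C transcoder gives you no way to recover an A->B or a B->C step, which is the asymmetry I am pointing at.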
In any case, I don’t think conversations around “locale” are productive or even on point. Formatting and encoding are not the same thing, and the two should not be confused.
Peter makes a good point: there are many competing code pages that can exist simultaneously in one system. But my conclusion is different.
Those must be supported, and you shouldn’t break them. The solution isn’t “let’s just focus on unspecified stuff and make it all UTF-8”; the solution is “the user knows what they want to do, so let the user decide”. What we need is less of “a specific encoding” (i.e. UTF-8), not more of it.
Received on 2024-05-08 22:16:13