Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Thu, 29 Jul 2021 20:07:04 +0000
> That statement is misleading because it's mixing two things.

I realized this about 20 minutes after sending the message 😭 (prepare for my emojis to be munged by outlook).

I think the standard should specify parameters are a sequence of zero terminated byte strings, not containing embedded nul characters, in some platform specific encoding (which is essentially what we say now for "main").

If we want to allow zero bytes within arguments, then the interface for that needs to be length prefixed for each argument (obviously). However, this should probably be something you can ask for, and it would require a lot of thought (for example, on Linux you can get this to work by sending embedded zeros as zero-length dummy arguments). Supporting embedded nuls isn't a super great thing to spend our time on, but it's a fun exercise.
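
To make the length-prefix point concrete, here is a minimal sketch, not a proposal. Both function names and the 4-byte little-endian length field are assumptions chosen for illustration. It shows that a zero-terminated layout cannot tell one argument containing an embedded zero apart from two shorter arguments, while a length-prefixed layout can.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Classic layout: each argument is followed by a single 0x00 terminator.
std::string pack_nul_terminated(const std::vector<std::string>& args) {
    std::string block;
    for (const std::string& a : args) {
        block += a;
        block.push_back('\0');
    }
    return block;
}

// Hypothetical alternative: a 4-byte little-endian length, then the raw bytes.
std::string pack_length_prefixed(const std::vector<std::string>& args) {
    std::string block;
    for (const std::string& a : args) {
        assert(a.size() <= UINT32_MAX);  // the sketch only has 32 bits for the length
        const auto n = static_cast<std::uint32_t>(a.size());
        for (int shift = 0; shift < 32; shift += 8)
            block.push_back(static_cast<char>((n >> shift) & 0xFF));
        block += a;
    }
    return block;
}
```

Packing the single argument "ab\0cd" and the two arguments "ab", "cd" gives identical bytes with the first function and different bytes with the second, which is why embedded zeros need a different interface rather than a tweak to the existing one.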

Upon reflection I don't think WTF-8 is right for Unix. I think for Unix the (possibly implementation-defined) behavior should be "an array of zero-terminated byte strings that don't contain 0x0". Not all byte strings are valid WTF-8; if you're on Unix and need to transcode to UTF-16 with round tripping, I think you need something like PEP 383 instead (notably, the system need not know about your PEP 383 things).
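
For reference, a rough sketch of what "something like PEP 383" could look like when going from arbitrary bytes to UTF-16 and back. The function names are made up for illustration, and the escape rule is the one PEP 383 describes for Python (each undecodable byte 0xNN becomes the lone low surrogate U+DCNN); nothing here is an existing C++ or platform API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Strictly decode one UTF-8 sequence starting at s[i].
// Returns its length in bytes, or 0 if the bytes there are not well-formed.
std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned b0 = byte(i);
    if (b0 < 0x80) { cp = b0; return 1; }
    std::size_t n = 0;
    char32_t min = 0;
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;
    if (i + n > s.size()) return 0;
    for (std::size_t k = 1; k < n; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | static_cast<char32_t>(byte(i + k) & 0x3F);
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return n;
}

// Append the (generalized) UTF-8 encoding of cp.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) { out.push_back(static_cast<char>(cp)); return; }
    const int n = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
    out.push_back(static_cast<char>(lead[n] | (cp >> (6 * (n - 1)))));
    for (int k = n - 2; k >= 0; --k)
        out.push_back(static_cast<char>(0x80 | ((cp >> (6 * k)) & 0x3F)));
}

// Bytes -> UTF-16: well-formed UTF-8 decodes normally, anything else is escaped.
std::u16string bytes_to_utf16_escaped(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = 0;
        const std::size_t n = decode_one(in, i, cp);
        if (n == 0) {
            // Undecodable byte (always >= 0x80): escape as U+DC80..U+DCFF.
            out.push_back(static_cast<char16_t>(0xDC00 + static_cast<unsigned char>(in[i])));
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            out.push_back(static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF)));
        }
        i += n;
    }
    return out;
}

// UTF-16 -> bytes: the inverse of the function above.
std::string utf16_escaped_to_bytes(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            append_utf8(out, 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00));
            ++i;
        } else if (u >= 0xDC80 && u <= 0xDCFF) {
            out.push_back(static_cast<char>(u - 0xDC00));  // restore the raw byte
        } else {
            append_utf8(out, u);  // ordinary BMP value; other lone surrogates are out of scope here
        }
    }
    return out;
}
```

Because the strict decoder rejects UTF-8-encoded surrogates, an input can never decode to an escape value on its own, so the reverse mapping is unambiguous for every byte string. The resulting UTF-16 is ill-formed whenever an escape was needed, which the consumer has to tolerate.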

WTF-8 is reasonable on Windows because it round trips. For example, if argv were transcoded to WTF-8 strings, I could still write a Windows program that took its arguments as a UTF-8 char array _cast_ to wchar_t* (possibly adding one byte). If we specified that transcoding happened on Windows, then the CRT would assume the arguments were possibly ill-formed UTF-16 and transcode them to WTF-8. So 'AB', which was 0x4142 in the actual process arguments (not a C numeric literal; note big endian), would be converted to 0xe48582 for main. This is reversible, and the implementation could provide a function like:
unsigned char* main_to_system(unsigned char*); to allow getting back to the actual bytes passed in.
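
A small sketch of that round trip, under stated assumptions: the two function names below are hypothetical (they are not the main_to_system above and not a CRT API), and the decoder assumes its input is well-formed WTF-8. It reproduces the 0x4142 to E4 85 82 example and shows that a lone surrogate survives the trip.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode possibly ill-formed UTF-16 as WTF-8: surrogate pairs become one
// 4-byte sequence, lone surrogates get the 3-byte generalized UTF-8 form.
std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {  // includes lone surrogates
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Decode WTF-8 back to UTF-16. Assumes the input is well-formed WTF-8;
// arbitrary byte strings are not handled by this sketch.
std::u16string wtf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned b0 = static_cast<unsigned char>(in[i]);
        const std::size_t n = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
        assert(i + n <= in.size());
        char32_t cp = (n == 1) ? b0 : (b0 & (0xFFu >> (n + 1)));
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3Fu);
        i += n;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            out.push_back(static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF)));
        }
    }
    return out;
}
```

The decoder's precondition is the same limitation noted for Unix: the mapping is only defined for bytes that came out of the encoder, not for arbitrary byte strings.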

> The problem is that the way you wrote, it makes it sound like WTF-8 can be used to hold invalid file paths on POSIX systems and round-trip those to UTF-16. That doesn't work. Therefore, any cross-platform content that attempts to transcode to UTF-16 will have to deal with undecodeable paths any way.

I think this is true for WTF-8 on platforms where the parameters can be arbitrary byte strings, but I don't think it's true in general. I think there are probably transcoding algorithms that will take valid utf-8 to equivalent, valid utf-16, and the reverse while also round tripping for all invalid values. PEP-383 might be able to do this. We probably don't want to invent a new encoding and apply it at startup 😊.


-----Original Message-----
From: SG16 <sg16-bounces_at_lists.isocpp.org> On Behalf Of Thiago Macieira via SG16
Sent: Thursday, July 29, 2021 8:12 AM
To: sg16_at_lists.isocpp.org
Cc: Thiago Macieira <thiago_at_macieira.org>
Subject: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

On Wednesday, 28 July 2021 16:09:32 PDT Charlie Barto via SG16 wrote:
> > int main(int argc, char8_t** args, char8_t** env)
>
> Yeah I think anything like this should be specified to be WTF-8, even
> on posix making them actual utf-8 would break file path arguments.
> With WTF-8 you can round trip to the original sequence of potentially
> ill formed utf-16 code units.

That statement is misleading because it's mixing two things.

You're saying it should be WTF-8 because on Windows, it can be used to hold improperly-encoded UTF-16 file paths.

And you're saying that because it would be WTF-8 on Windows, it should be WTF-8 on POSIX systems too.

Both suggestions are fine. I agree with them.

The problem is that the way you wrote, it makes it sound like WTF-8 can be used to hold invalid file paths on POSIX systems and round-trip those to UTF-16. That doesn't work. Therefore, any cross-platform content that attempts to transcode to UTF-16 will have to deal with undecodeable paths any way.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering




Received on 2021-07-29 15:07:09