Date: Wed, 14 Aug 2019 18:24:11 +0000
>Far more importantly, if the committee can assume unicode-clean source code going forth, that makes far more tractable lots of other problems such as how char string literals ought to be interpreted.
I don't think this actually matters for implementations. The standard can describe what happens for Unicode and let implementations figure out what that means for the legacy encodings they target. An implementation on an EBCDIC machine, for example, can do an 'as if' notional conversion into UTF-8 for the purposes of following the standard's rules.
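The "as if" rule can be sketched concretely. The snippet below is purely illustrative (it is not how any real compiler is implemented): an implementation targeting EBCDIC may behave as if it first converted the source bytes to UTF-8 and then applied the standard's Unicode-based rules. Python's 'cp500' codec is one EBCDIC variant, used here only as a stand-in for whatever the platform actually uses.

```python
# Sketch of the notional "as if" conversion: EBCDIC source bytes are
# treated as if converted to UTF-8 before the standard's rules apply.
# 'cp500' is one EBCDIC code page; the real platform encoding may differ.
ebcdic_source = bytes([0xC1, 0xC2, 0xC3])          # "ABC" in EBCDIC (cp500)
as_if_utf8 = ebcdic_source.decode('cp500').encode('utf-8')
assert as_if_utf8 == b'ABC'                        # now ordinary UTF-8/ASCII
```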
(I've been saying for years that we should use IEEE 754 language for floats even though some machines don't have it; this is very similar: describe the behavior you want and let implementations with special considerations get as close to it as is practical.)
The bigger problem is what happens to puts("some string literal") on such an EBCDIC machine when the terminal is not expected to be UTF-8, or comparisons with argv when argv is not UTF-8 🙂.
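The argv problem can be shown in miniature: the same text has different byte values in UTF-8 and EBCDIC, so a byte-wise (memcmp-style) comparison of a UTF-8 string literal against EBCDIC program arguments fails even for plain letters. Again, 'cp500' is just one EBCDIC variant, used for illustration.

```python
# The same character has different byte values under the two encodings,
# so byte-for-byte comparison of a UTF-8 literal against EBCDIC argv
# data cannot match, even for 'A'.
literal = 'A'.encode('utf-8')    # b'\x41' in UTF-8/ASCII
argv0   = 'A'.encode('cp500')    # b'\xc1' in this EBCDIC variant
assert literal != argv0          # a memcmp-style comparison would fail
```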
>The present implementation-defined interpretation of the byte sequence in
>source files allows a default of "UTF-8 in strings, comments can use
>arbitrary bytes" (which thus allows existing source files in a range of
>ASCII-compatible 8-bit character sets if the non-ASCII characters only
>appear in comments, without needing to tell the compiler which character
>set is being used). That approach (which is what GCC does by default)
>seems more friendly to users with existing source files using various
>character sets in comments than strictly requiring everything to be UTF-8
>(even in comments) unless the compiler is explicitly told otherwise.
I don't think GCC's behavior here would be prevented by the standard describing the program input in terms of UTF-8.
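GCC's described default ("UTF-8 outside comments, arbitrary bytes inside comments") can be sketched as a validity check. This is a simplified illustration, not GCC's actual implementation: the comment stripping here ignores complications such as comment markers inside string literals.

```python
import re

def non_comment_bytes_are_utf8(src: bytes) -> bool:
    """Simplified model of GCC's default: strip /* ... */ and // comments,
    then require the remaining bytes to be valid UTF-8."""
    stripped = re.sub(rb'/\*.*?\*/|//[^\n]*', b'', src, flags=re.DOTALL)
    try:
        stripped.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Latin-1 byte 0xE9 ('é') is tolerated in a comment...
assert non_comment_bytes_are_utf8(b'int x; // caf\xe9\n')
# ...but the same byte in a string literal is rejected.
assert not non_comment_bytes_are_utf8(b'puts("caf\xe9");\n')
```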
Billy3
________________________________
From: Liaison <liaison-bounces_at_lists.isocpp.org> on behalf of Niall Douglas via Liaison <liaison_at_[hidden]>
Sent: Wednesday, August 14, 2019 10:36 AM
To: Niall Douglas via Liaison <liaison_at_[hidden]>
Cc: Niall Douglas <s_sourceforge_at_[hidden]>; unicode_at_[hidden] <unicode_at_[hidden]>
Subject: Re: [wg14/wg21 liaison] [isocpp-core] Source file encoding (was: What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?)
> The present implementation-defined interpretation of the byte sequence in
> source files allows a default of "UTF-8 in strings, comments can use
> arbitrary bytes" (which thus allows existing source files in a range of
> ASCII-compatible 8-bit character sets if the non-ASCII characters only
> appear in comments, without needing to tell the compiler which character
> set is being used). That approach (which is what GCC does by default)
> seems more friendly to users with existing source files using various
> character sets in comments than strictly requiring everything to be UTF-8
> (even in comments) unless the compiler is explicitly told otherwise.
I would find that choice unhelpful for tooling which processes C++
source code, e.g. Python, which insists that text fed to it is either
correctly encoded or not text at all. And that's not unreasonable:
either text is encoded correctly, or it is not.
What do you think of my "all 7-bit clean ASCII" proposal, with #pragma
encoding (if supported by your C compiler) as the opt-out?
Niall
_______________________________________________
Liaison mailing list
Liaison_at_[hidden]
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
Link to this post: http://lists.isocpp.org/liaison/2019/08/0022.php
Received on 2019-08-14 13:26:13