Re: Agenda for the 2024-04-24 SG16 meeting

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sat, 20 Apr 2024 23:39:49 +0200
On 20/04/2024 22.12, Tiago Freire wrote:
>
>> There seem to be case-insensitive filesystems at least on Linux (for the better or worse), and those certainly need to be encoding-aware.
>> Also, my understanding is that Windows has case-insensitive filesystems by default.
>
> They do need to be encoding aware? Why? I propose it to you that you don't.

If the filesystem uses encoding X to determine case-insensitive
name-matching and your application is using encoding Y to name files,
it seems likely that the filesystem will consider some file names
case-insensitive matches, even though you would consider that
grossly incorrect in your application given your use of encoding Y.
For example, if the application creates two files named "a" and "z"
in some directory, getting a "file exists" error for the creation
of the "z" file, just because it happens to be a case-insensitive
match with "a" in the filesystem's encoding, would be very surprising.

That means to me that your application needs to be aware of
the filesystem using encoding X for its case-insensitive matches.
In this case, I gave a specific example of Linux using UTF-8
(and Unicode) for its case-insensitive matching, which you
seem to have stripped.
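
To make the mismatch concrete, here is a minimal sketch. It assumes a
hypothetical filesystem that folds case by interpreting each byte of a
name as ISO 8859-1, while the application interprets the same bytes as
code page 437. The names fold_latin1 and same_name_latin1 are invented
for illustration and do not describe any real filesystem's algorithm.

```cpp
// Hypothetical case folding for single bytes interpreted as ISO 8859-1.
// Illustration only; not the algorithm of any real filesystem.
constexpr unsigned char fold_latin1(unsigned char b) {
    // ASCII 'A'..'Z' map to 'a'..'z'.
    if (b >= 0x41 && b <= 0x5A) return static_cast<unsigned char>(b + 0x20);
    // 0xC0..0xDE are uppercase letters in ISO 8859-1,
    // except 0xD7 (the multiplication sign).
    if (b >= 0xC0 && b <= 0xDE && b != 0xD7) return static_cast<unsigned char>(b + 0x20);
    return b;
}

// Returns true if the hypothetical filesystem would treat both names
// as referring to the same file.
constexpr bool same_name_latin1(const char* a, const char* b) {
    while (*a != '\0' && *b != '\0') {
        if (fold_latin1(static_cast<unsigned char>(*a)) !=
            fold_latin1(static_cast<unsigned char>(*b)))
            return false;
        ++a;
        ++b;
    }
    return *a == *b;  // equal only if both names ended at the same position
}

// Bytes 0xC9 and 0xE9 are an uppercase/lowercase pair in ISO 8859-1,
// but in code page 437 they are a box-drawing character and a capital theta.
static_assert(same_name_latin1("\xC9", "\xE9"), "filesystem sees a case-insensitive match");
static_assert(!same_name_latin1("a", "z"), "distinct letters stay distinct");
```

An application that thinks in code page 437 would see two unrelated
one-character names here, yet creating the second file would fail on
such a filesystem.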

> Case-insensitive filesystems are a problem of a different nature.

They exist, and I was specifically responding to your claim
"despite the fact that no filesystem is Unicode" with a specific
example that actually "is Unicode".

> Even if the designer of the filesystem made best efforts to follow unicode, it can just be broken by a unicode update.

No. There is a stability guarantee:

"Stability. The definition of case folding is guaranteed to be stable, in that any string of
characters case folded according to these rules will remain case folded in Version 5.0 or
later of the Unicode Standard."

(bottom of page 241 here: https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf )

>> Regardless of Unicode, you seem to agree that people should be able to express identifiers using non-English scripts.
>
> Yes, but. As far as I'm concerned this is an optional feature, you can have it, but we shouldn't explicitly support it.
> The encoding must prove itself to be fit for purpose, in this case for the purpose of expressing C++ code, if the encoding can't support it we shouldn't bend over backwards to accommodate it.
> I don't advocate to change key words like "if", "void", "int", or change the function names from the standard library to use other scripts.

I'm not aware of any proposal in that direction.

>> So, how would you express the semantics of universal-character-names, then?
>
> You don't. And you shouldn't. That's another terrible idea.

So, are you proposing to remove the feature "universal-character-name"
from the C++ standard? Or should the grammar production remain?
If the latter, what should it mean to say \u1234 in source code?
We already have \xAB to specify individual code units, if someone
needs that.
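
As a small sketch of the difference between the two escapes (the names
code_unit, character, and unit_value are mine, chosen for illustration):
\xAB asks for one code unit with the value 0xAB, while \u00AB asks for
the character U+00AB, which a UTF-8 literal represents with two code
units.

```cpp
// "\xAB" requests a single code unit with the value 0xAB,
// whatever the ordinary literal encoding happens to be.
constexpr auto& code_unit = "\xAB";

// u8"\u00AB" requests the character U+00AB; UTF-8 needs two code units for it.
constexpr auto& character = u8"\u00AB";

// Reads a narrow code unit as an unsigned value (works for char and char8_t).
template <typename CharT>
constexpr unsigned unit_value(CharT c) { return static_cast<unsigned char>(c); }

static_assert(sizeof(code_unit) == 2, "one code unit plus the terminating null");
static_assert(unit_value(code_unit[0]) == 0xAB, "the code unit is exactly 0xAB");

static_assert(sizeof(character) == 3, "two code units plus the terminating null");
static_assert(unit_value(character[0]) == 0xC2, "first UTF-8 code unit of U+00AB");
static_assert(unit_value(character[1]) == 0xAB, "second UTF-8 code unit of U+00AB");
```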

> Why should I be forced to format with unicode?

You are not forced to format with Unicode. You're identifying a particular
character from the Unicode character repertoire. Your implementation is
welcome to encode string literals as EBCDIC in the resulting binary,
or some 2-byte EBCDIC variant, and to represent the character identified
by code point 0x1234 in the Unicode repertoire with whatever value your
chosen encoding assigns to that character. Maybe the encoded value ends
up being 442.

Just because \u1234 identifies a particular character from the Unicode
repertoire doesn't mean anyone has to encode it as 0x1234, anywhere.
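
A short sketch of that point, using only the literal kinds whose
encoding the standard fixes (the helper name code_units is mine):
the same \u1234 yields different code unit sequences depending on the
encoding of the literal.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// One character, U+1234, in three encodings.
static_assert(code_units(u8"\u1234") == 3, "UTF-8 uses three code units");
static_assert(code_units(u"\u1234") == 1, "UTF-16 uses one code unit");
static_assert(code_units(U"\u1234") == 1, "UTF-32 uses one code unit");

// Only the UTF-16 and UTF-32 forms happen to contain the number 0x1234.
static_assert(static_cast<unsigned char>(u8"\u1234"[0]) == 0xE1, "UTF-8 lead byte");
static_assert(static_cast<unsigned char>(u8"\u1234"[1]) == 0x88, "UTF-8 second byte");
static_assert(static_cast<unsigned char>(u8"\u1234"[2]) == 0xB4, "UTF-8 third byte");
static_assert(u"\u1234"[0] == 0x1234, "UTF-16 code unit");
static_assert(U"\u1234"[0] == 0x1234, "UTF-32 code unit");

// An ordinary "\u1234" literal would use the implementation-defined
// ordinary literal encoding, so nothing is asserted about it here.
```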

> Why are you telling me what code I should write instead of helping me write code?

I don't understand what you're trying to tell me here.

> That defeats the goal of trying to be encoding agnostic, now you can't work with a different encoding.

You can. Character repertoire and encoding are different things.

> C++ shouldn't be the language of Unicode. This needs to stop.
> It should allow the usage of unicode, it shouldn't require unicode to use.

It doesn't. You're welcome to ignore u8/u16/u32 string literals
and char8_t, char16_t, char32_t arrays if those don't fit your
use-cases.

>> Can you give specific examples where you believe a plausible identifier is prevented by the Unicode identifier rules?
>
> A zero-width-joiner with code point 0x200D which is suggested to be disallowed, could have a different meaning in a different encoding.

You seem to be confusing abstract characters with encoding values.

A zero-width-joiner stays a zero-width-joiner in the abstract,
regardless of which encoding or number you might use to represent it.
Conversely, the encoding value 0x200D can mean anything;
the C++ standard does not ascribe any meaning to the encoding
value 0x200D in char and wchar_t strings.

That's why (conceptually) [lex.phases] p1.1 translates incoming source
files to an abstract "translation character set" (see [lex.charset])
that doesn't (conceptually) have numbers for the characters anymore.
And only when we need to represent a sequence of such abstract
source file characters as numerical values (in particular, when
initializing an array with a string-literal) do we apply an
"encoding" step. For char and wchar_t, that encoding is entirely
implementation-defined, and your implementation is welcome to pick
any encoding it likes. See [lex.string].

>>> Adding Unicode support to the language should start by providing conversion functions that allows you to convert between different encodings, or that provide some checks or guarantees on runtime data that a user is free to not use if they want to use something else.
>> Agreed on that part. So, do you have specific concerns about the proposals for providing exactly those conversion functions, or at least a subset thereof?
>
> No, I agree, we should add those, we should totally facilitate users to be able to manipulate unicode data to their heart's content, and we should facilitate that as a library to users.
> And that is not as hard a problem because there's nothing special about zero-width-joiners, or emoji, when manipulating text, and a lot of the unicode evolution isn't even relevant here.

Right, that's why this is fairly low-hanging fruit, compared to,
say, collation algorithms.

> Just don't use unicode in the C++ lexer, don't make it the mandatory way of formatting things either.

I think in order to make progress here, you should present an outline
of specific changes you would like to see in the C++ standard text,
referring to specific subclauses and paragraphs.

Jens

Received on 2024-04-20 21:40:03