On 9/1/20 2:14 PM, Aaron Ballman via SG16 wrote:
> On Tue, Sep 1, 2020 at 1:16 PM Martinho Fernandes via SG16
> <email@example.com> wrote:
>> On Tue, Sep 1, 2020 at 7:05 PM Aaron Ballman via SG16 <firstname.lastname@example.org> wrote:
>>> On Tue, Sep 1, 2020 at 12:08 PM Alisdair Meredith via SG16
>>> <email@example.com> wrote:
>>>> For a cross compiler, the basic execution character set should correspond to the target platform, but the diagnostics character set should be for the host?
>>> That matches my understanding.
>>> I suppose a question I could add is whether anyone would like to see a
>>> new character set introduced for diagnostics. My intuition is that it
>>> would be a pretty heavy hammer to bring to bear and that the basic
>>> source character set is probably Good Enough (tm).
>> Wouldn't these diagnostics be the place people are more likely to use non-basic source characters, though? When it comes to identifiers people will sometimes compromise and restrict themselves and e.g. avoid diacritics, but in error messages I feel like it makes a lot more sense to want to write with the full expression of their native script.
> I think that's likely a valid point, but I'm struggling to find any
> data that I can point to in the paper. I don't suppose you (or anyone
> in SG16) have insights into how often this comes up in practice? e.g.,
> does someone have evidence that this comes up enough to warrant making
> a diagnostic character set? Does anyone think that would be a worse
> approach than limiting to the basic source character set?
I don't have any data regarding use in practice but my intuition matches
Assuming a diagnostic character set were specified and that it contained
characters beyond the basic source character set, what guarantees could
be offered? Presentation of the diagnostic may be limited by factors
outside the implementation's control; for example, terminal/console
capabilities. We could state that diagnostic messages must reflect all
message characters outside the basic source character set in some way,
whether via an escape mechanism, substitution of a replacement
character, or some other method.
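
As a rough illustration of the "escape mechanism" and "replacement character" options above, here is a minimal sketch. It is not taken from any implementation; the names in_basic_source_set, Fallback, and render_diagnostic are hypothetical. It assumes the message is available as a sequence of Unicode code points and that the output encoding is ASCII-compatible for the basic source character set.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: true if cp is one of the 96 characters of the
// C++20 basic source character set.
bool in_basic_source_set(char32_t cp) {
    static constexpr std::u32string_view basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(cp) != std::u32string_view::npos;
}

// Hypothetical policy for characters outside the basic set.
enum class Fallback { Escape, Replace };

// Render a diagnostic message so that every character outside the basic
// source character set is still reflected in some way: either as a
// universal-character-name style escape or as a '?' substitute.
std::string render_diagnostic(std::u32string_view message, Fallback mode) {
    std::string out;
    for (char32_t cp : message) {
        if (in_basic_source_set(cp)) {
            // Assumption: the output encoding is ASCII-compatible here.
            out += static_cast<char>(cp);
        } else if (mode == Fallback::Replace) {
            out += '?';
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating null
            std::snprintf(buf, sizeof buf,
                          cp <= 0xFFFF ? "\\u%04lX" : "\\U%08lX",
                          static_cast<unsigned long>(cp));
            out += buf;
        }
    }
    return out;
}
```

Under these assumptions, render_diagnostic(U"caf\u00E9", Fallback::Escape) yields caf\u00E9 and the Replace mode yields caf?. An implementation whose terminal can display the character would presumably just emit it directly.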
Perhaps it is useful to think about this more abstractly. Translation
phase 1 doesn't place any restrictions on the program source. Likewise,
diagnostic messages need not have an associated concrete character set;
think of asking your smart speaker to compile a program and how it might
present diagnostics. Perhaps diagnostic generation should be defined as
the inverse of translation phase 1; the input is in the internal
character set (insert reversion of translation phase 5 as necessary) and
the output is analogous to "physical source file characters".
The encoding for diagnostics would have to be implementation-defined.
We probably can't make guarantees about it besides "superset of the basic character set".
It is still valuable that it would be decoupled from the execution character set.
And yes, I believe people might not want to restrict themselves to Latin-1. They might get replacement characters, but that seems reasonable.