SG16

Subject: Re: Handling of non-basic characters in early translation phases
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-20 11:32:06


On Sat, 20 Jun 2020 at 18:11, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Sat, Jun 20, 2020 at 10:52 AM Jens Maurer via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 20/06/2020 15.47, Corentin Jabot wrote:
>>
>> > What's the implementation strategy for an implementation that wishes
>> > to provide byte pass-through in string literals under your approach,
>> > which tunnels everything through Unicode?
>> >
>> >
>> > As long as a source character can be converted to a Unicode character,
>> > and that Unicode character can be converted back to the same original
>> > character, does it matter if it was or not?
>> >
>> > If A1 -> B is a valid transcoding operation, then B -> A1 is a valid
>> > transcoding operation whether there exists or not a separate B -> A2
>> > transcoding operation.
>> > An implementation strategy would be to keep track of the original
>> > character (maybe by lexing character by character in the source file
>> > encoding), or use some other form of tracking. But does that strategy
>> > have to be specified in the standard?
>>
>> "Silently keep track of the original character" is what
>> we currently have with reverting raw string literals.
>> It feels like cheating to require phase 5 to rely on information
>> stashed away in some hiding place in phase 1.
>>
>> However:
>> My reading of the standard is that we don't allow this behavior
>> currently, because we tunnel everything through UCNs. So, allowing
>> more seems evolutionary, which is probably a different paper.
>>
>> > In particular, "there are numbers greater than 10FFFF that can be used"
>> > may not be the best implementation strategy.
>>
>> Sure, but that's for the implementation to decide. After all, we're only
>> talking about abstract characters, not about numbers.
>>
>> > > #define CONCAT(x,y) x##y
>> > > CONCAT(\, U0001F431);
>> > >
>> > > Is valid in all implementations I tested, implementation-defined
>> > > in the standard.
>> >
>> > Is the result the named Unicode character?
>> >
>> > Ok, so be it. Having this as valid is fall-out from the
>> > currently-described approach, and if it's consistent with
>> > what implementations already do, we're good.
>> >
>> > > Do you see a reason to not allow it? In particular, as we move
>> > > UCN handling later in the process, it would make sense to allow
>> > > these escape sequences to be created in phases 2 and 4 (might be
>> > > evolutionary, there is a paper)
>> >
>> > I think the status quo already allows creating UCNs like that,
>> > so this doesn't seem to be evolutionary at all.
>> >
>> >
>> > Isn't changing "it is implementation-defined whether UCNs are formed"
>> > to "UCNs are formed" evolutionary?
>>
>> Yes, it is. Do you have a reference for the "implementation-defined" part
>> here? [cpp.concat] doesn't seem to speak to the issue.
>>
> I don't want to throw a wrench into everything; however, this is what I
> believe the situation is (with the caveat that the C99 Rationale document
> is a product of WG 14 and not of WG 21):
>
> The "status quo" is the result of wording defects. The design intent is
> that the three models are isomorphic by way of making it impossible for the
> user to observe the differences between models. The undefined behaviour
> cases were designed to prevent observance of the model actually used by the
> compiler. The removal of the undefined behaviour is a departure from the
> original design intent.
>

In general, I agree that beyond "an implementation should be allowed to
resolve multiple conversion choices as it sees fit in phases 1 and 5",
very few things should be observable.
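
To make the "multiple conversion choices" point concrete, here is a minimal
sketch, not a description of any real compiler. It assumes a hypothetical
single-byte source encoding in which two bytes (0xA1 and 0xA2, picked
arbitrarily) both map to U+00C5. All names (SourceChar, phase1,
phase5_passthrough, phase5_remap) are invented for illustration. Phase 1
remembers the original byte next to the Unicode character, so phase 5 can
either pass the byte through or pick one of the valid reverse mappings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: one source character after phase 1, original byte retained.
struct SourceChar {
    char32_t code_point;     // Unicode character the byte was mapped to
    unsigned char original;  // byte as it appeared in the source file
};

// Assumed mapping: bytes 0xA1 and 0xA2 both transcode to U+00C5 (A1 -> B, A2 -> B).
char32_t to_unicode(unsigned char b) {
    if (b == 0xA1 || b == 0xA2) return U'\u00C5';
    return b;  // everything else is treated as ASCII in this sketch
}

// One valid reverse mapping (B -> A1); B -> A2 would be equally valid.
unsigned char from_unicode(char32_t cp) {
    return cp == U'\u00C5' ? 0xA1 : static_cast<unsigned char>(cp);
}

// Phase 1: convert to Unicode, but keep the original byte alongside.
std::vector<SourceChar> phase1(const std::string& bytes) {
    std::vector<SourceChar> out;
    for (unsigned char b : bytes) out.push_back({to_unicode(b), b});
    return out;
}

// Phase 5, byte pass-through: emit the byte remembered in phase 1.
std::string phase5_passthrough(const std::vector<SourceChar>& chars) {
    std::string out;
    for (const SourceChar& c : chars) out.push_back(static_cast<char>(c.original));
    return out;
}

// Phase 5 without tracking: go back through the chosen reverse mapping.
std::string phase5_remap(const std::vector<SourceChar>& chars) {
    std::string out;
    for (const SourceChar& c : chars)
        out.push_back(static_cast<char>(from_unicode(c.code_point)));
    return out;
}
```

With the source bytes 'a', 0xA2, 'b', phase5_passthrough returns the input
unchanged, while phase5_remap yields 'a', 0xA1, 'b'. Both results come from
valid transcoding operations; the difference is only which choice the
implementation makes.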

>
> The
> fopen("\\ubeda\\file.txt","r")
> example in the rationale document is meant to indicate that
> "\{U+BEDA}"
> is problematic as is plain
> "\\ubeda"
>

We should probably do something about \{U+BEDA} (I don't think that it can
currently happen).
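
For reference, a small sketch of how the plain "\\ubeda" spelling from the
rationale example is expected to behave once escape sequences are
interpreted: "\\" denotes a single backslash, so the characters that follow
are an ordinary 'u', 'b', 'e', 'd', 'a' and no universal-character-name is
formed. The variable name path is only illustrative, and this says nothing
about the \{U+BEDA} case.

```cpp
#include <cassert>
#include <cstring>

// The path from the C99 Rationale example, written as a string literal.
// Each "\\" is the escape sequence for one backslash.
constexpr char path[] = "\\ubeda\\file.txt";

// \ u b e d a \ f i l e . t x t  is 15 characters, plus the terminating null.
static_assert(sizeof(path) == 16, "15 characters plus the terminating null");
static_assert(path[0] == '\\' && path[1] == 'u',
              "a backslash followed by a plain 'u', not a UCN");
```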

>
> Similarly, the observability of funnelling through UCNs is a wording
> defect.
>
> Noting that "undefined behaviour" was the tool of choice at the time the
> wording was produced:
> In a hypothetical world where the wording had indicated that the presence
> of characters not mappable to UCNs is undefined behaviour, the compiler
> would have been free to "do the right thing".
>

If we shift the phases around such that validation of UCNs happens after
preprocessing, compilers will be allowed to do the right thing again!

>
>
>>
>> Jens
>>
>



SG16 list run by sg16-owner@lists.isocpp.org