Re: [SG16] During lexing, What constitute new lines and whitespaces ?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 17:07:10 +0200
On Mon, Jun 1, 2020, 17:04 Peter Brett <pbrett_at_[hidden]> wrote:

> Is it viable to defer to Unicode for the definitions of new-line and
> whitespace?
>

We do it for UAX #31 in Steve's paper, so we might be able to.
But it might be worth being explicit about new lines: there is no category or
block for them, and some new lines are more than one code point.

I think white space may be definable in terms of Pattern_White_Space + new
lines.
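
To make that concrete, here is a minimal sketch of a Pattern_White_Space test. The code point list is the one Unicode assigns to that property (the property referenced by UAX #31); the function name is hypothetical and not taken from any paper. Note that the property already contains every single-code-point line terminator from the original message, so only the CR+LF sequence would need separate wording.

```cpp
// Sketch only: membership test for the Unicode Pattern_White_Space property.
// The function name is an assumption made for this example.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)  // TAB, LF, VT, FF, CR
        || c == 0x0020                   // SPACE
        || c == 0x0085                   // NEL
        || c == 0x200E || c == 0x200F    // LRM, RLM
        || c == 0x2028 || c == 0x2029;   // LS, PS
}

// NO-BREAK SPACE has White_Space but not Pattern_White_Space.
static_assert(is_pattern_white_space(0x0085), "NEL is included");
static_assert(!is_pattern_white_space(0x00A0), "NBSP is excluded");
```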

>
>
> Peter
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Corentin via
> SG16
> *Sent:* 01 June 2020 13:54
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>
> *Subject:* [SG16] During lexing, What constitute new lines and
> whitespaces ?
>
>
>
>
> The standard doesn't specify what the new-line character is.
>
> According to Unicode, the following code point sequences should be
> considered line terminators:
>
>
>
> LF: Line Feed, U+000A
> VT: Vertical Tab, U+000B
> FF: Form Feed, U+000C
> CR: Carriage Return, U+000D
> CR+LF: CR (U+000D) followed by LF (U+000A)
> NEL: Next Line, U+0085
> LS: Line Separator, U+2028
> PS: Paragraph Separator, U+2029
>
>
>
> Similarly, the standard defines "white space" loosely as "blanks,
> horizontal and vertical tabs"; however, there are more white space
> characters in Unicode: https://en.wikipedia.org/wiki/Whitespace_character
>
>
>
> What I would like to do:
>
>
>
> * Define new-line and white-space as grammar terms, with an explicit list
> of code point sequences.
>
> * In phase 2, replace every code point sequence that represents a line
> termination with Line Feed (which is reverted later for raw string
> literals). This would notably fix https://wg21.link/cwg1655
>
> * It would also help to mandate that trailing whitespace is removed in
> phase 2.
>
>
>
> Does that make sense to anyone?
>
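
For illustration, the phase 2 replacement described in the quoted message could be sketched as follows. This is only a sketch under assumptions: the input is taken to be already decoded to UTF-32, the terminator list is the one quoted above (whether VT and FF belong in it is part of the open question), and both function names are hypothetical. Reverting the replacement inside raw string literals is not shown.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Single code points from the quoted list of line terminators.
constexpr bool is_line_terminator(char32_t c) {
    switch (c) {
        case 0x000A: case 0x000B: case 0x000C: case 0x000D:  // LF, VT, FF, CR
        case 0x0085: case 0x2028: case 0x2029:               // NEL, LS, PS
            return true;
        default:
            return false;
    }
}

// Replace every line terminator with a single LF.
// CR immediately followed by LF counts as one terminator.
std::u32string normalize_line_terminators(std::u32string_view in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (!is_line_terminator(c)) {
            out.push_back(c);
            continue;
        }
        if (c == 0x000D && i + 1 < in.size() && in[i + 1] == 0x000A) {
            ++i;  // consume the LF of a CR+LF pair
        }
        out.push_back(0x000A);
    }
    return out;
}
```

Handling CR+LF before the single-code-point cases is what keeps a Windows-style line ending from turning into two new-lines.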

Received on 2020-06-01 10:10:30