Re: [isocpp-sg16] u+000d carriage return in source files

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Mon, 22 Dec 2025 17:41:57 +0100

On 12/17/25 22:34, Alisdair Meredith via SG16 wrote:
> I suspect this is all deliberately designed and specified for legacy reasons, but want to confirm that we are happy with the status quo.
>
> When we map a UTF-8 file in phase one of translation, we lose all u+000d carriage return code units in favor of new-line characters.

Yes, that's the specified treatment of (strict) UTF-8 files.

An implementation is at liberty to offer support for
not-quite-UTF-8 files by e.g. saying "it's almost like UTF-8,
except carriage return is retained".
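
As a rough illustration of the strict UTF-8 treatment discussed above: each CR LF pair, and each CR not followed by LF, ends up as a single new-line. The sketch below works on raw bytes for brevity rather than on decoded characters, and `normalize_line_endings` is a hypothetical name, not anything taken from the standard or from a real compiler.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (assumption, for illustration only): replaces every
// CR LF pair and every lone CR with a single '\n'. Existing LFs are kept.
std::string normalize_line_endings(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            // Swallow the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == '\n') {
                ++i;
            }
            out.push_back('\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

A "retain carriage return" mode of the kind mentioned above would simply skip this step for the input kinds where it is not wanted.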

> However, if we instead map any other implementation-defined encoding, we can retain u+000d carriage return characters alongside the new-line characters. That is because we apply the DOS-line-ending transformation on only the UTF-8 part of phase 1. Would it make sense to move that rewrite to after either kind of encoding has been mapped?

Then implementations can't offer a "retain carriage return" mode anymore,
under the implementation-defined phase 1 rules. That seems like a step
backward for those environments where that is desirable.

> Looking into phase 2, that kind of transform is very similar to ignoring any leading BOM. Would it make sense to move that whole transform into phase 2?

No. The BOM removal is only relevant for the "strict UTF-8" case of phase 1;
all the implementation-defined stuff in phase 1 can opt to never generate a
BOM. On the other hand, if a BOM were to survive into phase 3, it would
probably be a non-whitespace character that is not part of a preprocessing
token and your program would explode ... so we want to unconditionally
remove BOMs from the character stream of the input.
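
For illustration only, dropping a leading BOM from an already-decoded character sequence could look like the sketch below. It assumes the input has been decoded to code points, and `strip_leading_bom` is a hypothetical name rather than part of any implementation.

```cpp
#include <string_view>

// Hypothetical helper (assumption, for illustration only): returns the view
// without a leading U+FEFF BYTE ORDER MARK. A U+FEFF anywhere else is left
// untouched.
std::u32string_view strip_leading_bom(std::u32string_view chars) {
    if (!chars.empty() && chars.front() == U'\uFEFF') {
        chars.remove_prefix(1);
    }
    return chars;
}
```

Only the first character is inspected, so doing this unconditionally costs nothing for input that never contains a BOM.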

> Is this a topic worth raising (post C++26)?

What's the practical problem that causes actual pain here?

Jens

Received on 2025-12-22 16:42:04