sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 4 Jul 2019 18:01:14 +0100

On 04/07/2019 17:23, Lyberta wrote:
>> I don't personally think that implicit construction of a view from other
>> views is problematic. You've already loaded and pointed that footgun.
>>
>> The pointer and size constructors are the same as those for any other
>> string-like view. Not having them would be surprising.
>
> I think the problem is that you require the element past the end of
> range to be readable. That goes against usual ranges invariants. I think
> the logic for checking null terminator should be checking last element
> of the range, like operator[](size() - 1) for non empty range. That way
> users of ranges get the default behavior and no footgun is at play.

Thing is, can you name a real world situation where reading the byte(s)
after the end of a path character range would blow up?

Remember, these are filesystem paths. They don't have the diversity of
sources that a string_view would have. The chance, for example, of a
path_view being constructed from a memory mapped region where the tail
byte is exactly at the end of the mapped region is virtually nil. Any
reasonably likely generation of path data is going to, at worst, have
the character after the input be indeterminate, and not a SIGSEGV to
read. And the standard library can legally do stuff banned in end user
code, such as reading indeterminate bytes. These restrictions on the
kinds of input would not hold for string_view, where wrapping a whole
4 KiB page into a string_view is an eminently sensible thing to do.

And besides, this is a *documentation* thing. If the API documentation
says "the user must guarantee that the character after is readable",
then violating that is on the user. We can even add it as a contract
precondition. I think that's okay, personally. It's in the same category
as vector::operator[](vector::size()). Just don't do that.
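
To make the two checks being debated concrete, here is a minimal sketch.
The function names are hypothetical and are not taken from any proposal;
the second function assumes the documented precondition that the element
at data[size] is readable.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the in-range check suggested in the quoted text.
// It only ever reads elements inside [data, data + size).
constexpr bool last_element_is_nul(const char* data, std::size_t size) noexcept {
    return size != 0 && data[size - 1] == '\0';
}

// Hypothetical helper: the one-past-the-end check.
// Assumed precondition (the caller's responsibility): data[size] is readable.
inline bool element_after_is_nul(const char* data, std::size_t size) noexcept {
    assert(data != nullptr);
    return data[size] == '\0';
}
```

With a buffer such as const char buf[] = "/tmp/file", the one-past-the-end
check succeeds for (buf, 9) without the view having to include the
terminator, whereas the in-range check only succeeds if the caller widens
the view to (buf, 10) so that the NUL becomes part of the range.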

> I think NUL-terminated strings are a big mistake of C and newer Unicode
> library shouldn't use them. std::basic_string already breaks invariants
> by allowing embedded NULs.
>
> I know that it will take a while for operating systems to stop using NUL
> terminated strings but WebAssembly System Interface already fixed that
> by using pointer and size. I expect all future OSes not to use
> NUL-terminated strings.

It's at least a decade away. POSIX wants C to implement strings
properly first. The C committee is still umming and ahhing about the
best design for built in string objects, and Martin Uecker is working
on a formal proposal for that.

They're currently very much stuck on whether built-in C strings ought
to be bags of chars, or always in UTF-8. The committee is split right
down the middle on that. I suggested to them that they first formalise
dynamic array objects, then build a UTF-8 string object on top, then
everybody is happy. I pointed them at Zach's C++ UTF-8 string library
for study.

Niall

Received on 2019-07-04 19:01:18