sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Sun, 7 Jul 2019 23:31:35 +0100

On 06/07/2019 22:59, Tom Honermann wrote:
> On 7/4/19 1:01 PM, Niall Douglas wrote:
>> On 04/07/2019 17:23, Lyberta wrote:
>> Thing is, can you name a real world situation where reading the byte(s)
>> after the end of a path character range would blow up?
>>
>> Remember, these are filesystem paths. They don't have the diversity of
>> sources that a string_view would have. The chances, for example, of a
>> path_view being constructed from a memory mapped region where the tail
>> byte is exactly at the end of the mapped region is virtually nil. Any
>> reasonably likely generation of path data is going to, at worst, have
>> the character after the input be indeterminate, and not a SIGSEGV to
>> read. And the standard library can legally do stuff banned in end user
>> code, such as reading indeterminate bytes. This restriction on the kinds
>> of input would not hold for string_view, where wrapping a whole 4Kb
>> page into a string_view is an eminently sensible thing to do.
>
> Unfortunately, we don't have means to audit the world wide code base to
> determine what programmers do and don't do.
>
> I don't find it at all unlikely that string_view instances will be
> implicitly constructed from temporaries of string types that don't
> provide a null terminator.

I never claimed that. The whole point of the probe for a null terminator
on systems whose filesystem APIs require a null terminator is to check
whether a memory copy into temporary storage in order to null terminate
is needed.

So we really don't expect null termination. We do expect that path views
will be subsets of larger path strings, and that THOSE will be null
terminated.
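
To illustrate the probe described above, here is a minimal sketch. It is
not the actual path_view implementation: the names needs_copy and
with_zero_terminated are hypothetical, and it assumes the caller
guarantees that the character at data[size] is readable.

```cpp
#include <cstddef>
#include <string>

// Hypothetical probe: returns true when a copy into temporary storage is
// needed because the character just past the view is not a null.
// Assumed precondition: data[size] is readable.
inline bool needs_copy(const char *data, std::size_t size)
{
  return data[size] != '\0';
}

// Hypothetical helper: calls f with a null terminated pointer, copying
// only when the probe says the input is not already terminated.
template <class F>
auto with_zero_terminated(const char *data, std::size_t size, F &&f)
{
  if (!needs_copy(data, size))
    return f(data);             // already terminated, use in place
  std::string tmp(data, size);  // the copy supplies the terminator
  return f(tmp.c_str());
}
```

A view covering the whole of a null terminated string takes the first
branch and costs no copy. A view covering only a leading subset of that
string takes the second branch, and the probe is still a legal read
because the larger string continues past the view.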

This is why the requirement that the character after the path view be
readable is a highly undemanding requirement. It's actually *really
hard* to not fulfil it.

In my own code, I have only once encountered a situation where it
might not have been fulfilled. I was passing a filesystem path between
processes via shared memory. There was a tiny possibility that, if the
path plus the shared memory structure EXACTLY equalled 4Kb, the
character after the path might be unreadable.

Solution to the problem: extend the shared memory structure by one
(zeroed) character, so there is always a readable character after the
path view. Problem solved.
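
A rough sketch of that fix follows. The layout, the 4096 byte page size
and the names shared_path_message and store_path are all assumptions
made for illustration, not the structure from my own code. Here the
guard character is reserved inside the page rather than appended to it,
which has the same effect.

```cpp
#include <cstddef>
#include <cstring>

// Assumed page size, for illustration only.
constexpr std::size_t page_size = 4096;

// Hypothetical structure placed in shared memory. Even when the path
// fills its storage completely, the next character is the guard, which
// lies inside the same mapping and is therefore readable.
struct shared_path_message
{
  std::size_t length;                               // characters used in path
  char path[page_size - sizeof(std::size_t) - 1];   // not necessarily terminated
  char guard;                                       // always zero
};
static_assert(sizeof(shared_path_message) <= page_size,
              "message, including the guard, must fit in one page");

// Hypothetical writer: stores the path and keeps the guard zeroed.
// Returns false if the path does not fit.
inline bool store_path(shared_path_message &msg, const char *src, std::size_t len)
{
  if (len > sizeof(msg.path))
    return false;
  std::memcpy(msg.path, src, len);
  msg.length = len;
  msg.guard = '\0';
  return true;
}
```

The cost is one character of shared memory, and the reader never has to
reason about where the mapping ends.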

> I'm sensing a contradiction here as well. You have been advocating for
> omitting a char based interface because programmers sometimes use it
> incorrectly, but here you are claiming that documenting "don't do that"
> is sufficient.

There is no contradiction.

My argument is based on my estimate of the statistical chance of
developer surprise. Developers are routinely surprised when non-Western
paths fail in weird ways. By contrast, I think the statistical chance
that the character after a path view is not legally readable is
extremely low in real world use cases of path views, especially if the
documentation for a given constructor clearly states: "the character
after the input must be readable in order to use this constructor".

If the developer cannot guarantee that, then they use a different
constructor. No problem.

>> It's at least a decade or more away. POSIX wants C to implement strings
>> properly first. C are still umming and ahhing about the best design for
>> built in string objects, and Martin Uecker is working on a formal
>> proposal for that.
>>
>> They're actually very much currently stuck on whether built-in C strings
>> ought to be bags of chars, or always in UTF-8. Committee is split right
>> down the middle on that. I suggested to them that they first formalise
>> dynamic array objects, then build a UTF-8 string object on top, then
>> everybody is happy. I pointed them at Zach's C++ UTF-8 string library
>> for study.
>
> I follow WG14 loosely, but haven't seen any proposals in this area so
> far. Is there any existing work you can point me towards?

I am unaware of any, but that's because I didn't ask, and don't have a
huge interest in the area. Let me introduce you to Martin Uecker, who
would know. I'll CC you shortly.

Niall

Received on 2019-07-08 00:31:37