sg16: Re: [SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 6 Jul 2019 17:59:17 -0400

On 7/4/19 1:01 PM, Niall Douglas wrote:
> On 04/07/2019 17:23, Lyberta wrote:
> Thing is, can you name a real world situation where reading the byte(s)
> after the end of a path character range would blow up?
>
> Remember, these are filesystem paths. They don't have the diversity of
> sources that a string_view would have. The chances, for example, of a
> path_view being constructed from a memory mapped region where the tail
> byte is exactly at the end of the mapped region is virtually nil. Any
> reasonably likely generation of path data is going to, at worst, have
> the character after the input be indeterminate, and not a SIGSEGV to
> read. And the standard library can legally do stuff banned in end user
> code, such as reading indeterminate bytes. This restricted kind of
> input would not be the case for string_view, where wrapping a whole 4 KB
> page into a string_view is an eminently sensible thing to do.

Unfortunately, we don't have means to audit the world wide code base to
determine what programmers do and don't do.

I don't find it at all unlikely that string_view instances will be
implicitly constructed from temporaries of string types that don't
provide a null terminator.
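
To make the concern concrete, here is a minimal sketch, assuming C++17.
The names to_terminated and demo_unterminated_source are hypothetical
and are not part of any proposal. It builds a view over a buffer that
has no terminator, then obtains a terminated string by copying rather
than by inspecting the byte one past the end of the view.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: return a null-terminated copy of `view` without ever
// reading view.data()[view.size()], which the caller may not own.
std::string to_terminated(std::string_view view) {
    // std::string always keeps a terminator after its last character.
    return std::string(view);
}

// Hypothetical demonstration: the source owns exactly four chars and no
// terminator, as a non-null-terminated string type might.
std::string demo_unterminated_source() {
    const std::vector<char> raw = {'/', 't', 'm', 'p'};
    const std::string_view view(raw.data(), raw.size());

    // view.data()[view.size()] lies outside the vector's storage here, so
    // user code must not read it to probe for a terminator.
    std::string owned = to_terminated(view);
    assert(owned.c_str()[owned.size()] == '\0');
    return owned;
}
```

The copy costs an allocation for longer inputs, which is the cost a
peek past the end would avoid, but it places no requirement on what
follows the viewed range.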

> And besides, this is a *documentation* thing. If the API documentation
> says "the user must guarantee that the character after is readable",
> then violating that is on the user. We can even add it as a contract
> precondition. I think that's okay, personally. It's in the same category
> as vector::operator[](vector::size()). Just don't do that.

I don't think this is ok as it is inconsistent with user expectations.

I'm sensing a contradiction here as well. You have been advocating for
omitting a char based interface because programmers sometimes use it
incorrectly, but here you are claiming that documenting "don't do that"
is sufficient.

> It's at least a decade or more away. POSIX wants C to implement strings
> properly first. C are still umming and ahhing about the best design for
> built in string objects, and Martin Uecker is working on a formal
> proposal for that.
>
> They're actually very much currently stuck on whether built-in C strings
> ought to be bags of chars, or always in UTF-8. The committee is split right
> down the middle on that. I suggested to them that they first formalise
> dynamic array objects, then build a UTF-8 string object on top, then
> everybody is happy. I pointed them at Zach's C++ UTF-8 string library
> for study.

I follow WG14 loosely, but haven't seen any proposals in this area so
far. Is there any existing work you can point me towards?

Tom.

Received on 2019-07-06 23:59:20