sg7: Re: [SG7] std::embed / Compile time IO

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Tue, 19 Nov 2019 13:17:44 -0500

Corentin and SG7:

     I have put my comments in below intermingled with the original
e-mail's text; I apologize if it is hard to read. As a summary:

     - I will not be pursuing std::embed at this time.
     - #embed is not going to be proposed to WG21 right now.
     - I will happily await any new direction for Compile Time I/O, perhaps
in relation to having an entirely constexpr P1031 LLFIO (
https://wg21.link/p1031r2) or within the context of a more powerful
preprocessor/pre-Phase-7 meta-generator that looks like Dr. Sutton's work (
https://www.youtube.com/watch?v=kjQXhuPX-Ac) or Sean Baxter's Circle (
https://github.com/seanbaxter/circle). I am woefully uneducated about the
details and minutiae of these approaches.

     P1040 - std::embed is frozen. If you would like to pursue it, feel
free to write a new paper and reference P1040 (or even directly pull
information out of it).

> Hello,
> I am a bit concerned with the direction std::embed is going and I'd like
> to see if we can agree on a few things.
>
> A preprocessor based approach does not offer sufficient benefits
>
> I understand that JeanHeyd has been given enough contradictory guidance
> that he might be tempted to go with the preprocessor #embed solution.
> I am concerned that this would not solve anything.
>
> The entire value of std::embed is to improve compile time. Anything that
> would create a node per byte would have disastrous
> compile time performance. My observation is that parsing is not the
> bottleneck.
>

     Parsing, validating, and storing a sequence of expressions that
ultimately is just a comma-delimited, brace-hugged list of numbers creates
problems so bad that even Static Analysis companies have reached out to me
in support of #embed and std::embed. The current specification for #embed
states that it is treated "as if" a brace delimited list of integer-literal
values is generated. A prudent implementation would generate a builtin for
this to save on the parsing of exceptionally large arrays: both GCC and
Clang implementers have stated that this is possible and easy to do (my
implementation is a toy, and thusly does not do this quite yet).
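
     To make the "as if" wording concrete, here is a minimal sketch. It
assumes a hypothetical three-byte resource named greeting.bin; the file
name, the variable names, and the byte values are illustrative only and do
not come from the proposal. The directive is shown in a comment, and the
array below it is the brace-delimited list it would behave like. An
implementation is free to reach the same result through a builtin instead
of materializing one literal per byte.

```cpp
#include <cstddef>

// Hypothetical resource "greeting.bin" holding the three bytes 'H', 'i', '\n'.
// Under the "as if" wording, a use such as
//     const unsigned char greeting[] = {
//     #embed "greeting.bin"
//     };
// behaves as though the directive had been replaced by the list below.
const unsigned char greeting[] = { 72, 105, 10 };

// The array bound is deduced from the number of elements in the list.
constexpr std::size_t greeting_size = sizeof(greeting) / sizeof(greeting[0]);

static_assert(greeting_size == 3, "one array element per byte of the file");

// Self-check: the elements equal the file's bytes (assumes an ASCII
// execution character set for the character literals).
inline bool greeting_matches_file_bytes() {
    return greeting[0] == 'H' && greeting[1] == 'i' && greeting[2] == '\n';
}
```

For a multi-megabyte resource the same list would contain millions of
elements, which is the parsing and storage cost described above.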

> Sure that might be solvable by an attribute which would change how arrays
> are represented by the compiler.
> Even then, the compiler would have to store the entire file's content in
> memory whereas std::embed can be designed to be backed by a memory mapped
> file.
>

     Having implemented this in 2 compilers, this was not my experience.
For example, Clang already has a SourceManager and FileManager which caches
data from files large and small, and string literals in many cases get
folded together and interned for speed and memory savings. All the
implementation needs to do is point to that memory, memory-mapped or not.
This is the reason for the "as if" language in the wording, to allow
builtins to take advantage of this or potentially other representations
(e.g., "A dedicated AST node", as a person present in the EWG discussion of
#embed put it).

> And so a preprocessor based approach would have little value over
> generated source files and suffer many of the same issues.
> That it can be pushed through the committee faster should not be a reason
> to pursue that direction.
>

     I think you are fundamentally misinterpreting the reasons for #embed.
The goal is to provide a before-Phase-7 (not-constexpr) way for scanners
and dependency managers that read source code to directly resolve resources
(typically, files) without potentially requiring full semantic analysis
(i.e., almost every step of compilation before code generation). This
enables current-generation distributed build systems to work. A
preprocessor-based approach provides exactly that, and given the reception
of P1130 (https://thephd.github.io/vendor/future_cxx/papers/d1130.html) by
EWG, there was no appetite for a modules-specific syntax. There was also no
appetite for a special kind of string literal for this; see P1040's "prior
art" section (
https://thephd.github.io/vendor/future_cxx/papers/d1040.html#design-prior).
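
     As a rough illustration of why a directive is friendly to dependency
scanners, the sketch below collects resource names with plain text
scanning and no semantic analysis. The function name
scan_embed_dependencies is hypothetical, and the scanner is deliberately
simplified: it ignores comments, line continuations, and macro-expanded
operands, all of which a real scanner would have to handle.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return the operand of every #embed directive found in
// `source`, in order of appearance. Only the text of each line is inspected.
std::vector<std::string> scan_embed_dependencies(const std::string& source) {
    std::vector<std::string> resources;
    std::istringstream lines(source);
    std::string line;
    while (std::getline(lines, line)) {
        std::size_t i = 0;
        auto is_space = [&](std::size_t k) {
            return std::isspace(static_cast<unsigned char>(line[k])) != 0;
        };
        auto skip_space = [&] {
            while (i < line.size() && is_space(i)) ++i;
        };
        skip_space();
        if (i >= line.size() || line[i] != '#') continue;  // not a directive
        ++i;
        skip_space();                                      // "# embed" is also valid
        if (line.compare(i, 5, "embed") != 0) continue;    // some other directive
        i += 5;
        // Reject longer identifiers that merely start with "embed".
        if (i < line.size() && !is_space(i) && line[i] != '"' && line[i] != '<') continue;
        skip_space();
        if (i >= line.size()) continue;
        // The operand is delimited by "..." or <...>.
        char close = line[i] == '"' ? '"' : (line[i] == '<' ? '>' : '\0');
        if (close == '\0') continue;
        std::size_t end = line.find(close, i + 1);
        if (end == std::string::npos) continue;
        resources.push_back(line.substr(i + 1, end - i - 1));
    }
    return resources;
}
```

A call to a constexpr function such as std::embed("foo.txt") cannot be
resolved this way in general, because its argument may be the result of
arbitrary constant evaluation.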

     "Push through the Committee" is both presumptuous and extremely
insulting to my treatment of the Process and the guidance laid out in P0939
(https://wg21.link/p0939) and P1000 (https://wg21.link/p1000). Expedience
of Committee acceptance was never a goal; my work has always been thorough,
user-focused, and pooled from listening to the vast number of users and
their varied needs. #embed rose out of the need to support a simple +
intuitive syntax for grabbing file contents in the simplest case ("keep
simple things simple"), distributed build systems, and current dependency
scanning tools: there are no other reasons. std::embed has sat in my lap
for over a year, and despite several e-mails from hobbyist developers to
U.S. National Lab engineers begging me to make progress, I took the time to
understand the entirety of the constituency and propose solutions that
solved their issues. If that is "push through the Committee", then I am not
sure what other words you need to hear to be convinced otherwise.

     But, words are cheap: perhaps this e-mail showing that I am no longer
pursuing the area altogether in WG21 will serve as ultimate proof of my
commitment to the quality, not the speed, of the process.

> Security concerns
> It was always my understanding that embeddable resources would be found by
> a mechanism similar to include paths
> and as such giving full filesystem access was never really on the table ?
> Is such a flag insufficient? It would require inspecting what the build
> system does, rather than individual file.
> I would be sympathetic to a per-file mechanism to identify resources that
> can be opened but would like it decoupled from std::embed
>
> ie:
> #pragma resource "foo.txt"
> [...]
> std::embed("foo.txt")
> We might also want to make all paths relatives and implementations can
> support a white/blacklist of resource path.
> I think it's important to support both file and directories - but
> supporting only files as a first approach seems reasonable.
>

     I am woefully unequipped to answer security concerns. I wish you the
best in figuring it out.

> Tooling
> A mechanism decoupled of std::embed as in the paragraph above would
> support the needs of tooling.
> I don't think tooling support should stop std::embed though.
> Having to specify dependencies on resources manually is reasonable given
> that resources should be few
>

     The tooling vendors in the room for std::embed discussion strongly
disagree with your assertions here. In the current world they can find
every resource without having to invoke full compilation. std::embed makes
that hard for them. This was cause for Weakly Against and Strongly Against
votes; please see the Belfast Wiki of EWG and SG7 Discussion for the
individuals and contact them directly (or they can self-volunteer
information here).

> Modules
> To properly support std::embed and modules, BMIs should keep track of
> resources path when they are compiled relative to the BMIs, or the source
> files
> The idea is that an importing module should have access to the same
> resources as the imported modules.
> Alternatively it can be the responsibility of the build system to deal
> with that.
>

     I am not adept in how modules behave and can offer no reassurances
here.

> More General Solution ?
> It is unclear to me that std::embed is not already the general solution.
> It returns a span, which means it can be manipulated with any algorithm
> or view adapter offered by the standard library, most of which are
> constexpr already.
> This is strictly more versatile than file which we have fewer tools to
> handle.
> In fact i suspect a primary use case for memory mapped file will be to
> wrap them in span and use them with algorithm.
>
> Moreover, the concerns that std::embed has security concerns but is not
> general enough seem antithetic.
> I don't think we want write capabilities, or i/o on fd that are not file
> on disks - more for reproducibility than security reasons.
> I am also not convinced that the use cases for runtime i/o and compile
> time/io are the same.
>
> Nevertheless, I see several solutions:
>
> - Making file and mapped_file as proposed by P1031[1] constexpr
> (partially, we need a small subset), which I believe Niall has been working
> towards
> - Continue with std::embed which does not preclude a constexpr
> stream-like interface on top of a span in the future
>
>
> Trying to blend compile-time and runtime io, seems to me to be only
> seductive on the surface, but in reality the use cases will be different,
> and I believe std::embed serves all the use cases people actually care
> about in practice.
>

     SG7 sees that there is a need to solve this problem, but in Belfast
they disagreed with the assertion that std::embed is the solution. SG7
oversees Compile-time Programming, and without their support std::embed
goes nowhere. I also have no direction for continued work, except to try to
envision a new kind of limited stream that pulls from a potential pool of
compiler-specified directions. std::resourcestream, or something similar?
It is unclear what the forward direction is or should be, and I do not have
the time to join SG7 in discovering a potentially new direction for
Compile-time Programming. Given the power that std::embed in its current
form enables, the Build System use cases it was designed to cover, and the
alternatives presented for P1040 in small-room SG7 discussion, I would
rather not move forward in shaky, unsure territory and will instead freeze
the current proposal's progress.

     I have no plans to unfreeze it. Feel free to write your own paper(s)
on the subject that cover the exact same space as std::embed; this
message should serve as proof to individuals who ask "Did you Consult with
Author X about P1040, it seems similar?" that you do not need to wait or
wonder where this paper is going.

Best of Luck,
JeanHeyd Meneide

Received on 2019-11-19 12:20:15