ISOCPP SG15 List: Re: P2898R0: Importable Headers are Not Universally Implementable

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 24 May 2023 13:04:50 -0400

On 5/23/23 10:44 PM, Bret Brown wrote:
> Two broad points:
>
> First, Tom refers to a singular dependency scanner, build system, and
> target implementation.

Not quite. My intent was to state that they must behave consistently;
for example, the dependency scanner, build system, and compiler must all
be in sync when it comes to how #include directives are mapped to source
files. Likewise, they must be in sync when it comes to whether a
particular #include directive is rewritten to an import declaration or
whether an import declaration names an importable header. Basically,
they all require the same input with regard to (a subset of) the
parameters of the abstract machine.
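
A minimal sketch of what "in sync" could mean in practice, assuming a
hypothetical HeaderPolicy structure that every tool is handed. The type and
function names are illustrative only and do not correspond to any real
compiler or build system interface.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical shared input: the subset of abstract machine parameters that
// the scanner, the build system, and the compiler must agree on.
struct HeaderPolicy {
    std::set<std::string> importable;  // resolved paths of importable headers
    bool rewrite_includes;             // is #include rewritten to import?
};

enum class Treatment { TextualInclusion, Import };

// How a tool treats `#include` of a header that has already been resolved
// to a source file via the (implementation-defined) search path.
inline Treatment classify_include(const HeaderPolicy& policy,
                                  const std::string& resolved_path) {
    assert(!resolved_path.empty());
    if (policy.rewrite_includes && policy.importable.count(resolved_path) != 0)
        return Treatment::Import;
    return Treatment::TextualInclusion;
}

// Two tools are consistent for a header only if they reach the same answer.
inline bool tools_agree(const HeaderPolicy& scanner,
                        const HeaderPolicy& compiler,
                        const std::string& resolved_path) {
    return classify_include(scanner, resolved_path) ==
           classify_include(compiler, resolved_path);
}
```

If the scanner and the compiler are given different HeaderPolicy values,
tools_agree reports the mismatch for exactly those headers whose treatment
differs.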

> In practice, projects can and do target multiples of each, and build
> systems might not even exist. I also consider static analysis tools,
> interactive editing environments, remote build workflows, language
> servers from other toolchains, and so on. I cannot bring myself to
> hand-wave that this is a simple matter of programming when these tools
> all need to support professional development workflows by presenting
> a coherent understanding of a codebase to the end user. I can see how
> treating importable headers as just an optimization on textual inclusion
> provides coherency. Alternatively, I can see how consistent and
> portable rules about implicit importation are coherent as well. I'm
> not seeing other designs that work, just statements of faith that they
> should and do in cases where there's a sufficiently smart build
> system. Speaking of evidence, I think evidence that importable headers
> work well outside of well-governed and comprehensive build systems
> (i.e., monorepos) is thin so far. Specifics would help a lot here.
I can't speak to use of C++20 header units or explicit builds of Clang
modules. What I can say is that, while at Coverity, I triaged many bug
reports from customers that were using Clang modules with implicit
module builds (the bug reports were unrelated to use of modules in most
cases). I can't offer a report count or a percentage. (I can also say
that the inability to use preprocessed output to reproduce a problem
encountered in the field is a major PITA).
>
> Second, asking for that evidence is partly tongue in cheek because I
> think the biggest point against importable headers is the paucity of
> investment to support them at the tooling and training level. I'm
> willing to be proven wrong though. For instance, maybe parties who
> find this feature straightforward and supportable can find ways to
> contribute build system enhancements, importability linters, packaging
> metadata, how-to guides, and so on to bring them closer to being
> settled and teachable technology. Otherwise, we might end up with a
> feature that is disappointingly niche (or dead) for practical reasons
> if not technical ones. At some point, that's reason enough to
> destandardize or at least narrow the scope of the feature, at least in
> my mind.
Maybe that lack of support is because, in practice, implicitly built
modules, despite their limitations, actually work pretty well.
>
> In other words, why are we making significant progress [1] on making
> named modules widely useful but not making the same headway on
> importable headers?

Cultural concerns are a possibility; the people working on this might
just have a preference for named modules.

Tom.

>
> I do appreciate all the engagement on this issue, by the way. It's
> clear to me that the path forward is by improving our collective
> understanding of the opportunities and challenges here.
>
> Bret
>
> [1] I feel like investment in named modules is also below what is
> justified, but at least that feature is getting somewhere.
>
>
>
> On Tue, May 23, 2023, 21:54 Tom Honermann via SG15
> <sg15_at_[hidden]> wrote:
>
> On 5/23/23 11:54 AM, Daniel Ruoso wrote:
>> On Mon, May 22, 2023 at 16:17, Tom Honermann
>> <tom_at_[hidden]> wrote:
>>> First, I don't know what is meant by "universally implementable". The paper doesn't offer a definition of the term and its use within the paper doesn't present an intuitive meaning, at least not for me (note that "universally implementable" appears only in the title and in an assertion in section 2.1 in the discussion of #pragma once).
>> The Abstract covers what I mean. Essentially, the C++ specification
>> needs to be implementable in all the places where C++ is currently
>> used, and the specification of Importable Headers currently fails that
>> criterion. It is only implementable in a subset of environments where
>> C++ is used.
> What subset of environments is it not implementable in? The paper
> doesn't actually say as far as I can tell. What is a concrete example?
>>> If the concern is that dependency scanning for importable headers is dependent on which implementation the result of the dependency scanning will be used to construct a build system for, then I agree that dependency scanning does not produce an implementation independent result.
>> It's not about it being implementation-independent or not, it's about
>> the costs being prohibitive in a lot of environments where C++ is
>> used. Everything is technically possible, but every decision has
>> costs, some of which may be unaffordable to some use cases. IMHO, the
>> costs of Importable Headers are unaffordable for environments that use
>> open-ended build systems, such as systems that have dependencies
>> expressed entirely by producing binary artifacts that are used as
>> dependencies of other builds. This is the case at Bloomberg, but it's
>> also the case for systems like Conan, vcpkg, as well as most GNU/Linux
>> distributions.
> The paper doesn't quantify costs in any way. The closest it comes
> is (correctly) noting that dependency scanning for header units
> requires computing imported macros in a bottom up way that is not
> required for named modules. But the paper doesn't quantify that
> cost. Is it a 5% hit? A 95% hit? Linear with respect to the number
> of header units? Other papers have offered quantification; see
> P1441 (Are modules fast?) <https://wg21.link/p1441> for example.
>>> But I also don't find that to be particularly concerning either; absent a (possibly implementation dependent and often complicated) include search path, it is not possible, in general, to universally map #include directives to source files either.
>> Correct. Prior to modules, that had a specific cost. The build system
>> needs to incorporate the source inclusions into the dependency graph
>> after the initial compilation such that incremental builds are done
>> correctly.
>>
>> With named modules, the build system needs to be able to incorporate
>> dependencies *prior* to a clean build. This required a change in
>> ninja, for instance. And this is what has allowed us to have a proper
>> plan on how we're going to get that implemented.
> Indeed; I believe it is equivalent to requirements for support of
> generated headers.
>> With Importable Headers, however, the dependency scanning itself has a
>> dependency on additional information before any source file is even
>> read, I'll dive into this in the next paragraphs.
>>> Section 2.1 states:
>>>> This identity problem has always been a complicated topic for the C++ specification, the `#pragma once` directive has been supported by various implementations in varying degrees of compatibility, but it cannot be universally implemented because we don’t have a way of specifying what is the thing that should be included only “once” given the way that header search works.
>>> This issue does exist (and has for a very long time) but I don't see how it is relevant. Since header to source file mapping is implementation-defined, a dependency scanner or build system must match the implementation-defined behavior for the targeted implementations.
>> The dependency scanning needs to be able to map how the importable
>> header is specified in order to understand which `#include` directives
>> can be turned into imports.
>>
>> That means we need to establish an identity between the files actually
>> opened by the dependency scanning process to the tokens used in both
>> `#include` and `import` directives before the dependency scanning even
>> runs.
> Yes. This is what I meant by "a dependency scanner or build system
> must match the implementation-defined behavior for the targeted
> implementations."
>>> Section 2.2 states:
>>>> The cost of that approach, however, is that we create a significant bottleneck in the dependency chain of any given object. Changing the list of Importable Headers or the Local Preprocessor Arguments for any one of them will result in a complete invalidation of the dependency scanning for all translation units in the project.
>>> I think the bold text is not quite correct. Changing whether a header is importable or not only invalidates the TUs that include/import it.
>> Not quite, it invalidates the dependency scanning itself, because the
>> switch from source inclusion to import (even if it's still spelled as
>> `#include` in the source) can change the way the preprocessor handles
>> the code after the import, which can change the output of the
>> dependency scanning.
>>
>> The problem, again, is that this information is needed before the
>> dependency scanning even runs. If that information is an input to the
>> dependency scanning, it means changing it invalidates it.
> However, the impacted TUs are only those that already had a
> #include directive for the header. TUs that didn't are not
> affected. I do appreciate that optimizing for this case requires
> understanding precisely what changed and that the simplest
> approach is to just perform the whole scan. This seems like a QoI
> issue to me.
>>> Likewise, changing how a header unit is built only invalidates the TUs that import it (which may affect additional TUs if the importing TU is also an importable one).
>> Again, changing the initial preprocessor state of the importable
>> header unit invalidates the dependency scanning itself for the entire
>> project, since the dependency scanning has to emulate the behavior of
>> the import. And the dependency scanning needs to run before we know
>> anything else about the code.
>
> Per above, it seems that we disagree on this point.
>
> The problem is equivalent to an update to a generated header file.
> Generated headers likewise need to be built and scanned as part of
> dependency generation. But updating one of them (via an update to
> the generator or another of the inputs) doesn't invalidate the
> dependency information for the entire project; it just invalidates
> the dependencies for the TUs that include the generated header.
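
A minimal sketch of the selective invalidation argued for above, assuming the
build system has cached, per TU, the flattened set of headers that the last
scan found. DepGraph and invalidated_tus are hypothetical names, not part of
any real build system.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Cached scan result: translation unit -> every header it (transitively)
// includes or imports. Assumed to come from the previous dependency scan.
using DepGraph = std::map<std::string, std::set<std::string>>;

// Returns only the TUs whose cached dependencies mention the changed header.
// TUs that never referenced it keep their dependency information.
inline std::set<std::string> invalidated_tus(const DepGraph& deps,
                                             const std::string& changed) {
    assert(!changed.empty());
    std::set<std::string> out;
    for (const auto& entry : deps) {
        if (entry.second.count(changed) != 0)
            out.insert(entry.first);
    }
    return out;
}
```

Whether a given build system can apply this optimization depends on whether
it tracks inputs at this granularity; one that cannot would fall back to
rescanning everything.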
>
>>> So, instead of "will result", I would substitute "might result". I don't see any reason why a build system can't cache these results and update them when they are found to be violated;
>> I'm not sure what you mean by that. The way that this works for named
>> modules is:
>>
>> 1. You start with the compiler command line for the translation unit,
>> and use that to calculate the command line for dependency scanning
>> (input is only the build configuration)
>> 2. The dependency scanning for that translation unit produces P1689.
>> The inputs at this point are just the command line options and the
>> contents of the file. The output is only module-level dependencies,
>> because source inclusion is only needed to augment the incremental
>> build.
>> 3. The dependency information is collated by the build system in order
>> to generate the input module map for each TU. This, again, does not
>> depend on source inclusion information. And if the build system uses a
>> predictable naming scheme for modules, it doesn't need to know how the
>> other TU was produced, it can just leave the dependency edge to be
>> tracked by the build system.
> That sounds right given the restrictions on module units in
> [module.import]p1 <http://eel.is/c++draft/module.import#1>.
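
The three steps above can be sketched as follows. ScanResult is a
hypothetical, heavily simplified stand-in for one TU's P1689 output (the real
format is JSON with more fields), and the "bmi/<name>.bmi" naming scheme is an
assumption made only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the scan output of a single translation unit.
struct ScanResult {
    std::string source;                // translation unit that was scanned
    std::string provides;              // named module it exports, empty if none
    std::vector<std::string> imports;  // named modules it imports
};

// Assumed predictable naming scheme: the BMI path follows from the module
// name alone, so the producer of the BMI does not need to be known here.
inline std::string bmi_path(const std::string& module_name) {
    assert(!module_name.empty());
    return "bmi/" + module_name + ".bmi";
}

// Step 3: collate into the module map for one TU (module name -> BMI path).
// No source inclusion information is consulted.
inline std::map<std::string, std::string> module_map_for(const ScanResult& tu) {
    std::map<std::string, std::string> result;
    for (const std::string& name : tu.imports)
        result[name] = bmi_path(name);
    return result;
}

// Dependency edges: sources that must be compiled before `tu`.
inline std::vector<std::string> prerequisites(const ScanResult& tu,
                                              const std::vector<ScanResult>& all) {
    std::vector<std::string> out;
    for (const std::string& name : tu.imports)
        for (const ScanResult& other : all)
            if (other.provides == name)
                out.push_back(other.source);
    return out;
}
```

Note that nothing in this sketch takes a list of importable headers as input,
which is the property the discussion contrasts with header units.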
>> The end result is that for the initial build, the only inputs for a
>> translation unit are the command line, the sources for the translation
>> unit and the files provided by other targets in the build system that
>> are listed in the module map.
> As well as any generated header files that are included in the TU
> (typically none of course).
>> When Importable Headers are introduced, we have an input to the
>> dependency scanning process, which means that the rule to produce the
>> module mapper depends on the list of importable headers and the
>> arguments to those. The module mapper data is an input to the actual
>> translation, which will invalidate the translation unit itself.
> No, the invalidation only occurs if the TU actually uses that input.
>> The end result is that any change to that input invalidates the builds
>> of all translation units in the project.
> Per prior statements, I disagree.
>> Some build systems may be able to work around that issue in various
>> ways. The way Clang Header Modules and the early MSVC adoption work
>> around it is by assuming that the dependency scanning doing a plain
>> source inclusion is equivalent to the actual dependencies that will be
>> needed later. It is considered a user error if that is not true.
> MSVC and Clang with explicit modules assume that, yes. Clang
> modules is smarter about it when modules are implicitly built. I
> haven't played with Clang with explicitly built modules to know if
> it diagnoses incorrect assumptions.
>> IMHO, that is not a "correct" implementation of the specification. The
>> specification requires a correct dependency scanning to emulate the
>> import, by starting a new preprocessor context based on the arguments
>> used when translating the header unit and merging the final state into
>> the state of the TU doing the import. Which means it needs to know the
>> list of header units and the arguments for each of those.
> I would argue that is a QoI concern, but I otherwise agree.
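
A rough model of the difference being described, assuming a header's effect on
the preprocessor can be reduced to "the macros it defines, given the macro
state it starts from". MacroState, HeaderEffect, and both functions are
hypothetical; conflict diagnostics and everything other than macros are
deliberately not modeled.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

using MacroState = std::map<std::string, std::string>;

// Hypothetical model of a header: given the macro state it is processed
// under, it yields the macros it defines.
using HeaderEffect = std::function<MacroState(const MacroState&)>;

// Source inclusion: the header is processed under the includer's state.
inline MacroState include_textually(MacroState includer, const HeaderEffect& header) {
    const MacroState defined = header(includer);
    for (const auto& kv : defined)
        includer[kv.first] = kv.second;
    return includer;
}

// Import: the header is processed in a fresh context built from the header
// unit's own arguments, and its final macros are merged into the importer.
inline MacroState import_header_unit(MacroState importer,
                                     const MacroState& unit_arguments,
                                     const HeaderEffect& header) {
    const MacroState exported = header(unit_arguments);
    for (const auto& kv : exported)
        importer[kv.first] = kv.second;
    return importer;
}
```

A scanner that only calls include_textually can therefore report a different
macro state, and so potentially different dependencies, than one that is told
the header unit's arguments and calls import_header_unit.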
>> Doing the correct thing will mean any environment using a build system
>> that doesn't know how to optimize based on the contents of the output
>> of each command will have to invalidate the entire build whenever
>> information about importable headers changes.
> Ah, that we agree on.
>>> I would like to see some real numbers before concluding that there is a problem to be solved though.
>> A change in the information about importable headers causing the
>> entire build to be invalidated is catastrophically bad, and will be
>> completely unaffordable in many environments where C++ is used.
> Come on. You have to do better than that. It certainly isn't
> catastrophically bad if the scanning step only takes 2 seconds.
>>> Header units don't have to perform as well as named modules; they just have to perform better than source inclusion and have an adoption cost less than named modules to be a potentially attractive and viable solution.
>> The comparison is not on Header Units versus Source Inclusion, the
>> comparison is on what happens in the real adoption on existing build
>> systems without the need to completely reinvent how those work.
> That's the thing. Named modules *do* require reinventing build
> systems. None of the major projects I've worked on except for
> Clang has used a build system that can handle named modules or be
> reasonably extended to do so. But they can all handle (implicitly
> built) header units just fine and get an advantage by doing so.
>>> Section 2.3 states:
>>>> This is going to be particularly challenging if the ecosystem ends up in a situation where different compilers make different choices about how to handle the implicit replacement of `#include` by the equivalent `import`.
>>> Every implementation will need to be told which header files are importable.
>> I don't mean how they take it as input. But it is
>> implementation-dependent whether an `#include` is or isn't replaced by
>> the equivalent `import`, which means that the same project using the
>> same information about what are the importable headers to be used
>> could still end up with different results.
> Agreed. The dependency scanner and build system must be aligned
> with the behavior of the implementation they target.
>>> Section 3.1 states:
>>>> The main restriction that enables an interoperable use of pre-compiled headers is that the translation unit has to use it as a preamble to the translation unit, meaning the precompiled header is guaranteed not to be influenced by any other code in the translation unit
>>> I think I understand what you are trying to say here, but I find the wording awkward.
>> The important bit about pre-compiled headers is that the source
>> inclusion is fundamentally equivalent to using the pre-compiled
>> header.
> If used correctly, that is true. But I think it is just as true
> for header units when used correctly.
>> This is not true for Clang Header Modules (in principle, it's
>> assumed the user will only choose to declare clang header modules for
>> cases where that should be true). It is also not true for Importable
>> Headers in general. There are two entirely different semantics which
>> can result in entirely different programs.
> Kind of. The intent is that, if a header file is importable, then
> #include and import will behave the same and, if the header file
> is not importable, #include does source inclusion and import is
> ill-formed. So, when used correctly and consistently, there is
> only one semantic that is expressed.
>>> The first sentence is not quite true. Clang modules allow for BMI creation and selection to be dependent on macro definitions defined for the importing TU. See the config_macros declaration.
>> The important point is that the working of the preprocessor is
>> fundamentally different from that of source inclusion. And that so far
>> all early experiences rely on the user not selecting the wrong header
>> to be importable, and there is no mechanism to detect whether the user
>> chose the wrong thing or not. It will just fail to compile if you're
>> lucky, or result in ODR violations otherwise.
>
> Agreed, much like source inclusion results in ODR violations when
> a header file is included in multiple TUs that have distinct
> preprocessor state that the header file is sensitive to. Header
> units don't (fully) solve that problem (they do a little bit
> thanks to some of the semantic differences such as not being
> sensitive to the preprocessor state of the importing TU and
> preventing use of macros that are differently defined between the
> importing TU and an imported TU).
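
A small illustration of the source inclusion hazard mentioned above. The two
namespaces stand in for two separate TUs, and the struct body stands in for
the text of a hypothetical widget.h; in a real program both definitions would
be named ::Widget and the mismatch would be an ODR violation.

```cpp
#include <cassert>

// "TU A" defines WIDGET_BIG before including the header text.
namespace tu_a {
#define WIDGET_BIG
struct Widget {
#ifdef WIDGET_BIG
    long long payload[4];
#else
    int payload;
#endif
};
#undef WIDGET_BIG
}  // namespace tu_a

// "TU B" includes the same header text without the macro.
namespace tu_b {
struct Widget {
#ifdef WIDGET_BIG
    long long payload[4];
#else
    int payload;
#endif
};
}  // namespace tu_b

// The same header text produced two different layouts.
inline bool same_layout() {
    return sizeof(tu_a::Widget) == sizeof(tu_b::Widget);
}
```

Because a header unit is translated once, independently of the importing TU's
macros, both importers would see a single definition; whether that definition
is the one each TU expected is a separate question.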
>
> Tom.
>
>> daniel
>

Received on 2023-05-24 17:04:52