ISOCPP SG15 List: Re: P2898R0: Importable Headers are Not Universally Implementable

From: Gabriel Dos Reis <gdr_at_[hidden]>
Date: Wed, 24 May 2023 11:34:48 +0000

Note that the need to change Ninja predated the adoption of C++ Modules. My understanding is that it was largely motivated by the needs of Fortran modules; see the papers from Kitware for historical context. For instance: https://www.kitware.com/import-cmake-c20-modules/

Quoting:

In 2015, the Trilinos project funded an effort to add support for Fortran modules to the ninja build tool by adding the dyndep feature which lived in a ninja fork maintained by Kitware for four long years. See the documentation for dyndep<https://ninja-build.org/manual.html#ref_dyndep> for more information. With the announcement of C++ 20 modules the ninja team was convinced it was worth the effort to merge this work upstream from the Kitware fork and in May of 2019 dyndeps were merged into ninja.

-- Gaby

From: SG15 <sg15-bounces_at_[hidden]> On Behalf Of Tom Honermann via SG15
Sent: Tuesday, May 23, 2023 6:54 PM
To: Daniel Ruoso <daniel_at_[hidden]>
Cc: Tom Honermann <tom_at_[hidden]>; sg15_at_[hidden]
Subject: Re: [SG15] P2898R0: Importable Headers are Not Universally Implementable

On 5/23/23 11:54 AM, Daniel Ruoso wrote:

On Mon, May 22, 2023 at 16:17, Tom Honermann <tom_at_[hidden]> wrote:

First, I don't know what is meant by "universally implementable". The paper doesn't offer a definition of the term and its use within the paper doesn't present an intuitive meaning, at least not for me (note that "universally implementable" appears only in the title and in an assertion in section 2.1 in the discussion of #pragma once).

The Abstract covers what I mean. Essentially, the C++ specification
needs to be implementable in all the places where C++ is currently
used, and the specification of Importable Headers currently fails that
criterion. It is only implementable in a subset of environments where
C++ is used.

What subset of environments is it not implementable in? The paper doesn't actually say as far as I can tell. What is a concrete example?

If the concern is that dependency scanning for importable headers is dependent on which implementation the result of the dependency scanning will be used to construct a build system for, then I agree that dependency scanning does not produce an implementation independent result.

It's not about it being implementation-independent or not, it's about
the costs being prohibitive in a lot of environments where C++ is
used. Everything is technically possible, but every decision has
costs, some of which may be unaffordable to some use cases. IMHO, the
costs of Importable Headers are unaffordable for environments that use
open-ended build systems, such as systems that have dependencies
expressed entirely by producing binary artifacts that are used as
dependencies of other builds. This is the case at Bloomberg, but it's
also the case for systems like Conan, vcpkg, as well as most GNU/Linux
distributions.

The paper doesn't quantify costs in any way. The closest it comes is (correctly) noting that dependency scanning for header units requires computing imported macros in a bottom up way that is not required for named modules. But the paper doesn't quantify that cost. Is it a 5% hit? A 95% hit? Linear with respect to the number of header units? Other papers have offered quantification; see P1441 (Are modules fast?)<https://wg21.link/p1441> for example.

But I don't find that to be particularly concerning either; absent a (possibly implementation dependent and often complicated) include search path, it is not possible, in general, to universally map #include directives to source files.

Correct. Prior to modules, that had a specific cost. The build system
needs to incorporate the source inclusions into the dependency graph
after the initial compilation such that incremental builds are done
correctly.

With named modules, the build system needs to be able to incorporate
dependencies *prior* to a clean build. This required a change in
ninja, for instance. And this is what has allowed us to have a proper
plan on how we're going to get that implemented.

Indeed; I believe it is equivalent to requirements for support of generated headers.

With Importable Headers, however, the dependency scanning itself has a
dependency on additional information before any source file is even
read. I'll dive into this in the next paragraphs.

Section 2.1 states:

This identity problem has always been a complicated topic for the C++ specification, the `#pragma once` directive has been supported by various implementations in varying degrees of compatibility, but it cannot be universally implemented because we don't have a way of specifying what is the thing that should be included only "once" given the way that header search works.

This issue does exist (and has for a very long time) but I don't see how it is relevant. Since header to source file mapping is implementation-defined, a dependency scanner or build system must match the implementation-defined behavior for the targeted implementations.

The dependency scanning needs to be able to map how the importable
header is specified in order to understand which `#include` directives
can be turned into imports.

That means we need to establish an identity between the files actually
opened by the dependency scanning process to the tokens used in both
`#include` and `import` directives before the dependency scanning even
runs.

Yes. This is what I meant by "a dependency scanner or build system must match the implementation-defined behavior for the targeted implementations."

Section 2.2 states:

The cost of that approach, however, is that we create a significant bottleneck in the dependency chain of any given object. Changing the list of Importable Headers or the Local Preprocessor Arguments for any one of them will result in a complete invalidation of the dependency scanning for all translation units in the project.

I think the bold text is not quite correct. Changing whether a header is importable or not only invalidates the TUs that include/import it.

Not quite, it invalidates the dependency scanning itself, because the
switch from source inclusion to import (even if it's still spelled as
`#include` in the source) can change the way the preprocessor handles
the code after the import, which can change the output of the
dependency scanning.

The problem, again, is that this information is needed before the
dependency scanning even runs. If that information is an input to the
dependency scanning, it means changing it invalidates it.

However, the impacted TUs are only those that already had a #include directive for the header. TUs that didn't are not affected. I do appreciate that optimizing for this case requires understanding precisely what changed and that the simplest approach is to just perform the whole scan. This seems like a QoI issue to me.

Likewise, changing how a header unit is built only invalidates the TUs that import it (which may affect additional TUs if the importing TU is also an importable one).
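
A minimal sketch of that narrower invalidation, assuming the build system retains the include/import edges recorded by a previous scan. The file names and the `direct_users` map are hypothetical, and the sketch sidesteps the question of whether those recorded edges are still trustworthy once the set of importable headers changes.

```python
from collections import deque


def affected_units(changed_header, direct_users):
    """Return every unit whose scan result may be stale after a header changes.

    `direct_users` maps a header to the units (TUs or other importable
    headers) that include or import it directly.
    """
    stale = set()
    pending = deque([changed_header])
    while pending:
        current = pending.popleft()
        for user in direct_users.get(current, ()):
            if user not in stale:
                stale.add(user)
                # A stale importable header makes its own users stale too.
                pending.append(user)
    return stale


# Hypothetical project: only a.cpp and widget.h use config.h directly.
direct_users = {
    "config.h": ["a.cpp", "widget.h"],
    "widget.h": ["b.cpp"],
    "util.h": ["c.cpp"],
}

print(sorted(affected_units("config.h", direct_users)))
# ['a.cpp', 'b.cpp', 'widget.h']
```

Here `c.cpp` never reaches `config.h`, so it stays valid; `b.cpp` is pulled in only because it uses `widget.h`, which is itself affected.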

Again, changing the initial preprocessor state of the importable
header unit invalidates the dependency scanning itself for the entire
project, since the dependency scanning has to emulate the behavior of
the import. And the dependency scanning needs to run before we know
anything else about the code.

Per above, it seems that we disagree on this point.

The problem is equivalent to an update to a generated header file. Generated headers likewise need to be built and scanned as part of dependency generation. But updating one of them (via an update to the generator or another of the inputs) doesn't invalidate the dependency information for the entire project; it just invalidates the dependencies for the TUs that include the generated header.

So, instead of "will result", I would substitute "might result". I don't see any reason why a build system can't cache these results and update them when they are found to be violated;

I'm not sure what you mean by that. The way that this works for named
modules is:

1. You start with the compiler command line for the translation unit,
   and use that to calculate the command line for dependency scanning
   (input is only the build configuration).

2. The dependency scanning for that translation unit produces P1689.
   The inputs at this point are just the command line options and the
   contents of the file. The output is only module-level dependencies,
   because source inclusion is only needed to augment the incremental
   build.

3. The dependency information is collated by the build system in order
   to generate the input module map for each TU. This, again, does not
   depend on source inclusion information. And if the build system uses a
   predictable naming scheme for modules, it doesn't need to know how the
   other TU was produced, it can just leave the dependency edge to be
   tracked by the build system.

That sounds right given the restrictions on module units in [module.import]p1<http://eel.is/c++draft/module.import#1>.
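
The collation in step 3 can be sketched roughly as follows. This is an illustration only: the rule dictionaries loosely mimic the shape of P1689 rules (`primary-output`, `provides`, `requires`, `logical-name`), while `bmi_path`, its naming scheme, and the file names are hypothetical.

```python
def bmi_path(module_name):
    # Hypothetical predictable naming scheme; real build systems pick their own.
    return "bmi/" + module_name.replace(":", "-") + ".bmi"


def collate(rules):
    """Turn per-TU scan results into a module map for each TU.

    Only module-level information is consulted: no source inclusion data,
    and no knowledge of which TU provides a given module.
    """
    module_maps = {}
    for rule in rules:
        required = [req["logical-name"] for req in rule.get("requires", [])]
        module_maps[rule["primary-output"]] = {
            name: bmi_path(name) for name in required
        }
    return module_maps


rules = [
    {"primary-output": "math.o",
     "provides": [{"logical-name": "math", "is-interface": True}]},
    {"primary-output": "main.o",
     "requires": [{"logical-name": "math"}]},
]

maps = collate(rules)
print(maps["main.o"])  # {'math': 'bmi/math.bmi'}
print(maps["math.o"])  # {}
```

Because the BMI path is derived from the module name alone, the map for `main.o` can be written without looking at how `math.o` is produced; the edge between the two is left for the build system to track.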

The end result is that for the initial build, the only inputs for a
translation unit are the command line, the sources for the translation
unit and the files provided by other targets in the build system that
are listed in the module map.

As well as any generated header files that are included in the TU (typically none of course).

When Importable Headers are introduced, we have an input to the
dependency scanning process, which means that the rule to produce the
module mapper depends on the list of importable headers and the
arguments to those. The module mapper data is an input to the actual
translation, which will invalidate the translation unit itself.

No, the invalidation only occurs if the TU actually uses that input.

The end result is that any change to that input invalidates the builds
of all translation units in the project.

Per prior statements, I disagree.

Some build systems may be able to work around that issue in various
ways. The way Clang Header Modules and the early MSVC adoption work
around it is by assuming that the dependency scanning doing a plain
source inclusion is equivalent to the actual dependencies that will be
needed later. It is considered a user error if that is not true.

MSVC and Clang with explicit modules assume that, yes. Clang modules is smarter about it when modules are implicitly built. I haven't played with Clang with explicitly built modules enough to know if it diagnoses incorrect assumptions.

IMHO, that is not a "correct" implementation of the specification. The
specification requires a correct dependency scanning to emulate the
import, by starting a new preprocessor context based on the arguments
used when translating the header unit and merging the final state into
the state of the TU doing the import. Which means it needs to know the
list of header units and the arguments for each of those.

I would argue that is a QoI concern, but I otherwise agree.
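
To make the difference concrete, here is a toy model of a scanner that either textually includes a header or emulates an import of it. It is a sketch under stated assumptions, not how any real compiler scans: the tiny directive language, the `scan` function, and the header names are all hypothetical, and "exported macros" is simplified to "macros the header unit defined or changed relative to its own starting arguments".

```python
def scan(lines, macros, headers, importable, reached):
    """Toy dependency scan over a tiny directive language.

    `importable` maps a header name to the macros its header unit is
    built with; headers absent from it are textually included.
    """
    for op, *args in lines:
        if op == "define":
            name, value = args
            macros[name] = value
        elif op == "ifdef":
            name, body = args
            if name in macros:
                scan(body, macros, headers, importable, reached)
        elif op == "include":
            (name,) = args
            reached.append(name)
            if name in importable:
                # Import: fresh context seeded only by the header unit's
                # own arguments, blind to the importing TU's macros.
                seed = importable[name]
                unit_macros = dict(seed)
                scan(headers[name], unit_macros, headers, importable, reached)
                exported = {k: v for k, v in unit_macros.items()
                            if seed.get(k) != v}
                macros.update(exported)
            else:
                # Source inclusion: the header sees the includer's macros.
                scan(headers[name], macros, headers, importable, reached)
    return reached


headers = {
    "config.h": [("ifdef", "USE_FAST", [("include", "fast.h")]),
                 ("define", "CONFIG_DONE", "1")],
    "fast.h": [],
}
tu = [("define", "USE_FAST", "1"), ("include", "config.h")]

print(scan(tu, {}, headers, {}, []))                # ['config.h', 'fast.h']
print(scan(tu, {}, headers, {"config.h": {}}, []))  # ['config.h']
```

The same translation unit yields a different set of reached headers depending on whether `config.h` is treated as importable and on the arguments it is built with, which is why that information has to be known before the scan runs.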

Doing the correct thing will mean any environment using a build system
that doesn't know how to optimize based on the contents of the output
of each command will have to invalidate the entire build whenever
information about importable headers changes.

Ah, that we agree on.

I would like to see some real numbers before concluding that there is a problem to be solved though.

A change in the information about importable headers causing the
entire build to be invalidated is catastrophically bad, and will be
completely unaffordable in many environments where C++ is used.

Come on. You have to do better than that. It certainly isn't catastrophically bad if the scanning step only takes 2 seconds.

Header units don't have to perform as well as named modules; they just have to perform better than source inclusion and have an adoption cost less than named modules to be a potentially attractive and viable solution.

The comparison is not on Header Units versus Source Inclusion, the
comparison is on what happens in the real adoption on existing build
systems without the need to completely reinvent how those work.

That's the thing. Named modules do require reinventing build systems. None of the major projects I've worked on except for Clang has used a build system that can handle named modules or be reasonably extended to do so. But they can all handle (implicitly built) header units just fine and get an advantage by doing so.

Section 2.3 states:

This is going to be particularly challenging if the ecosystem ends up in a situation where different compilers make different choices about how to handle the implicit replacement of `#include` by the equivalent `import`.

Every implementation will need to be told which header files are importable.

I don't mean how they take it as input. But it is
implementation-dependent whether an `#include` is or isn't replaced by
the equivalent `import`, which means that the same project using the
same information about what are the importable headers to be used
could still end up with different results.

Agreed. The dependency scanner and build system must be aligned with the behavior of the implementation they target.

Section 3.1 states:

The main restriction that enables a interoperable use of pre-compiled headers is that the translation unit has to use it as a preamble to the translation unit, meaning the precompiled header is guaranteed not to be influenced by any other code in the translation unit

I think I understand what you are trying to say here, but I find the wording awkward.

The important bit about pre-compiled headers is that the source
inclusion is fundamentally equivalent to using the pre-compiled
header.

If used correctly, that is true. But I think it is just as true for header units when used correctly.

This is not true for Clang Header Modules (in principle, it's
assumed the user will only choose to declare clang header modules for
cases where that should be true). It is also not true for Importable
Headers in general. There are two entirely different semantics which
can result in entirely different programs.

Kind of. The intent is that, if a header file is importable, then #include and import will behave the same and, if the header file is not importable, #include does source inclusion and import is ill-formed. So, when used correctly and consistently, there is only one semantic that is expressed.

The first sentence is not quite true. Clang modules allow for BMI creation and selection to be dependent on macro definitions defined for the importing TU. See the config_macros declaration.

The important point is that the working of the preprocessor is
fundamentally different from that of source inclusion. And that so far
all early experiences rely on the user not selecting the wrong header
to be importable, and there is no mechanism to detect whether the user
chose the wrong thing or not. It will just fail to compile if you're
lucky, or result in ODR violations otherwise.

Agreed, much like source inclusion results in ODR violations when a header file is included in multiple TUs that have distinct preprocessor state that the header file is sensitive to. Header units don't (fully) solve that problem (they do a little bit thanks to some of the semantic differences such as not being sensitive to the preprocessor state of the importing TU and preventing use of macros that are differently defined between the importing TU and an imported TU).

Tom.

daniel

Received on 2023-05-24 11:34:54