ISOCPP SG15 List: Re: P2898R0: Importable Headers are Not Universally Implementable

From: Daniel Ruoso <daniel_at_[hidden]>
Date: Tue, 23 May 2023 11:54:32 -0400

On Mon, May 22, 2023 at 16:17, Tom Honermann
<tom_at_[hidden]> wrote:
> First, I don't know what is meant by "universally implementable". The paper doesn't offer a definition of the term and its use within the paper doesn't present an intuitive meaning, at least not for me (note that "universally implementable" appears only in the title and in an assertion in section 2.1 in the discussion of #pragma once).

The Abstract covers what I mean. Essentially, the C++ specification
needs to be implementable in all the places where C++ is currently
used, and the specification of Importable Headers currently fails that
criterion. It is only implementable in a subset of the environments
where C++ is used.

> If the concern is that dependency scanning for importable headers is dependent on which implementation the result of the dependency scanning will be used to construct a build system for, then I agree that dependency scanning does not produce an implementation independent result.

It's not about whether the result is implementation-independent; it's
about the costs being prohibitive in a lot of environments where C++ is
used. Everything is technically possible, but every decision has
costs, some of which may be unaffordable for some use cases. IMHO, the
costs of Importable Headers are unaffordable for environments that use
open-ended build systems, such as systems that have dependencies
expressed entirely by producing binary artifacts that are used as
dependencies of other builds. This is the case at Bloomberg, but it's
also the case for systems like Conan, vcpkg, as well as most GNU/Linux
distributions.

> But I also don't find that to be particularly concerning either; absent a (possibly implementation dependent and often complicated) include search path, it is not possible, in general, to universally map #include directives to source files either.

Correct. Prior to modules, that had a specific cost. The build system
needs to incorporate the source inclusions into the dependency graph
after the initial compilation such that incremental builds are done
correctly.

With named modules, the build system needs to be able to incorporate
dependencies *prior* to a clean build. This required a change in
ninja, for instance. And this is what has allowed us to have a proper
plan on how we're going to get that implemented.

With Importable Headers, however, the dependency scanning itself
depends on additional information before any source file is even
read. I'll dive into this in the next paragraphs.

> Section 2.1 states:
> > This identity problem has always been a complicated topic for the C++ specification, the `#pragma once` directive has been supported by various implementations in varying degrees of compatibility, but it cannot be universally implemented because we don’t have a way of specifying what is the thing that should be included only “once” given the way that header search works.
> This issue does exist (and has for a very long time) but I don't see how it is relevant. Since header to source file mapping is implementation-defined, a dependency scanner or build system must match the implementation-defined behavior for the targeted implementations.

The dependency scanning needs to be able to map how the importable
header is specified in order to understand which `#include` directives
can be turned into imports.

That means we need to establish an identity between the files actually
opened by the dependency scanning process and the tokens used in both
`#include` and `import` directives before the dependency scanning even
runs.

> Section 2.2 states:
> > The cost of that approach, however, is that we create a significant bottleneck in the dependency chain of any given object. Changing the list of Importable Headers or the Local Preprocessor Arguments for any one of them will result in a complete invalidation of the dependency scanning for all translation units in the project.
> I think the bold text is not quite correct. Changing whether a header is importable or not only invalidates the TUs that include/import it.

Not quite, it invalidates the dependency scanning itself, because the
switch from source inclusion to import (even if it's still spelled as
`#include` in the source) can change the way the preprocessor handles
the code after the import, which can change the output of the
dependency scanning.

The problem, again, is that this information is needed before the
dependency scanning even runs. If that information is an input to the
dependency scanning, then changing it invalidates the scan results.
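
To make the point about preprocessor state concrete, here is a minimal
sketch of a toy scanner. It is only an illustration under stated
assumptions: the header `config.h`, the macro `USE_FAST`, the files
`fast.h` and `slow.h`, and the functions `scan_config_h` and `scan_tu`
are all hypothetical, and the model ignores macros exported back from
the header unit.

```cpp
#include <cassert>
#include <set>
#include <string>

// The set of macros defined in a preprocessor context.
using Macros = std::set<std::string>;

// Hypothetical header "config.h":
//   #ifdef USE_FAST
//   #include "fast.h"
//   #else
//   #include "slow.h"
//   #endif
// Returns the file a scanner would record as a dependency.
std::string scan_config_h(const Macros& state) {
    return state.count("USE_FAST") ? "fast.h" : "slow.h";
}

// Hypothetical translation unit:
//   #define USE_FAST
//   #include "config.h"
// `local_args` stands in for the Local Preprocessor Arguments of the
// header unit; they only apply when config.h is importable.
std::string scan_tu(bool config_is_importable, const Macros& local_args) {
    assert(config_is_importable || local_args.empty());
    const Macros tu_state{"USE_FAST"};
    if (config_is_importable) {
        // Import semantics: the header starts from its own preprocessor
        // context and does not see the macros of the importing TU.
        return scan_config_h(local_args);
    }
    // Source inclusion: the header sees the TU's current macro state.
    return scan_config_h(tu_state);
}
```

In this model, `scan_tu(false, {})` reports `fast.h` while
`scan_tu(true, {})` reports `slow.h`, so the scan output for the same
source text depends on information that has to exist before the scan
starts.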

> Likewise, changing how a header unit is built only invalidates the TUs that import it (which may affect additional TUs if the importing TU is also an importable one).

Again, changing the initial preprocessor state of the importable
header unit invalidates the dependency scanning itself for the entire
project, since the dependency scanning has to emulate the behavior of
the import. And the dependency scanning needs to run before we know
anything else about the code.

> So, instead of "will result", I would substitute "might result". I don't see any reason why a build system can't cache these results and update them when they are found to be violated;

I'm not sure what you mean by that. The way that this works for named
modules is:

1. You start with the compiler command line for the translation unit,
and use that to calculate the command line for dependency scanning
(input is only the build configuration)
2. The dependency scanning for that translation unit produces P1689.
The inputs at this point are just the command line options and the
contents of the file. The output is only module-level dependencies,
because source inclusion is only needed to augment the incremental
build.
3. The dependency information is collated by the build system in order
to generate the input module map for each TU. This, again, does not
depend on source inclusion information. And if the build system uses a
predictable naming scheme for modules, it doesn't need to know how the
other TU was produced, it can just leave the dependency edge to be
tracked by the build system.

The end result is that for the initial build, the only inputs for a
translation unit are the command line, the sources for the translation
unit and the files provided by other targets in the build system that
are listed in the module map.
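
The three steps above can be sketched roughly as follows. This is a
simplified model, not the actual P1689 format (which is JSON and
carries more fields): the `ScanResult` struct, the `bmi_path` naming
scheme, and the `collate` function are assumptions made for
illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the scan output of one translation unit.
struct ScanResult {
    std::string source;                  // the scanned source file
    std::vector<std::string> provides;   // modules this TU provides
    std::vector<std::string> requires_;  // modules this TU imports
};

// Hypothetical predictable naming scheme: the BMI location depends
// only on the module name, not on how the providing TU was built.
std::string bmi_path(const std::string& module_name) {
    assert(!module_name.empty());
    return "bmi/" + module_name + ".bmi";
}

// Module name -> BMI path, as handed to one translation.
using ModuleMap = std::map<std::string, std::string>;

// Step 3: collate the scan results into one module map per TU.
// No source inclusion information is consulted here.
std::map<std::string, ModuleMap> collate(const std::vector<ScanResult>& scans) {
    std::map<std::string, ModuleMap> maps;
    for (const auto& scan : scans) {
        ModuleMap& map = maps[scan.source];
        for (const auto& name : scan.requires_) {
            map[name] = bmi_path(name);
        }
    }
    return maps;
}
```

Because the path is derived from the module name alone, the map for an
importing TU can be written without knowing which TU provides the
module; the build system only has to track the edge to that file.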

When Importable Headers are introduced, the dependency scanning
process gains an additional input, which means that the rule to produce
the module mapper depends on the list of importable headers and the
arguments to each of them. The module mapper data is an input to the
actual translation, so a change there invalidates the translation unit
itself.

The end result is that any change to that input invalidates the builds
of all translation units in the project.
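
A minimal sketch of that invalidation, assuming a build system that
decides whether to rerun a rule purely by comparing the rule's declared
inputs. The `TranslationUnit` struct and the functions
`scan_rule_inputs` and `count_rescans` are hypothetical, and the
importable header configuration is modelled as a single string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct TranslationUnit {
    std::string command_line;  // from the build configuration
    std::string source_text;   // contents of the source file
};

// Everything the scan rule for one TU declares as an input. With named
// modules only, `importable_headers` would be empty and constant.
std::string scan_rule_inputs(const TranslationUnit& tu,
                             const std::string& importable_headers) {
    assert(!tu.command_line.empty());
    return tu.command_line + '\n' + tu.source_text + '\n' + importable_headers;
}

// Counts the TUs whose scan must be rerun after the importable header
// configuration changes from `before` to `after`.
std::size_t count_rescans(const std::vector<TranslationUnit>& tus,
                          const std::string& before,
                          const std::string& after) {
    std::size_t count = 0;
    for (const auto& tu : tus) {
        if (scan_rule_inputs(tu, before) != scan_rule_inputs(tu, after)) {
            ++count;
        }
    }
    return count;
}
```

In this model every TU is rescanned when the configuration changes,
including TUs that never mention the affected header, because the rule
cannot know that without running the scan.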

Some build systems may be able to work around that issue in various
ways. Clang Header Modules and the early MSVC adoption work around it
by assuming that a dependency scan doing a plain source inclusion is
equivalent to the actual dependencies that will be needed later. It is
considered a user error if that is not true.

IMHO, that is not a "correct" implementation of the specification. The
specification requires a correct dependency scanning to emulate the
import, by starting a new preprocessor context based on the arguments
used when translating the header unit and merging the final state into
the state of the TU doing the import. Which means it needs to know the
list of header units and the arguments for each of those.

Doing the correct thing will mean any environment using a build system
that doesn't know how to optimize based on the contents of the output
of each command will have to invalidate the entire build whenever
information about importable headers changes.

> I would like to see some real numbers before concluding that there is a problem to be solved though.

A change in the information about importable headers causing the
entire build to be invalidated is catastrophically bad, and will be
completely unaffordable in many environments where C++ is used.

> Header units don't have to perform as well as named modules; they just have to perform better than source inclusion and have an adoption cost less than named modules to be a potentially attractive and viable solution.

The comparison is not between Header Units and Source Inclusion; it is
about what happens in real adoption on existing build systems, without
the need to completely reinvent how those work.

> Section 2.3 states:
> > This is going to be particularly challenging if the ecosystem ends up in a situation where different compilers make different choices about how to handle the implicit replacement of `#include` by the equivalent `import`.
> Every implementation will need to be told which header files are importable.

I don't mean how they take it as input. But it is
implementation-dependent whether an `#include` is or isn't replaced by
the equivalent `import`, which means that the same project, using the
same information about which headers are importable, could still end
up with different results.

> Section 3.1 states:
> > The main restriction that enables a interoperable use of pre-compiled headers is that the translation unit has to use it as a preamble to the translation unit, meaning the precompiled header is guaranteed not to be influenced by any other code in the translation unit
> I think I understand what you are trying to say here, but I find the wording awkward.

The important bit about pre-compiled headers is that the source
inclusion is fundamentally equivalent to using the pre-compiled
header. This is not true for Clang Header Modules (in principle, it's
assumed the user will only choose to declare clang header modules for
cases where that should be true). It is also not true for Importable
Headers in general. There are two entirely different semantics which
can result in entirely different programs.

> The first sentence is not quite true. Clang modules allow for BMI creation and selection to be dependent on macro definitions defined for the importing TU. See the config_macros declaration.

The important point is that the working of the preprocessor is
fundamentally different from that of source inclusion, and that so far
all early experiences rely on the user not selecting the wrong header
to be importable. There is no mechanism to detect whether the user
chose the wrong thing or not. It will just fail to compile if you're
lucky, or result in ODR violations otherwise.

daniel

Received on 2023-05-23 15:54:45
