sg15: Re: [SG15] Scandeps format post-P1689R3 discussion

From: Ben Boeckel <ben.boeckel_at_[hidden]>
Date: Fri, 21 May 2021 07:02:03 -0400

On Fri, May 21, 2021 at 01:08:11 +0000, Olga Arkhipova via SG15 wrote:
> I agree that figuring out all dependencies for various sources is not
> a trivial task, and it would be great to have some sort of common
> format which all tools would use.
> But I believe this should be a different proposal/discussion, as it is
> a generic problem and not specific to module dependency scanning.

Agreed about it being a different proposal.

> The "scan" json is used to build the dependency graph between the
> sources. The info about source dependencies (includes, etc) is not
> used there and can take significant time to read. The outputs info is
> not needed to create the dependency graph (between the sources)
> either. The BMI locations will be needed to construct the command
> lines for the actual (not "scan") build, but this info is already
> available for each source as a part of the command line, i.e. known to
> the build system. Also, changing outputs locations would not change
> the "logical" module dependencies in the source, so by not having
> outputs in the "scan" json, we can avoid unnecessary scanning when
> output locations change for one reason or another.

The primary output is (generally) knowable to the build tool (`/Fo:dir`
is a bit hokey, but the build tool *decided* to write that flag), but
the discovered outputs are not. These can include .dwo, .dSYM, .pdb, or
other such files.

CMake certainly has zero idea about which BMI files come from where,
though; that is something the compiler can generally decide via its
module mapping mechanisms.

> We are very open to format/field name changes, but we'd like to keep
> "scan" json as small as possible to minimize its reading/writing time
> in the perf critical IDE scenarios.
>
> We'd also like to ensure that we can distinguish named modules and
> header units without any additional parsing/guessing, as their
> handling is significantly different, at least for MSVC.

The `unique-on-source-path` is meant to help there. Distinguishing
between `<>` and `""` is a bit tougher without overflowing into "what
the heck is this field for" in other languages.

> I'd propose to have different fields for named modules and header
> units to avoid any confusion there. If we want to use the same "scan"
> format shared between Fortran and C++, it should be a union of
> language capabilities, not an intersection.
>
> The Fortran (or other languages) can simply use only a subset of the
> properties that make sense to them (same for C++).
>
> With Ben's proposal to remove `inputs`, `outputs`, and `depends`
> arrays from the "scan" json (thanks Ben!) the formats are pretty close
> from the info perspective. Currently proposed CMake format with the
> removals looks like the following (please correct me if I am wrong
> here):

After further discussion, we also came up with these changes:

  - `primary-output` -> optional in the format, but if the build tool
    mentions it, it should be put here. This is analogous to the `-MT`
    flag and is likely used to fit the output into existing
    infrastructure.
  - `work-directory` -> optional in the format, but should be
    controllable by the caller (a build tool that doesn't understand its
    work directory structure is very confused). This is probably needed
    for Intellisense, but is not strictly required by build tools
    themselves. If needed, the field can be added by the build tool.
    Note that the caller needs this control because `getcwd()` ignores
    `$PWD` and resolves symlinks, which can lead to mismatches if the
    build tool refers to a directory by its `$PWD` name rather than its
    `realpath()`.[1]
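
To make the `getcwd()` vs. `$PWD` point concrete, here is a minimal
sketch of how a caller might pick the value to pass as `work-directory`.
The helper name `logical_cwd` is hypothetical and this is not what any
particular build tool does; it only assumes a POSIX-like environment
where the shell exports `$PWD`.

```python
import os


def logical_cwd(environ=None, getcwd=os.getcwd):
    """Prefer the $PWD spelling of the current directory when it is valid.

    `environ` and `getcwd` are injectable only so the sketch can be
    exercised without changing the process state.
    """
    env = os.environ if environ is None else environ
    physical = getcwd()        # symlinks already resolved by the OS
    logical = env.get("PWD")   # the name the user (or build tool) used
    if logical and os.path.isabs(logical):
        try:
            # Only trust $PWD if it still names the same directory.
            if os.path.samefile(logical, physical):
                return logical
        except OSError:
            pass               # stale or nonexistent $PWD: fall through
    return physical


if __name__ == "__main__":
    print(logical_cwd())
```

The fallback to `getcwd()` keeps the result usable when `$PWD` is unset
or stale, while the symlink name wins whenever it is still accurate.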

> We were discussing internally and were going to add an analog of
> "logical-name": "<header>" for header units which can potentially aid
> in some scenarios, so no disagreement for having it (or probably two
> different properties for "" and <>).
>
> I argue that obj and bmi output locations are already known to the
> build system as part of the source command line and are irrelevant to
> the "scan" compiler invocation which actually produces the json file
> (as the only output of that invocation).

Discovered outputs are very important for Fortran where the compiler is
the only thing that gets to determine the output names. The build tool
can only control the *directory* they get written to, but the filenames
themselves are under compiler control.

> So if we agree to not have obj and bmi output locations in the "scan"
> json or at least make them optional, we should be able to reconcile
> the formats.

We can make the fields optional, but scanners should have a way of
writing them out if asked (this includes split debug files and other
outputs that the build tool does *not* generally know about).
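
As a rough illustration of what "optional" would mean for a consumer,
the sketch below reads a scan entry that may or may not carry output
information. The helper `declared_outputs` and the sample entries are
made up for illustration; only the key names `primary-output`,
`outputs`, and `work-directory` come from this thread, and the exact
layout of the JSON is an assumption.

```python
import json


def declared_outputs(rule):
    """Collect whatever outputs a scan entry chose to report, if any."""
    outputs = []
    primary = rule.get("primary-output")      # optional in the format
    if primary is not None:
        outputs.append(primary)
    for path in rule.get("outputs", []):      # optional discovered outputs
        if path not in outputs:
            outputs.append(path)
    return outputs


# A minimal entry (no outputs reported) and a verbose one (scanner was
# asked to report them, including a split debug file).
minimal = json.loads('{"work-directory": "/build"}')
verbose = json.loads('{"primary-output": "m.o", "outputs": ["m.o", "m.dwo"]}')

if __name__ == "__main__":
    print(declared_outputs(minimal))
    print(declared_outputs(verbose))
```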

> I'm not too keen on the use of `<>` and `""` in the format itself
> either. It would be fine if we kept "boost/version.hpp" as not having
> to escape quotes. I'd be ok with the use of some optional boolean key
> to indicate that it was written as `<boost/version.hpp>`, but we
> shouldn't have to worry about properly escaping or roundtripping
> includes between JSON and... literally anything else.

Would a `lookup-method` field with a limited set of values make sense?
Say `by-name`, `by-local-path` (""), and `by-path` (<>)?
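
A sketch of how that could look on the producing side, so the include
delimiters never end up inside the JSON string. The function
`describe_requirement` is hypothetical, and pairing `lookup-method`
with the `logical-name` key is an assumption about the layout, not
something settled here.

```python
import json


def describe_requirement(spelling):
    """Map a source-level spelling to a name plus a lookup method."""
    if len(spelling) > 2 and spelling[0] == "<" and spelling[-1] == ">":
        return {"logical-name": spelling[1:-1], "lookup-method": "by-path"}
    if len(spelling) > 2 and spelling[0] == '"' and spelling[-1] == '"':
        return {"logical-name": spelling[1:-1],
                "lookup-method": "by-local-path"}
    # Anything without include delimiters is treated as a named module.
    return {"logical-name": spelling, "lookup-method": "by-name"}


if __name__ == "__main__":
    for spelling in ("<boost/version.hpp>", '"boost/version.hpp"', "my.module"):
        print(json.dumps(describe_requirement(spelling)))
```

With this shape, `boost/version.hpp` is stored bare in both include
forms, and nothing has to escape or round-trip quote characters.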

--Ben

[1] As a note, CMake does this in order to support HPC login nodes where
`/home/user` is a symlink to `/pool/$randomid/home/user` which differs
based on the login node in use. In this case, the *symlink* name is the
stable one and is used instead of `realpath()`.

Received on 2021-05-21 06:02:07