sg15: Re: [SG15] Scandeps format post-P1689R3 discussion

From: Olga Arkhipova <olgaark_at_[hidden]>
Date: Fri, 21 May 2021 01:08:11 +0000

I agree that figuring out all dependencies for various sources is not a trivial task, and it would be great to have some sort of common format which all tools would use.
But I believe this should be a different proposal/discussion, as it is a generic problem and not specific to module dependency scanning.

Implementing module support for VC projects in Visual Studio, we've discussed the originally proposed format for module dependencies https://wg21.link/P1689, but then decided to have only "logical" info about provided and imported modules in the "scan" json (/sourceDependencies:directives (List module and header unit dependencies) | Microsoft Docs<https://docs.microsoft.com/en-us/cpp/build/reference/sourcedependencies-directives?view=msvc-160>), and to keep the includes, outputs and command line info in different files as they are used at different times and different places.

The "scan" json is used to build the dependency graph between the sources. The info about source dependencies (includes, etc) is not used there and can take significant time to read. The outputs info is not needed to create the dependency graph (between the sources) either. The BMI locations will be needed to construct the command lines for the actual (not "scan") build, but this info is already available for each source as a part of the command line, i.e. known to the build system. Also, changing outputs locations would not change the "logical" module dependencies in the source, so by not having outputs in the "scan" json, we can avoid unnecessary scanning when output locations change for one reason or another.

We are very open to format/field name changes, but we'd like to keep "scan" json as small as possible to minimize its reading/writing time in the perf critical IDE scenarios.

We'd also like to ensure that we can distinguish named modules and header units without any additional parsing/guessing, as their handling is significantly different, at least for MSVC.

I'd propose to have different fields for named modules and header units to avoid any confusion there. If we want to use the same "scan" format shared between Fortran and C++, it should be a union of language capabilities, not an intersection.

The Fortran (or other languages) can simply use only a subset of the properties that make sense to them (same for C++).

With Ben's proposal to remove `inputs`, `outputs`, and `depends` arrays from the "scan" json (thanks Ben!) the formats are pretty close from the info perspective.
Currently proposed CMake format with the removals looks like the following (please correct me if I am wrong here):

//my.module.cpp:
export module my.module;

import other.module;
import <header>;

#include "config.h"
a full output for scanning this source could be:
Example dependency output
{
  "version": 1,
  "revision": 0,
  "rules": [
    "work-directory": "/scanner/working/dir",
    "outputs": [
      "my.module.cpp.o",
      "my_module.bmi"
    ],
    "provides": [
      {
        "logical-name": "my.module",
        "source-path": "my.module.cpp",
        "compiled-module-path": "my_module.bmi"
      }
    ],
    "requires": [
      {
        "logical-name": "other.module"
      }
      {
        "logical-name": "<header>",
        "source-path": "/system/include/path/header",
      }
    }
  ]
}

The current MSVC "scan" json format for the same source sample would look like

{
   "Version":"1.1",
   "Data":{
      "Source":"/scanner/working/dir/my.module.cpp",
      "ProvidedModule":"my.module",
      "ImportedModules":[
         "other.module"
      ],
      "ImportedHeaderUnits":[
         "/system/include/path/header"
      ]
   }
}

We were discussing internally and were going add an analog of "logical-name": "<header>" for header units which can potentially aid in some scenarios, so no disagreement for having it (or probably two different properties for "" and <>).

I argue that obj and bmi output locations are already known to the build system as part of the source command line and are irrelevant to the "scan" compiler invocation which actually produces the json file (as the only output of that invocation).

So If we agree to not have obj and bmi output locations in the "scan" json or at least make them optional, we should be able to reconcile the formats.

Thanks,

Olga

From: SG15 <sg15-bounces_at_[hidden]> On Behalf Of Isabella Muerte via SG15
Sent: Thursday, May 13, 2021 2:11 AM
To: Ben Boeckel <sg15_at_[hidden]>
Cc: Isabella Muerte <imuerte_at_[hidden]>
Subject: Re: [SG15] Scandeps format post-P1689R3 discussion

Apologies. I had responded several days ago, but there was an issue with the list and my message was stuck in the moderator queue. I'm resending this as the issue was resolved, but the message is still stuck in the queue.

On Tue, May 11, 2021, at 12:15, Ben Boeckel via SG15 wrote:
Hi,

After discussing with some implementers about real-world usage of the r3
format. In these discussions the following things have been considered:

# Dropping the `inputs`, `outputs`, and `depends` arrays

These fields are intended to capture the dependency information of the
scanning step itself. The intended replacement is the normal dependency
mechanisms offered by compilers (be it `/showIncludes`, `-M`, or other
possible mechanisms). These are really only of use to the build tool
itself to know when to recompile the file, so putting it into the file
format itself is unnecessary.

I'm very much against relying on 'existing' mechanisms as none of them are uniform between implementations, nor even between locale implementations. I would be fine with having multiple formats, or even spinning these arrays out into their own file, but a uniform format for these is desperately needed, especially in multi-language repositories where C++ code might be wrapped for use in something like Node, Ruby, Python, etc. C++ needs to be usable outside of a purely C++ environment, and making other languages struggle when they can just call `cargo build` or even just `rustc <entrypoint-file>` instead is going to be a detriment.

In addition, there are two other reasons that have come up to at least
consider changing these keys:

  - the `depends` array may be large: removing it means that tools which
    are just looking for `future-compile` information do not need to
    parse and throw away these fields
  - it does not match the semantics in batch scanning: what is really
    wanted here is a top-level `inputs`, `outputs`, and `depends` set of
    keys because the batch scanning step itself has just one set of
    this information.

Since compilers, build tools, etc. already have mechanisms for
communicating this information around, reusing those rather than
injecting them into the format is likely a better solution.

# Moving `future-compile` keys up one level

Now that the other keys here are gone with the prior point
(`future-link` was dropped in r1 due to its dubious usefulness[1]),
`future-compile`'s keys `outputs`, `provides`, and `requires` keys can
be moved up one level.

# Clarification of header unit representations

C++ has a difference between header units and module units. However,
this is a C++-ism and the format would be useful to use in other
languages as well (namely Fortran) where such a dichotomy does not
exist. As such, formalizing the use of `<>` and `""` wrapping in the
`logical-name` key should be noted for use for C++. Due to the multiple
ways of spelling various includes (e.g., `"boost/version.hpp"`,
`<boost/version.hpp>` or even `<Boost/Version.hpp>` on case-insensitive
filesystems), the `source-path` should be used for linking such uses
together instead.

To do this, the `logical-name` should be the in-source name (for use in
module mapper or other such mechanisms for finding the BMI for a given
module by its name), `source-path` must be provided for header units.
I don't think specifying `<>` and `""` in the format itself is the
greatest, so an additional bool key named `unique-on-source-path` (open
to bikeshedding) would indicate that the `source-path` should be used
for linking between `provide` and `requires` lookups.

Thoughts?

I'm not too keen on the use of `<>` and `""` in the format itself either. It would be fine if we kept "boost/version.hpp" as not having to escape quotes. I'd be ok with the use of some optional boolean key to indicate that it was written as `<boost/version.hpp>`, but we shouldn't have to worry about properly escaping or roundtripping includes between JSON and... literally anything else.

--Ben

[1] The only usecase that we could come up with was MSVC's
`#pragma(comment)` for "autolinking".
_______________________________________________
SG15 mailing list
SG15_at_[hidden]<mailto:SG15%40lists.isocpp.org>
https://lists.isocpp.org/mailman/listinfo.cgi/sg15<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.isocpp.org%2Fmailman%2Flistinfo.cgi%2Fsg15&data=04%7C01%7Colgaark%40microsoft.com%7C4257018e646f4eb6042808d9160b1849%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637565059618594869%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=8NwdT85J3Bx8owa5ANWIDaUJUZbMj9ddty4huOVmQv4%3D&reserved=0>

Received on 2021-05-20 20:08:18