sg15: [Tooling] Dependency information for module-aware build tools

From: Ben Boeckel <ben.boeckel_at_[hidden]>
Date: Mon, 4 Mar 2019 17:57:53 -0500

Hi,

For CMake support for C++ modules, I've patched GCC so it outputs
dependency information in a JSON format. Before going too far down this
road, I'd like to get feedback on the format. This is for the purposes
of being able to implement D1483R1[1] without requiring build tools to
implement a C++ parser and instead have the compiler do the "scan" step
described there.

    {                                 //
      "outputs": [                    // Files to be output for this
        "source.o"                    // compilation[2].
      ],                              //
      "provides": [                   // BMI files provided by this
        "I.gcm"                       // compilation.
      ],                              //
      "logical-provides": {           // Mapping of module names provided
        "I": "I.gcm"                  // to provided BMI files.
      },                              //
      "requires": [                   // Module names required by this
        "M"                           // compilation.
      ],                              //
      "depends": [                    // Preprocessor dependency files
        "../path/to/source.cpp",      // which affect this scan (so it can
        "/usr/include/stdc-predef.h"  // be rerun if necessary).
      ],                              //
      "version": 0,                   // The file format version.
      "revision": 1                   // The file format revision.
    }                                 //

This example output is for a file with the contents:

    export module I;
    import M;

    export int i() {
        return m();
    }

My existing patch to GCC is currently missing `revision` and uses
`version` == 1 (but my CMake patches also don't check the field right
now). I'd like to get a Clang patch written up in the next few weeks.
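
A rough sketch of how a build tool might consume this output, in Python.
It assumes the `//` annotations above are explanatory only and absent
from the real output (JSON has no comments), and that every value is a
plain string. The second scan result (for a source providing `M`) and the
helper names `build_module_map` and `resolve_requires` are hypothetical,
made up for illustration.

```python
import json

# Scan output for the example source above, without the explanatory comments.
SCAN_I = json.loads("""
{
  "outputs": ["source.o"],
  "provides": ["I.gcm"],
  "logical-provides": {"I": "I.gcm"},
  "requires": ["M"],
  "depends": ["../path/to/source.cpp", "/usr/include/stdc-predef.h"],
  "version": 0,
  "revision": 1
}
""")

# Hypothetical scan output for whichever source provides module M.
SCAN_M = {
    "outputs": ["m.o"],
    "provides": ["M.gcm"],
    "logical-provides": {"M": "M.gcm"},
    "requires": [],
    "depends": ["m.cpp"],
    "version": 0,
    "revision": 1,
}


def build_module_map(scans):
    """Merge every scan's `logical-provides` into one name -> BMI mapping."""
    module_map = {}
    for scan in scans:
        for name, bmi in scan["logical-provides"].items():
            if name in module_map:
                raise ValueError(f"module {name!r} is provided more than once")
            module_map[name] = bmi
    return module_map


def resolve_requires(scan, module_map):
    """Return the BMI files that must exist before this compilation runs."""
    missing = [name for name in scan["requires"] if name not in module_map]
    if missing:
        raise KeyError(f"no scanned source provides: {missing}")
    return [module_map[name] for name in scan["requires"]]


module_map = build_module_map([SCAN_I, SCAN_M])
bmi_deps = resolve_requires(SCAN_I, module_map)
```

Here `bmi_deps` lists the BMI files that the compilation producing
`source.o` has to wait for, which is the ordering edge the build tool
needs.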

Points to note:

  - The type of each top-level value is fixed and the key names are
    never localized:
    * `outputs`: array
    * `provides`: array
    * `logical-provides`: object
    * `requires`: array
    * `depends`: array
    * `version`: int
    * `revision`: int
  - Values are strings if the name or path is valid UTF-8. The keys of
    `logical-provides` must be strings, so the entries of `requires`
    must be strings as well: they are used as lookup keys to find the
    on-disk file for the matching `logical-provides` entry (this
    shouldn't be an issue since these are module names).
  - In the case of invalid UTF-8, an object is used with the following
    layout (all data here is literal and not localized):

    {                     //
      "format": "...",    // The format of the data.
      "data": [...]       // Array of integers interpreted as
    }                     // the appropriate integer size.

  - Relative paths are relative to the working directory of the
    compiler. Build tools may need to rewrite these paths into a form
    they actually understand.[3]
  - `version` is bumped if there is any semantic data added (e.g., more
    information which is required to get a correct build), types change,
    etc.
  - `revision` is bumped if an additional helpful, but not semantically
    important, field is added to the format.
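
The type and versioning rules above could be checked by a consumer along
these lines. This is only a sketch in Python: treating every key as
mandatory, supporting exactly version 0, and the names `check_scan`,
`SUPPORTED_VERSION` and `KNOWN_REVISION` are all assumptions made for
illustration, not part of the proposal.

```python
# Expected JSON type of each top-level key, as listed above.
TOP_LEVEL_TYPES = {
    "outputs": list,
    "provides": list,
    "logical-provides": dict,
    "requires": list,
    "depends": list,
    "version": int,
    "revision": int,
}

SUPPORTED_VERSION = 0  # assumption: this tool understands only version 0
KNOWN_REVISION = 1     # assumption: newest revision this tool knows about


def check_scan(scan):
    """Validate the top-level layout; return a list of non-fatal warnings."""
    for key, expected in TOP_LEVEL_TYPES.items():
        if key not in scan:
            raise ValueError(f"missing key: {key}")
        value = scan[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    # Entries of `requires` are lookup keys, so they must be plain strings.
    for name in scan["requires"]:
        if not isinstance(name, str):
            raise TypeError("entries of `requires` must be strings")
    if scan["version"] != SUPPORTED_VERSION:
        # A version bump may carry data required for a correct build.
        raise ValueError(f"unsupported version: {scan['version']}")
    warnings = []
    if scan["revision"] > KNOWN_REVISION:
        # A newer revision only adds helpful fields, so carrying on is safe.
        warnings.append(f"revision {scan['revision']} is newer than known")
    return warnings
```

The split mirrors the intent of the two fields: an unknown `version` is
a hard error, while an unknown `revision` is at most worth a warning.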

Defined formats (I'm fine with bikeshedding these names once the overall
format has been hammered out):

  - "raw8": interpret `data` as an array of uint8_t values to be passed
    to platform-specific filesystem APIs as an 8-bit encoding
  - "raw16": interpret `data` as an array of uint16_t values to be passed
    to platform-specific filesystem APIs as a 16-bit encoding
  - "raw32": interpret `data` as an array of uint32_t values to be passed
    to platform-specific filesystem APIs as a 32-bit encoding

This basically means "check if it is UTF-8, if it is, escape `\` and `"`
and output that, otherwise indicate the byte size of the data and write
it as an integer array".

In the future, we can add additional formats as a revision bump.
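
The encoding rule can be sketched in Python as follows. The function
names `encode_value` and `decode_value` are hypothetical, the producer
side only ever emits `raw8`, and the use of the host's native byte order
for `raw16`/`raw32` is an assumption, since the format description does
not specify one.

```python
import sys

# Bytes per integer for each format named above.
UNIT_SIZES = {"raw8": 1, "raw16": 2, "raw32": 4}


def encode_value(raw: bytes):
    """Producer side: a string for valid UTF-8, otherwise a `raw8` object."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"format": "raw8", "data": list(raw)}


def decode_value(value) -> bytes:
    """Consumer side: recover the bytes of the name or path."""
    if isinstance(value, str):
        return value.encode("utf-8")
    size = UNIT_SIZES[value["format"]]  # KeyError for an unknown format
    # Assumption: multi-byte integers use the host's native byte order.
    return b"".join(unit.to_bytes(size, sys.byteorder) for unit in value["data"])
```

When the result of `encode_value` is written with a JSON library (for
example Python's `json.dumps`), the escaping of `\` and `"` is handled by
the serializer rather than by hand.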

So,

  - Is anything missing from this format?
  - Is there any issue with getting this information from compilers?
  - Are any of the constraints too onerous on compilers?
  - Are there any constraints which should be added to make it even
    easier for build tools to parse/interpret this format?
  - For non-UTF-8 data, do we want to default to `raw8` format without
    one specified? Or should it always be required?
  - Are non-UTF-8 module names valid? Does anyone know what SG16 is
    saying about Unicode identifiers (which I presume would affect
    module names as well)?

Thanks,

--Ben

[1]https://mathstuf.fedorapeople.org/fortran-modules/fortran-modules.html
[2]Note that some flags to GCC can cause it to output multiple files for
a compilation step (such as -gsplit-dwarf). It is my hope that such
flags can be wired up to this facility in the future as I don't think it
is done right now.
[3]Additionally, CMake takes the `.gcm` files and places them elsewhere in
the tree via GCC's module map files. Object files are also placed via
`-o` which is otherwise occupied during the scan step at the moment, so
their paths may also need to be reinterpreted at the build tool level.

Received on 2019-03-04 23:57:57