sg15: Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Mon, 4 Mar 2019 18:37:58 -0500

It would also be nice to have something like "rebuild_hash": "opaque
string", such that if the hash is the same as a prior run, downstream nodes
don't need to be rebuilt. This would allow compilers to stuff information
in the BMI that shouldn't propagate rebuilds, such as comments that may
show up in warnings, but can't affect the compiled output. It may even be
worth having that as a per-output hash, to support cases using split-dwarf
(or other platform equivalents) when the debug info changes in a way that
would require re-running dwp, but not relinking downstream libraries or
binaries.
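
To make the idea concrete, here is a minimal sketch of how a build tool might use such a field. Everything in it is an assumption for illustration: no compiler emits `rebuild_hash` today, and `HashStore` and `downstream_needs_rebuild` are hypothetical names. The hash is treated as fully opaque, so only equality with the prior run's value matters.

```cpp
#include <map>
#include <string>

// Hypothetical store: output path -> opaque hash string from the prior run.
using HashStore = std::map<std::string, std::string>;

// Returns true if nodes downstream of `output` must be rebuilt, i.e. if
// there is no hash recorded from a prior run or the recorded hash differs.
// As a side effect, records `new_hash` for the next run.
bool downstream_needs_rebuild(HashStore& prior, const std::string& output,
                              const std::string& new_hash) {
    const auto it = prior.find(output);
    const bool changed = (it == prior.end()) || (it->second != new_hash);
    prior[output] = new_hash;
    return changed;
}
```

Keying the store by output path is what would allow the per-output variant: a changed hash for a split-dwarf output could trigger dwp alone, while an unchanged hash for the BMI leaves importers untouched.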

On Mon, Mar 4, 2019 at 5:57 PM Ben Boeckel via Modules <
modules_at_[hidden]> wrote:

> Hi,
>
> For CMake support for C++ modules, I've patched GCC so it outputs
> dependency information in a JSON format. Before going too far down this
> road, I'd like to get feedback on the format. This is for the purposes
> of being able to implement D1483R1[1] without requiring build tools to
> implement a C++ parser and instead have the compiler do the "scan" step
> described there.
>
>     {                                 //
>       "outputs": [                    // Files to be output for this
>         "source.o"                    // compilation[2].
>       ],                              //
>       "provides": [                   // BMI files provided by this
>         "I.gcm"                       // compilation.
>       ],                              //
>       "logical-provides": {           // Mapping of module names provided
>         "I": "I.gcm"                  // to provided BMI files.
>       },                              //
>       "requires": [                   // Module names required by this
>         "M"                           // compilation.
>       ],                              //
>       "depends": [                    // Preprocessor dependency files
>         "../path/to/source.cpp",      // which affect this scan (so it can
>         "/usr/include/stdc-predef.h"  // be rerun if necessary).
>       ],                              //
>       "version": 0,                   // The file format version.
>       "revision": 1                   // The file format revision.
>     }                                 //
>
> This example output is for a file with the contents:
>
>     export module I;
>     import M;
>
>     export int i() {
>         return m();
>     }
>
> My existing patch to GCC is currently missing `revision` and uses
> `version` == 1 (but my CMake patches also don't check the field right
> now). I'd like to get a Clang patch written up in the next few weeks.
>
> Points to note:
>
> - All top-level types are as-is and the key names are never localized:
>     * `outputs`: array
>     * `provides`: array
>     * `logical-provides`: object
>     * `requires`: array
>     * `depends`: array
>     * `version`: int
>     * `revision`: int
> - Values are strings if the name or path is valid UTF-8. The keys of
> `logical-provides` must be strings, therefore `requires` must also
> be only strings since these are used as lookup keys to find the
> on-disk file representing the listed `provide` (shouldn't be an
> issue since these are module names).
> - In the case of invalid UTF-8, an object is used with the following
> layout (all data here is literal and not localized):
>
>     {                     //
>       "format": "...",    // The format of the data.
>       "data": [...]       // Array of integers interpreted as
>     }                     // the appropriate integer size.
>
> - Relative paths are relative to the working directory of the
> compiler. A build tool may need to rewrite them into a form it
> actually understands.[3]
> - `version` is bumped if there is any semantic data added (e.g., more
> information which is required to get a correct build), types change,
> etc.
> - `revision` is bumped if an additional helpful, but not semantically
> important, field is added to the format.
>
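One possible reading of the `version`/`revision` rules from the consumer's side, as a rough sketch. The constants and the names `Compat` and `check_format` are hypothetical, and rejecting every mismatched `version` (older as well as newer) is an assumption; the proposal itself only says when each field is bumped.

```cpp
// Hypothetical values: the newest format this consumer was written against.
constexpr int kSupportedVersion = 0;
constexpr int kKnownRevision = 1;

enum class Compat { Ok, OkNewerRevision, Unsupported };

// A `version` mismatch means semantic data or types may differ, so the file
// is rejected. A newer `revision` only adds optional fields, so the file is
// still usable and the unknown fields can be ignored.
Compat check_format(int version, int revision) {
    if (version != kSupportedVersion) return Compat::Unsupported;
    if (revision > kKnownRevision) return Compat::OkNewerRevision;
    return Compat::Ok;
}
```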
> Defined formats (I'm fine with bikeshedding these names once the overall
> format has been hammered out):
>
> - "raw8": interpret `data` as an array of uint8_t bytes to be passed
> to platform-specific filesystem APIs as an 8-bit encoding
> - "raw16": interpret `data` as an array of uint16_t values to be passed
> to platform-specific filesystem APIs as a 16-bit encoding
> - "raw32": interpret `data` as an array of uint32_t values to be passed
> to platform-specific filesystem APIs as a 32-bit encoding
>
> This basically means "check if it is UTF-8, if it is, escape `\` and `"`
> and output that, otherwise indicate the byte size of the data and write
> it as an integer array".
>
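The rule quoted just above ("check if it is UTF-8 ... otherwise ... write it as an integer array") could be sketched on the producer side roughly as follows, for the 8-bit case only. `is_valid_utf8` and `encode_json_value` are hypothetical helpers, not part of any GCC or CMake patch. One deliberate addition beyond the quoted rule: control characters below 0x20 are also escaped, since JSON does not allow them raw inside strings.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80) { ++i; continue; }                 // ASCII
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                               // bad lead byte
        if (i + len > n) return false;                   // truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;       // bad continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Emits a JSON string for valid UTF-8, else a {"format": "raw8", ...} object
// whose `data` array lists every byte as an integer.
std::string encode_json_value(const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    if (is_valid_utf8(s)) {
        out += '"';
        for (const char ch : s) {
            const unsigned char c = static_cast<unsigned char>(ch);
            if (ch == '"' || ch == '\\') { out += '\\'; out += ch; }
            else if (c < 0x20) {                         // JSON control escape
                out += "\\u00"; out += hex[c >> 4]; out += hex[c & 0x0F];
            } else out += ch;
        }
        out += '"';
        return out;
    }
    out = "{\"format\": \"raw8\", \"data\": [";
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out += ", ";
        out += std::to_string(
            static_cast<unsigned>(static_cast<unsigned char>(s[i])));
    }
    out += "]}";
    return out;
}
```

With this sketch, `encode_json_value("I.gcm")` yields the plain string `"I.gcm"`, while a path containing a stray 0xFF byte comes out as a `raw8` object listing each byte.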
> In the future, we can add additional formats as a revision bump.
>
> So,
>
> - Is anything missing from this format?
> - Is there any issue with getting this information from compilers?
> - Are any of the constraints too onerous on compilers?
> - Are there any constraints which should be added to make it even
> easier for build tools to parse/interpret this format?
> - For non-UTF-8 data, do we want to default to `raw8` format without
> one specified? Or should it always be required?
> - Are non-UTF-8 module names valid? Does anyone know what SG16 is
> saying about Unicode identifiers (which I presume would affect
> module names as well)?
>
> Thanks,
>
> --Ben
>
> [1]https://mathstuf.fedorapeople.org/fortran-modules/fortran-modules.html
> [2]Note that some flags to GCC can cause it to output multiple files for
> a compilation step (such as -gsplit-dwarf). It is my hope that such
> flags can be wired up to this facility in the future as I don't think it
> is done right now.
> [3]Additionally, CMake takes the `.gcm` files and places them elsewhere in
> the tree via GCC's module map files. Object files are also placed via
> `-o` which is otherwise occupied during the scan step at the moment, so
> their paths may also need to be reinterpreted at the build tool level.
> _______________________________________________
> Modules mailing list
> Modules_at_[hidden]
> Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
> Link to this post: http://lists.isocpp.org/modules/2019/03/0123.php
>

Received on 2019-03-05 00:38:11