Date: Mon, 4 Mar 2019 17:57:53 -0500
Hi,
For CMake support for C++ modules, I've patched GCC so it outputs
dependency information in a JSON format. Before going too far down this
road, I'd like to get feedback on the format. This is for the purposes
of being able to implement D1483R1[1] without requiring build tools to
implement a C++ parser and instead have the compiler do the "scan" step
described there.
    {                                     //
      "outputs": [                        // Files to be output for this
        "source.o"                        // compilation[2].
      ],                                  //
      "provides": [                       // BMI files provided by this
        "I.gcm"                           // compilation.
      ],                                  //
      "logical-provides": {               // Mapping of module names provided
        "I": "I.gcm"                      // to provided BMI files.
      },                                  //
      "requires": [                       // Module names required by this
        "M"                               // compilation.
      ],                                  //
      "depends": [                        // Preprocessor dependency files
        "../path/to/source.cpp",          // which affect this scan (so it can
        "/usr/include/stdc-predef.h"      // be rerun if necessary).
      ],                                  //
      "version": 0,                       // The file format version.
      "revision": 1                       // The file format revision.
    }                                     //
This example output is for a file with the contents:
    export module I;
    import M;
    export int i() {
        return m();
    }
My existing patch to GCC is currently missing `revision` and uses
`version` == 1 (but my CMake patches also don't check the field right
now). I'd like to get a Clang patch written up in the next few weeks.
Points to note:
- All top-level types are as-is and the key names are never localized:
* `outputs`: array
* `provides`: array
* `logical-provides`: object
* `requires`: array
* `depends`: array
* `version`: int
* `revision`: int
- Values are strings if the name or path is valid UTF-8. The keys of
`logical-provides` must be strings, therefore `requires` must also
be only strings since these are used as lookup keys to find the
on-disk file representing the listed `provide` (shouldn't be an
issue since these are module names).
- In the case of invalid UTF-8, an object is used with the following
layout (all data here is literal and not localized):
    {                     //
      "format": "...",    // The format of the data.
      "data": [...]       // Array of integers interpreted as
    }                     // the appropriate integer size.
- Relative paths are relative to the working directory of the
compiler. Build tools may need to rewrite these paths before they
can actually use them.[3]
- `version` is bumped if there is any semantic data added (e.g., more
information which is required to get a correct build), types change,
etc.
- `revision` is bumped if an additionally helpful, but not semantically
important, field is added to the format.
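To make the lookup and versioning points above concrete, here is a rough sketch of how a build tool might consume already-parsed scan results. The `ScanResult` struct and the function names are hypothetical and not part of any patch, and JSON parsing is left out. Rejecting any `version` other than the one the consumer knows, while accepting any `revision`, is my reading of the rules above rather than something the format mandates.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one scan result (only the fields used here).
struct ScanResult {
    int version = 0;
    int revision = 0;
    std::map<std::string, std::string> logical_provides;  // module name -> BMI path
    std::vector<std::string> required;                    // the `requires` array
};

// Assumption: this consumer was written against `version` 0.
const int kSupportedVersion = 0;

// A different `version` may carry semantic data we do not understand, so it
// is rejected. `revision` is ignored because it only adds optional fields.
bool is_supported(const ScanResult& scan) {
    return scan.version == kSupportedVersion;
}

// Merge every scan's `logical-provides` into one module-name -> BMI map.
// If two scans claim the same module, the first one seen is kept.
std::map<std::string, std::string> collate_provides(
        const std::vector<ScanResult>& scans) {
    std::map<std::string, std::string> provides;
    for (const ScanResult& scan : scans) {
        provides.insert(scan.logical_provides.begin(),
                        scan.logical_provides.end());
    }
    return provides;
}

// Use each name in `requires` as a lookup key. Returns the BMI paths found;
// names that no scanned file provides are appended to `missing`.
std::vector<std::string> resolve_requires(
        const ScanResult& scan,
        const std::map<std::string, std::string>& provides,
        std::vector<std::string>& missing) {
    assert(is_supported(scan));  // callers are expected to check this first
    std::vector<std::string> bmis;
    for (const std::string& name : scan.required) {
        std::map<std::string, std::string>::const_iterator it =
            provides.find(name);
        if (it == provides.end()) {
            missing.push_back(name);
        } else {
            bmis.push_back(it->second);
        }
    }
    return bmis;
}
```

For the example source above, `resolve_requires` would look up "M" and need some other scan to list it under `logical-provides`; otherwise "M" ends up in `missing`.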
Defined formats (I'm fine with bikeshedding these names once the overall
format has been hammered out):
- "raw8": interpret `data` as an array of uint8_t values to be passed
to platform-specific filesystem APIs as an 8-bit encoding
- "raw16": interpret `data` as an array of uint16_t values to be passed
to platform-specific filesystem APIs as a 16-bit encoding
- "raw32": interpret `data` as an array of uint32_t values to be passed
to platform-specific filesystem APIs as a 32-bit encoding
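On the consuming side, the formats above could be handled roughly as follows. This is only a sketch: `element_width` and `decode_units` are made-up names, the integers are assumed to have been pulled out of the JSON `data` array already, and throwing on an out-of-range element is my choice, since the format does not say what a consumer should do with one.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Size in bytes of one `data` element for a format name; 0 if unknown.
std::size_t element_width(const std::string& format) {
    if (format == "raw8") return 1;
    if (format == "raw16") return 2;
    if (format == "raw32") return 4;
    return 0;
}

// Convert the integers of `data` into a string of code units of type Char.
// `max_value` is the largest value one element may hold (0xFF for raw8,
// 0xFFFF for raw16, 0xFFFFFFFF for raw32); larger elements are rejected.
template <typename Char>
std::basic_string<Char> decode_units(const std::vector<std::uint64_t>& data,
                                     std::uint64_t max_value) {
    std::basic_string<Char> out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] > max_value) {
            throw std::out_of_range("element does not fit the declared format");
        }
        out.push_back(static_cast<Char>(data[i]));
    }
    return out;
}
```

A "raw8" value would then be decoded with `decode_units<char>(data, 0xFF)` and handed to a byte-oriented filesystem API, while "raw16" would use `decode_units<char16_t>(data, 0xFFFF)`.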
This basically means "check if it is UTF-8, if it is, escape `\` and `"`
and output that, otherwise indicate the byte size of the data and write
it as an integer array".
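As a sketch of what that could look like on the producing side (this is not the code in my GCC patch; `is_valid_utf8` and `encode_value` are illustrative names, and only the "raw8" fallback for byte-oriented paths is shown). One addition beyond the description above: JSON also requires control characters below U+0020 to be escaped, so the sketch emits those as `\u00XX`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

// Strict UTF-8 check: rejects stray or missing continuation bytes, overlong
// forms, surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }
        std::size_t len;
        std::uint32_t cp;
        std::uint32_t min;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;
        if (i + len > n) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += len;
    }
    return true;
}

// Emit a name or path as a JSON string when it is valid UTF-8, otherwise as
// a {"format": "raw8", "data": [...]} object listing the raw bytes.
std::string encode_value(const std::string& bytes) {
    std::ostringstream out;
    if (is_valid_utf8(bytes)) {
        out << '"';
        for (std::size_t i = 0; i < bytes.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if (c == '"' || c == '\\') {
                out << '\\' << bytes[i];
            } else if (c < 0x20) {
                char buf[7];
                std::snprintf(buf, sizeof buf, "\\u%04x",
                              static_cast<unsigned>(c));
                out << buf;
            } else {
                out << bytes[i];
            }
        }
        out << '"';
    } else {
        out << "{\"format\": \"raw8\", \"data\": [";
        for (std::size_t i = 0; i < bytes.size(); ++i) {
            if (i != 0) out << ", ";
            out << static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
        }
        out << "]}";
    }
    return out.str();
}
```

A compiler targeting a platform with 16-bit native paths would presumably do the same thing with "raw16" and wider code units.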
In the future, we can add additional formats as a revision bump.
So,
- Is anything missing from this format?
- Is there any issue with getting this information from compilers?
- Are any of the constraints too onerous on compilers?
- Are there any constraints which should be added to make it even
easier for build tools to parse/interpret this format?
- For non-UTF-8 data, do we want to default to `raw8` format without
one specified? Or should it always be required?
- Are non-UTF-8 module names valid? Does anyone know what SG16 is
saying about Unicode identifiers (which I presume would affect
module names as well)?
Thanks,
--Ben
[1]https://mathstuf.fedorapeople.org/fortran-modules/fortran-modules.html
[2]Note that some flags to GCC can cause it to output multiple files for
a compilation step (such as -gsplit-dwarf). It is my hope that such
flags can be wired up to this facility in the future as I don't think it
is done right now.
[3]Additionally, CMake takes the `.gcm` files and places them elsewhere in
the tree via GCC's module map files. Object files are also placed via
`-o` which is otherwise occupied during the scan step at the moment, so
their paths may also need to be reinterpreted at the build tool level.
Received on 2019-03-04 23:57:57