sg15: Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Ben Boeckel <ben.boeckel_at_[hidden]>
Date: Wed, 6 Mar 2019 08:41:01 -0500

On Tue, Mar 05, 2019 at 12:50:36 -0800, Richard Smith wrote:
> On Mon, Mar 4, 2019 at 2:58 PM Ben Boeckel via Modules <
> modules_at_[hidden]> wrote:
> > For CMake support for C++ modules, I've patched GCC so it outputs
> > dependency information in a JSON format. Before going too far down this
> > road, I'd like to get feedback on the format. This is for the purposes
> > of being able to implement D1483R1[1] without requiring build tools to
> > implement a C++ parser and instead have the compiler do the "scan" step
> > described there.
>
> To be clear: this would be a generalization of the existing -M
> functionality (in principle, with or without -MD, but the expectation is
> that this would generally be used without -MD), to cover both dependency
> files and also dependency modules, with the understanding that this can be
> requested even (especially) in cases where the prerequisite modules have
> not yet been compiled. Right?

Yes, it's just another output format. The flags for GCC are
(essentially):

g++ $CXXFLAGS $CPPFLAGS -MD -MF $JSON_OUTPUT -fdep-format=trtbd -E -fmodules-ts $SOURCE -o /dev/null

The name `trtbd` comes from "TR to-be-determined" since I expect it to
be documented in the TR. This could certainly be cleaned up at some
point with better flag spelling, but that's not my main goal at the
moment.
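
For a build tool driving the scan, the invocation above might be assembled as an argument list along these lines. This is only a sketch: the helper name `scan_command`, the file names, and the `-std=c++2a` flag are made up for illustration, and `-fdep-format=trtbd` is understood only by the patched GCC described here.

```python
import shlex


def scan_command(source, json_output, cxxflags=(), cppflags=(), compiler="g++"):
    """Build the argv for the dependency scan step (hypothetical helper)."""
    return [
        compiler,
        *cxxflags,
        *cppflags,
        "-MD", "-MF", json_output,  # write dependency info to json_output
        "-fdep-format=trtbd",       # JSON format; patched GCC only
        "-E", "-fmodules-ts",       # preprocess only, with modules enabled
        source,
        "-o", "/dev/null",          # discard the preprocessed output
    ]


cmd = scan_command("foo.cpp", "foo.cpp.json", cxxflags=["-std=c++2a"])
print(shlex.join(cmd))
```

Passing a list rather than a shell string avoids quoting problems when paths contain spaces.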

> > - "raw8": interpret `data` as an array of uint8_t values to be passed
> > to platform-specific filesystem APIs as an 8-bit encoding
> > - "raw16": interpret `data` as an array of uint16_t values to be passed
> > to platform-specific filesystem APIs as a 16-bit encoding
> > - "raw32": interpret `data` as an array of uint32_t values to be passed
> > to platform-specific filesystem APIs as a 32-bit encoding
>
> Are there any platforms that have APIs expecting a 32-bit encoding? I would
> expect UTF-8 (hopefully the common case), raw8 (for non-Windows), and raw16
> (for Windows) to be the only formats we need.

I'm OK with that. A future version could certainly add the `raw32`
format.
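
As a rough sketch of the writer side with only UTF-8, `raw8`, and `raw16`: emit a plain JSON string when the path is valid Unicode, otherwise fall back to an integer array tagged with its width. The `data` key comes from the proposal; the `format` key name and both function names are assumptions made for this example.

```python
import json


def encode_path(raw: bytes):
    """Encode an 8-bit path: a string if valid UTF-8, else a raw8 array."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # "format" is a hypothetical key name; "data" is from the proposal.
        return {"format": "raw8", "data": list(raw)}


def encode_wide_path(units):
    """Encode a 16-bit path: a string if valid UTF-16, else a raw16 array."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        return raw.decode("utf-16-le")
    except UnicodeDecodeError:  # e.g. an unpaired surrogate
        return {"format": "raw16", "data": list(units)}


# json.dumps takes care of escaping `\` and `"` in the string case.
print(json.dumps(encode_path(b'dir/a"b\\c.cpp')))
print(json.dumps(encode_path(b"caf\xe9.cpp")))
print(json.dumps(encode_wide_path([0x66, 0xD800])))
```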

> > This basically means "check if it is UTF-8, if it is, escape `\` and `"`
> > and output that, otherwise indicate the byte size of the data and write
> > it as an integer array".
> >
> > In the future, we can add additional formats as a revision bump.
> >
> > So,
> >
> > - Is anything missing from this format?
> > - Is there any issue with getting this information from compilers?
> >
>
> A compiler will need to be told which header files are to be treated as
> modular headers. But I think that's a separate problem from the one you're
> solving here, and we should have a separate data format for the build
> system to use to configure the compiler to perform a build (including

I'm expecting `#include` to always be preprocessor textual includes at
the moment. If a header should be treated as a module, use `import`.

> specifying which headers are modular, what flags -- especially -D flags --
> to use to build them, how to map from module names to BMI names, and so on).

In my mind, flags to build header modules are provided by the compile
rule that generates the BMI for them. If the compiler cannot find the
BMI for a header module, I want it to error. If the given BMI is not
valid for the current compilation (e.g., incompatible compile flags), I
want it to error. The module map format for GCC is nice. Last I heard,
Clang module maps are basically response files with the special compiler
flags? That would work just as well.

> > - Are any of the constraints too onerous on compilers?
>
> I don't think it makes sense for the compiler to tell the build system
> about input and output BMI files, especially not during a prescan phase --
> deciding where intermediate input and output files live should be the build
> system's job, not the compiler's. It also seems unnecessary to me for the
> compiler to repeat information that it was told on its command line, if the
> purpose of this information is just to feed back into the build system's
> dependency collection phase. So I think the only information that we need
> here is a list of dependency files and a list of dependency module names.

We also need the provided module name(s). The preprocessor dependency
information is also important for knowing when the scan needs to rerun
(standard -MD stuff). I'll think about how the format would look if the
compiler leaves module BMI file names completely up to the build tool
(gut feeling is that `logical-provides` is not there at all and
`provides` is a list of module names, not BMI filenames; the build tool
recognizes this and makes whatever names it wants). GCC has default
file names and I'm fine with that (CMake ultimately chooses the output
directory via `-o` for the object, and for BMI files via the
`-fmodule-mapper=` file it generates and adds to the compile
command).
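
To make the "build tool picks the names" variant concrete, here is a minimal sketch of what the build tool could do with scan results that carry only module names. The record layout (`source`, `provides`, `requires`), the `bmi/` directory, and the `.bmi` extension are all assumptions for illustration, not part of the proposed format.

```python
def assign_bmi_paths(scan_results, bmi_dir="bmi", ext=".bmi"):
    """Map each provided module name to a BMI path chosen by the build tool."""
    bmi_for = {}
    for result in scan_results:
        for name in result.get("provides", []):
            # ":" (module partitions) is not safe in file names everywhere.
            bmi_for[name] = f"{bmi_dir}/{name.replace(':', '-')}{ext}"
    return bmi_for


def bmi_inputs(scan_results, bmi_for):
    """For each source, list the BMI files its compile step must wait for."""
    inputs = {}
    for result in scan_results:
        required = result.get("requires", [])
        missing = [m for m in required if m not in bmi_for]
        if missing:
            raise LookupError(f"{result['source']}: no provider for {missing}")
        inputs[result["source"]] = [bmi_for[m] for m in required]
    return inputs


scans = [
    {"source": "a.cpp", "provides": ["a"], "requires": []},
    {"source": "b.cpp", "provides": [], "requires": ["a"]},
]
bmi_for = assign_bmi_paths(scans)
inputs = bmi_inputs(scans, bmi_for)
print(bmi_for, inputs)
```

The resulting name-to-path table is the kind of information the build tool would then hand back to the compiler, for GCC through the module mapper file.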

> > - Are there any constraints which should be added to make it even
> > easier for build tools to parse/interpret this format?
> > - For non-UTF-8 data, do we want to default to `raw8` format without
> > one specified? Or should it always be required?
>
> Maybe we could allow either a string or an array of integers, where those
> integers will be of the right width for the platform-specific file system
> APIs? (It doesn't seem like we need to specify the format in the JSON file,
> since we don't expect to be able to use the JSON file on a machine running
> on a different platform.)

How do you know whether you're supposed to call CreateFileA or
CreateFileW from just an integer array? What if it is wide, but
all integer values happen to fit in 8 bits? My gut feeling is that
codepages may interfere here, but maybe it is just simple integer
promotion?
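
A small illustration of the ambiguity: the same integer array yields different byte sequences depending on the width the reader assumes, and nothing in the array itself says which one was meant. The helper name `as_path_bytes` is made up for this example.

```python
def as_path_bytes(units, width):
    """Serialize code units as little-endian integers of `width` bytes."""
    return b"".join(u.to_bytes(width, "little") for u in units)


units = [102, 111, 111]           # "foo"; every value fits in 8 bits

narrow = as_path_bytes(units, 1)  # what an 8-bit API would be given
wide = as_path_bytes(units, 2)    # what a 16-bit API would be given

print(narrow)  # b'foo'
print(wide)    # b'f\x00o\x00o\x00'
```

Both interpretations are well-formed here, which is why an explicit `raw8` or `raw16` tag (or a fixed per-platform rule) is needed to pick the right API.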

> > - Are non-UTF-8 module names valid? Does anyone know what SG16 is
> > saying about Unicode identifiers (which I presume would affect
> > module names as well)?
>
> C++ has allowed Unicode characters in identifiers (with some restrictions,
> see http://eel.is/c++draft/lex.name#1) since C++98, and has never allowed
> non-Unicode characters in identifiers. All module names can be encoded in
> UTF-8, and I don't think we need to plan for the eventuality that that
> changes.

That's great news.

Thanks,

--Ben

Received on 2019-03-06 14:41:04