Subject: Re: [SG15] [isocpp-modules] FW: BMI distribution (reuse) and reading BMI data
From: Tom Honermann (tom_at_[hidden])
Date: 2019-06-07 11:52:26
Thanks for forwarding, Olga.
I talked about the impact to Coverity at the last teleconference. The
following summarizes that discussion.
Coverity is a static source code analyzer. In order for us to analyze
source code, we need to parse it and interpret it the same way (or as
close as possible) as the native compilers used to compile it do (this
is necessary to avoid false positives due to implementation defined
things). That means that we have to emulate every version of every
compiler that we support. We support *many* C++ compilers (there are a
lot more compilers than just gcc, Clang, and MSVC).
We learn how to parse a particular source code base by monitoring a
build process in action and observing compiler invocations. We refer to
this as "capturing" the build. For each observed compiler invocation,
we analyze the command line for the native compiler and translate that
to an equivalent command line for invoking our own front end with all
the options needed to emulate the native compiler. We then invoke our
own front end to construct an AST and store it in a database for later
analysis.
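
As a rough illustration of the command line translation step described
above (a sketch only, not Coverity's implementation): the front end name
"my-frontend", its "--emulate" and "--native-compiler" options, and the
tiny option table are all hypothetical, and only joined option forms such
as "-Iinc" are handled.

```python
import shlex

# Options that influence parsing and are copied through unchanged.
# Hypothetical and far from complete; real tables are per-compiler.
PASS_THROUGH_PREFIXES = ("-I", "-D", "-U", "-std=")


def translate(native_cmd: str, dialect: str) -> list[str]:
    """Build a front-end command line from one observed compiler invocation."""
    argv = shlex.split(native_cmd)
    compiler, args = argv[0], argv[1:]
    out = ["my-frontend", f"--emulate={dialect}", f"--native-compiler={compiler}"]
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True       # native output path is irrelevant here
        elif arg.startswith(PASS_THROUGH_PREFIXES):
            out.append(arg)        # include paths, macros, language standard
        elif not arg.startswith("-"):
            out.append(arg)        # source file
        # This sketch silently drops every other option. A real tool cannot:
        # options such as optimization levels may change predefined macros.
    return out
```

For example, translate("g++ -std=c++20 -O2 -Iinc -DNDEBUG -c a.cpp -o a.o",
"gcc") keeps "-std=c++20", "-Iinc", "-DNDEBUG" and "a.cpp" and drops the
rest.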
As long as we observe construction of a BMI, we don't have a problem as
we get the opportunity to construct our own BMIs for later use and we
can figure out how to map consumption of a previously produced BMI to
our own variant. We already do this to support Pre-Compiled Header (PCH)
files. However, use of BMIs not generated during the build poses a major
problem for us unless we have some (reasonably easy) way to consume the BMI.
In general, directly consuming a BMI is not realistic for us. Most
compiler implementors are not committing to a stable BMI format and it
is not economically feasible for us to be able to consume M different
BMI versions from N different BMI formats (N probably corresponding to
the number of compiler implementors). If all implementors were to adopt
a single stable BMI format, we could potentially consume them directly,
but previous discussions among implementors have indicated this is very
unlikely to happen.
Indirectly consuming a BMI may be feasible for us. For this to work, we
would need to be able to interrogate a BMI and extract at least the
following:
1. The command line of the native compiler used to produce the BMI.
2. The current working directory for the native compiler invocation.
3. The set of environment variables active for the native compiler
invocation (we need the full set of environment variables to ensure
that we can unset environment variables that could influence behavior).
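
To show why the full environment matters, here is a minimal sketch of
replaying an invocation from the three items above. The BmiProvenance
record and the replay helper are hypothetical names; the point is only
that passing env= to subprocess.run replaces the inherited environment,
so any variable not in the record is effectively unset.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class BmiProvenance:
    """Hypothetical record of how a BMI was produced."""
    command_line: list[str]        # item 1: native compiler command line
    working_directory: str         # item 2: cwd of that invocation
    environment: dict[str, str]    # item 3: full environment at that time


def replay(p: BmiProvenance, argv: list[str]) -> subprocess.CompletedProcess:
    """Run argv (e.g. a translated front-end command) as the BMI was built."""
    return subprocess.run(
        argv,
        cwd=p.working_directory,
        env=p.environment,         # replaces, rather than extends, os.environ
        capture_output=True,
        text=True,
        check=False,
    )
```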
The indirect consumption approach implies that (all of) the source code
must be available either at the original filesystem paths or within the
BMI itself (the latter case would require us to implement our own
virtual filesystem layer to access the bundled source code; we can't
rely on extracting it to the customer's file system). In some cases, we
may still require the native compiler version used to produce the BMI be
installed on the system because we depend on probing the native compiler
to determine how to emulate certain behaviors. However, if implementors
constrain their support of distributable BMIs to matching versions of
the native compiler, then this latter concern is not a problem, since
the compiler version for the invocation that is consuming the BMI
already matches (closely enough) the version used to produce the BMI.
The above information would enable us, upon observing consumption of
a BMI whose construction we did not observe, to use the
information from the BMI to construct our own variant. We still face
some challenges in this area (for example, data races generating our own
variant when capturing a parallel build), but I believe such technical
difficulties are not a large impediment.
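
One common way to handle the parallel-build race mentioned above is to
build into a temporary file and publish it with an atomic rename. The
sketch below assumes that approach (it is not a description of what
Coverity does); publish_once and its build callback are hypothetical.
Two racing processes may both do the work, and the last rename wins, but
a reader never sees a partially written file.

```python
import os
import tempfile


def publish_once(target: str, build) -> bool:
    """Create target via build(path) unless it already exists.

    Returns False if target existed when checked, True if this call
    built and renamed a file into place.
    """
    if os.path.exists(target):
        return False
    directory = os.path.dirname(os.path.abspath(target))
    # The temporary file lives in the same directory so that the rename
    # below stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    os.close(fd)
    try:
        build(tmp)
        os.replace(tmp, target)    # atomic replacement of the destination
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)         # do not leave partial output behind
        raise
    return True
```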
What would be helpful is if each implementor were to provide a utility
with a common interface (like c++filt) that enabled extraction of the
above information from a BMI. Perhaps the tool would produce JSON
output with the above command line, working directory, and environment
variable information, and offer the ability to generate a .zip file
with the bundled source code.
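
To make the idea concrete, here is one possible shape for such JSON
output and a consumer that validates it. No such tool or schema exists
today; the field names and the sample values are invented purely for
illustration.

```python
import json

# Made-up sample of what an extraction tool might print for one BMI.
SAMPLE = """{
  "command_line": ["c++", "-std=c++20", "-c", "m.cppm"],
  "working_directory": "/src/project",
  "environment": {"PATH": "/usr/bin", "LANG": "C"}
}"""

# Hypothetical required fields and their expected JSON types.
REQUIRED = {"command_line": list, "working_directory": str, "environment": dict}


def parse_bmi_info(text: str) -> dict:
    """Parse the tool output and check that the required fields are present."""
    info = json.loads(text)
    for key, kind in REQUIRED.items():
        if not isinstance(info.get(key), kind):
            raise ValueError(f"missing or malformed field: {key}")
    return info
```

A common schema of this kind would let a tool vendor write one consumer
instead of one per compiler.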
I would be interested in hearing from others how useful the above would
be for other tool providers.
On 5/31/19 6:37 PM, Olga Arkhipova via Modules wrote:
> Re-sending to SG15 and modules lists
> Also attached the replies/questions on the old tooling group.
> We discussed the topic at the last SG15 meeting and I'd like to
> collect more opinions/questions/scenarios where BMI reuse or at least
> extracting some data might be necessary or beneficial.
> *From:* Olga Arkhipova
> *Sent:* Thursday, May 23, 2019 5:31 PM
> *To:* Tooling_at_[hidden]
> <mailto:Tooling_at_[hidden]>; Gabriel Dos Reis
> <gdr_at_[hidden] <mailto:gdr_at_[hidden]>>; Anna Gringauze
> <annagrin_at_[hidden] <mailto:annagrin_at_[hidden]>>; Lukasz
> Mendakiewicz <lukaszme_at_[hidden] <mailto:lukaszme_at_[hidden]>>;
> Cameron DaCamara <Cameron.DaCamara_at_[hidden]
> *Subject:* BMI distribution and reading BMI data
> Hi all,
> I'd like to discuss the BMI usage and distribution (reuse) topics.
> Can we do it at tomorrow's SG15 meeting? Or later?
> *BMI distribution (reuse)*
> Currently, built modules (BMI) are very similar to static libraries
> from build perspective:
> 1. They are specific to the compiler version (i.e. can only be used by
> the compiler binary compatible with the one which produced them)
> 2. If they depend on other modules, their BMIs need to be present too
> for successful build.
> 3. A number of compiler switches which were used to build the module
> should match the compiler switches for the source which uses this module.
> So distribution of BMIs currently has similar limitations as the
> distribution of built static libraries:
> · has strict requirements on the compiler and other used libraries versions
> · limited to the platforms, architectures and #defines it is built for.
> The BMI distribution definitely has performance advantage for the
> builds which meet all restrictions and requirements, i.e. the same
> ones which can use built static libraries.
> If BMI's restriction of the specific compiler and exact command line
> can be weakened somehow, or at least some data can be extracted from
> all BMIs, the performance advantage of the BMI distribution can be wider.
> *Scenarios where extracting at least some data from BMIs is needed*
> VS IntelliSense (EDG)
> Visual Studio and VS Code support not only MSVC, but also clang and gcc.
> VS is using EDG compiler as intellisense engine, which currently
> supports MSVC, Clang and gcc modes. As performance of EDG compilation
> is very critical, ideally, EDG should be able to use modules already
> built by MSVC, clang and gcc.
> Linters (as-you-type code analysis) require additional data specified
> in the source (annotations, pragmas, attributes, contracts). Ideally,
> this information should always be present in the BMI, independent of
> whether the code has been compiled for analysis or code generation.
>   o Alternative: producing a new BMI for linters and IntelliSense, making
>     it slower.
> Note: MSVC will have an option to include the original input source
> (not TU-expanded) into the IFCs.
> Clang-cl and clang-gcc
> Currently, clang-cl is (almost) ABI compatible with cl. I believe the
> same is true for clang and gcc.
> Should Clang-cl be able to use modules produced by MSVC and vice versa?
> Should Clang-gcc be able to use modules produced by gcc and vice versa?
> Build systems
> If BMIs are distributed together with their sources (like modules for
> MS standard libs) build systems might want to check if the available
> BMIs are actually compatible with the current build settings and if
> not, produce a different BMI from the source
> Static analysis (background code analysis, code analysis at build)
> Static analysis often requires additional data computed from the
> source which is normally not stored in the BMI. Such additional
> information is produced by static analysis tools during a separate
> analysis phase of the module and needs to be stored into a different
> BMI file.
>   o Alternative 1: Always adding extra info which is not always needed is
>     an unjustifiable performance expense.
>   o Alternative 2: Adding this information to already created BMI file
>     creates build system complications.
> The format of additional information is not defined at this point;
> the tools decide how to read and write the data. We recommend storing
> the data in a way that can be consumed by other tools and compilers.
> Other scenarios?
> *BMI data to extract*
> At minimum, the following data should be extractable from any BMI
> · Module source file name/location (as it was during the module build)
> · Compilation options used to build BMI, especially the ones which have
>   to match in the source using this module (#defines, etc.)
> · Referenced modules and their BMIs (unless included into compilation
> · Static analysis data
>   o Imported header units and the referenced header
>   o Additional information stored by static analysis tools
>   o Anything that is possible to extract from header files using a
>     compiler frontend, i.e. types, symbols, ASTs, source information,
>     annotations (declspecs), pragmas, attributes, etc. (for consumption by
>     code analysis)
> Should we encourage the compiler vendors to provide a way to do this
> for the BMIs they produce?
> Modules mailing list
> Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
> Link to this post: http://lists.isocpp.org/modules/2019/05/0430.php