sg15: Re: [SG15] [isocpp-modules] FW: BMI distribution (reuse) and reading BMI data

From: Olga Arkhipova <olgaark_at_[hidden]>
Date: Tue, 11 Jun 2019 23:30:29 +0000

Thanks Tom.
Yes, looks like getting all build info/environment/sources from a BMI will be needed in many scenarios.

In one of the mails Ben Craig asked:
>> What are the current restrictions with regards to static libraries and link time optimization?
>>Do those restrictions also apply to modules and link time optimizations?

The word from our compiler guys: we don't recommend distributing static libraries built with /GL, as they can be used only with the MSVC toolset of exactly the same version (even the same build number) that built them. Otherwise there is no guarantee they will work.

The module .obj (if it is produced as part of the module build) definitely has the same restrictions.

Boris Kolpackov wrote:

>>Hm, never thought of BMIs as being similar to static libraries.

>>To me, static libraries supply the implementation while BMIs supply the interface as well as the "inline implementation"

>>(i.e., implementation that is expected to be compiled by the consumer, not the supplier).

>>From this perspective, BMIs seems to be more like precompiled headers than static libraries.

Yes, currently BMIs are more like precompiled headers - but can they be better/more reusable?

What I also tried to say by comparing BMIs and static libs is that if people are already using static libraries built by somebody else (i.e. they use matching build tools), they should be able to use built modules for those libraries as well. This means that library vendors might include BMIs together with the module sources and static libs, and a build consuming those libraries might choose to use the BMIs instead of rebuilding them.

For instance, MSVC is shipping BMIs (.ifc) for the standard libraries (as an experimental feature). As they come together with the toolset and don't depend on anything else, they are of course easier to use than in the general case. But I guess there might be other similar cases.

Olga

From: Modules <modules-bounces_at_[hidden]> On Behalf Of Tom Honermann
Sent: Friday, June 7, 2019 9:52 AM
To: modules_at_[hidden]; sg15_at_[hidden]
Cc: Olga Arkhipova <olgaark_at_[hidden]>
Subject: Re: [isocpp-modules] FW: BMI distribution (reuse) and reading BMI data

Thanks for forwarding, Olga.

I talked about the impact to Coverity at the last teleconference. The following summarizes that discussion.

Coverity is a static source code analyzer. In order for us to analyze source code, we need to parse it and interpret it the same way (or as close as possible) as the native compilers used to compile it do (this is necessary to avoid false positives due to implementation-defined behavior). That means that we have to emulate every version of every compiler that we support. We support *many* C++ compilers (there are a lot more compilers than just gcc, Clang, and MSVC).

We learn how to parse a particular source code base by monitoring a build process in action and observing compiler invocations. We refer to this as "capturing" the build. For each observed compiler invocation, we analyze the command line for the native compiler and translate that to an equivalent command line for invoking our own front end with all the options needed to emulate the native compiler. We then invoke our own front end to construct an AST and store it in a database for later analysis.

As long as we observe construction of a BMI, we don't have a problem, as we get the opportunity to construct our own BMIs for later use and we can figure out how to map consumption of a previously produced BMI to our own variant. We already do this to support Pre-Compiled Header (PCH) files. However, use of BMIs not generated during the build poses a major problem for us unless we have some (reasonably easy) way to consume the BMI.

In general, directly consuming a BMI is not realistic for us. Most compiler implementors are not committing to a stable BMI format and it is not economically feasible for us to be able to consume M different BMI versions from N different BMI formats (N probably corresponding to the number of compiler implementors). If all implementors were to adopt a single stable BMI format, we could potentially consume them directly, but previous discussions among implementors have indicated this is very unlikely to happen.

Indirectly consuming a BMI may be feasible for us. For this to work, we would need to be able to interrogate a BMI and extract at least the following:

  1. The command line of the native compiler used to produce the BMI.
  2. The current working directory for the native compiler invocation.
  3. The set of environment variables active for the native compiler invocation (we need the full set of environment variables to ensure that we can unset environment variables that could influence behavior).

The indirect consumption approach implies that (all of) the source code must be available either at the original filesystem paths or within the BMI itself (the latter case would require us to implement our own virtual filesystem layer to access the bundled source code; we can't rely on extracting it to the customer's file system). In some cases, we may still require the native compiler version used to produce the BMI to be installed on the system because we depend on probing the native compiler to determine how to emulate certain behaviors. However, if implementors constrain their support of distributable BMIs to matching versions of the native compiler, then this latter concern is not a problem (since the compiler version for the invocation that is consuming the BMI already matches (closely enough) the version used to produce the BMI).

The above information would enable us, upon observing consumption of a BMI that we did not observe the construction of, to use the information from the BMI to construct our own variant. We still face some challenges in this area (for example, data races generating our own variant when capturing a parallel build), but I believe such technical difficulties are not a large impediment.

What would be helpful is if each implementor were to provide a utility with a common interface (like c++filt) that enabled extraction of the above information from a BMI. Perhaps the tool would produce JSON output with the above information (command line, working directory, environment variables, etc.) and offer the ability to generate a .zip file with the bundled source code.
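
As a rough illustration of the indirect-consumption idea above, the Python sketch below parses the output such a utility might print and turns it into a replayable invocation. No vendor has defined such a format: the JSON keys (command_line, working_directory, environment), the type BmiBuildInfo, and the front end name "my-front-end" are all assumptions made up for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class BmiBuildInfo:
    """Build context recovered from a BMI (hypothetical schema)."""
    command_line: list
    working_directory: str
    environment: dict


def parse_bmi_build_info(text):
    """Parse the JSON that a hypothetical vendor extraction tool might print."""
    data = json.loads(text)
    return BmiBuildInfo(
        command_line=list(data["command_line"]),
        working_directory=data["working_directory"],
        environment=dict(data["environment"]),
    )


def replay_arguments(info, front_end):
    """Build keyword arguments for subprocess.run that replay the invocation
    with another front end in place of the native compiler.

    Passing the recorded environment as `env` replaces the current one
    entirely, so variables set now but absent when the BMI was built are
    effectively unset.
    """
    return {
        "args": [front_end, *info.command_line[1:]],
        "cwd": info.working_directory,
        "env": dict(info.environment),
    }


# Sample data invented for this example; paths and values are placeholders.
SAMPLE = """
{
  "command_line": ["cl.exe", "/c", "/std:c++latest", "m.ixx"],
  "working_directory": "C:/src/lib",
  "environment": {"INCLUDE": "C:/sdk/include"}
}
"""

info = parse_bmi_build_info(SAMPLE)
kwargs = replay_arguments(info, "my-front-end")
print(kwargs["args"])
```

A real tool would translate the native options rather than pass them through unchanged; the pass-through here only keeps the sketch short.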

I would be interested in hearing from others how useful the above would be for other tool providers.
Tom.

On 5/31/19 6:37 PM, Olga Arkhipova via Modules wrote:
Re-sending to SG15 and modules lists
Also attached the replies/questions on the old tooling group.

We discussed the topic at the last SG15 meeting and I'd like to collect more opinions/questions/scenarios where BMI reuse, or at least extracting some data, might be necessary or beneficial.

Thanks,
Olga

From: Olga Arkhipova
Sent: Thursday, May 23, 2019 5:31 PM
To: Tooling_at_[hidden]; Gabriel Dos Reis <gdr_at_[hidden]>; Anna Gringauze <annagrin_at_[hidden]>; Lukasz Mendakiewicz <lukaszme_at_[hidden]>; Cameron DaCamara <Cameron.DaCamara_at_[hidden]>
Subject: BMI distribution and reading BMI data

Hi all,
I'd like to discuss the BMI usage and distribution (reuse) topics - can we do this at tomorrow's SG15 meeting? Or later?
Thanks,
Olga

BMI distribution (reuse)

Currently, built modules (BMIs) are very similar to static libraries from a build perspective:
1. They are specific to the compiler version (i.e. can only be used by the compiler binary compatible with the one which produced them)
2. If they depend on other modules, their BMIs need to be present too for successful build.
3. A number of compiler switches which were used to build the module should match the compiler switches for the source which uses this module.

So distribution of BMIs currently has limitations similar to those of distributing built static libraries:
* it has strict requirements on the versions of the compiler and of other libraries used
* it is limited to the platforms, architectures and #defines the BMI was built for.

BMI distribution definitely has a performance advantage for builds which meet all restrictions and requirements, i.e. the same builds which can use built static libraries.

If the BMI's restriction to a specific compiler and exact command line can be weakened somehow, or at least some data can be extracted from all BMIs, the performance advantage of BMI distribution can apply more widely.

Scenarios where extracting at least some data from BMIs is needed

VS IntelliSense (EDG)
Visual Studio and VS Code support not only MSVC, but also Clang and gcc.
VS uses the EDG compiler as its IntelliSense engine, which currently supports MSVC, Clang and gcc modes. As the performance of EDG compilation is critical, ideally EDG should be able to use modules already built by MSVC, Clang and gcc.

Linters (as-you-type code analysis) require additional data specified in the source (annotations, pragmas, attributes, contracts). Ideally, this information should always be present in the BMI, independent of whether the code has been compiled for analysis or code generation.
o Alternative: producing a new BMI for linters and IntelliSense, making it slower.

Note: MSVC will have an option to include the original input source (not TU-expanded) into the IFCs.

Clang-cl and clang-gcc
Currently, clang-cl is (almost) ABI compatible with cl. I believe the same is true for clang and gcc.
Should Clang-cl be able to use modules produced by MSVC and vice versa?
Should Clang-gcc be able to use modules produced by gcc and vice versa?

Build systems
If BMIs are distributed together with their sources (like modules for MS standard libs), build systems might want to check whether the available BMIs are actually compatible with the current build settings and, if not, produce a different BMI from the source.
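
A minimal sketch of such a check in Python, assuming the options used to build a BMI can be read back as a list of strings. Which options have to match is toolset-specific; the selection below (macro definitions and the language standard) and all function names are assumptions for illustration only.

```python
def significant_options(options):
    """Return the subset of options assumed to affect BMI compatibility.

    Assumption for illustration: only macro definitions (-D or /D) and the
    language standard (-std= or /std:) matter. Real toolsets have longer
    lists, and this ignores definitions passed as two separate arguments.
    """
    prefixes = ("-D", "/D", "-std=", "/std:")
    return frozenset(opt for opt in options if opt.startswith(prefixes))


def bmi_is_reusable(bmi_options, build_options):
    """True if the options recorded for the BMI match the current build."""
    return significant_options(bmi_options) == significant_options(build_options)


def choose_bmi(bmi_options, build_options):
    """Decide whether to reuse the shipped BMI or rebuild it from source."""
    return "reuse" if bmi_is_reusable(bmi_options, build_options) else "rebuild"


# Optimization level differs, but the options assumed to matter agree.
print(choose_bmi(["-std=c++20", "-DNDEBUG", "-O2"],
                 ["-O0", "-DNDEBUG", "-std=c++20"]))
```

Comparing sets rather than lists makes the check independent of option order, which usually does not matter for definitions of distinct macros.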

Static analysis (background code analysis, code analysis at build)
Static analysis often requires additional data computed from the source which is normally not stored in the BMI. Such additional information is produced by static analysis tools during a separate analysis phase of the module and needs to be stored into a different BMI file.
o Alternative 1: Always adding extra info which is not always needed is an unjustifiable performance expense.
o Alternative 2: Adding this information to already created BMI file creates build system complications.
The format of additional information is not defined at this point - the tools decide how to read and write the data. We recommend storing the data in a way that can be consumed by other tools and compilers.

Other scenarios?

BMI data to extract

At minimum, the following data should be extractable from any BMI

* Module source file name/location (as it was during the module build)
* Compilation options used to build BMI, especially the ones which have to match in the source using this module (#defines, etc.)
* Referenced modules and their BMIs (unless included into compilation options)
* Static analysis data
o Imported header units and the referenced header
o Additional information stored by static analysis tools
o Anything that is possible to extract from header files using a compiler frontend - i.e. types, symbols, ASTs, source information, annotations (declspecs), pragmas, attributes, etc (for consumption by code analysis)
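
To show how the "referenced modules" item could be used, here is a small Python sketch that walks the references of a module and reports which BMIs are not present. The inputs (a mapping from module name to imported module names, and a set of modules with an available BMI) are hypothetical stand-ins for whatever a vendor tool would actually report.

```python
def missing_bmis(root, references, available):
    """Return the modules reachable from root whose BMI is not available.

    references: dict mapping a module name to the names it imports
                (as it might be read back from each BMI; hypothetical).
    available:  set of module names for which a BMI is present.
    """
    missing, seen, stack = set(), set(), [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        if name not in available:
            missing.add(name)
            continue  # cannot read references out of a BMI we do not have
        stack.extend(references.get(name, ()))
    return missing


# Module names invented for the example.
refs = {"app": ["net", "util"], "net": ["util", "tls"], "util": []}
print(missing_bmis("app", refs, {"app", "net", "util"}))
```

A build system could run a check like this before deciding whether a distributed BMI is usable at all, since a BMI whose dependencies are absent cannot be consumed.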

Should we encourage the compiler vendors to provide a way to do this for the BMIs they produce?

_______________________________________________

Modules mailing list

Modules_at_[hidden]

Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules

Link to this post: http://lists.isocpp.org/modules/2019/05/0430.php

Received on 2019-06-11 18:32:18