Date: Fri, 3 Jun 2022 02:18:27 +0000
To add to Olga's excellent summary:
- MSVC looks at <header> and "header" as logical names of the headers, as written in the source code. For example, <vector> is not the same as <bar/vector> even if both might resolve to the same physical file found by the '#include' algorithm.
- MSVC looks at <drive:/absolute/path> or "drive:/absolute/path" as *hard coded* ID for the header unit.
MSVC recommends the standard notation of <header> or "header" as the preferred notation for headers (and it emits that in its BMI, the IFC file). That allows relocation and other forms of cloud builds where all that matters is what is written in the source code (for reproducibility), and not the exact location on the drive filesystem - imagine building in labs and distributing the result to consumers' machines, which differ from the fancy lab setup.
If you ask for what we (SG15) should recommend: the logical name as normally written in the input source file, NOT the physical location of the resolution of the logical header or header file.
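To make the logical-name model concrete, here is a minimal sketch in Python. It is not MSVC's implementation; the table `header_units`, the function `lookup_bmi`, and the .ifc file names are all hypothetical. The only point it illustrates is that the lookup key is the spelling in the source, so <vector> and <bar/vector> are distinct keys even if they would resolve to the same file.

```python
from typing import Optional

# Hypothetical table: logical name exactly as spelled in the source
# (including the <> or "" delimiters) mapped to a BMI file name.
header_units = {
    "<vector>": "vector.ifc",
    '"widget.h"': "widget.ifc",
}


def lookup_bmi(spelling: str) -> Optional[str]:
    """Match an import on its spelling alone; no file system access."""
    return header_units.get(spelling)


print(lookup_bmi("<vector>"))      # vector.ifc
print(lookup_bmi("<bar/vector>"))  # None: different logical name
```

Because nothing here touches the file system, the same table works unchanged on a machine where the headers live somewhere else.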
-- Gaby
-----Original Message-----
From: SG15 <sg15-bounces_at_[hidden]> On Behalf Of Olga Arkhipova via SG15
Sent: Thursday, June 2, 2022 6:58 PM
To: sg15_at_[hidden]; Ben Boeckel <ben.boeckel_at_[hidden]>
Cc: Olga Arkhipova <olgaark_at_[hidden]>
Subject: Re: [SG15] "logical name" of importable headers
>> My main question is on how the build system communicates to the compiler which importable headers exist:
Yes, we've struggled with this question too and came up with the following:
cl.exe /headerUnit switch has the following options
/headerUnit header-filename=ifc-filename
/headerUnit:quote [header-filename=ifc-filename]
/headerUnit:angle [header-filename=ifc-filename]
https://docs.microsoft.com/en-us/cpp/build/reference/headerunit?view=msvc-170
In other words, a header unit "logical name" can be a full path to the .h, or the <a/b/header.h> or "header.h" forms similar to the ones used in the code, but it is not required to match the code usage exactly (see below). The command line must contain all the -I options needed to find the imported .h, the same as for #include.
The compiler would resolve the imported .h using include path and do the same for /headerUnit <> and "" options to obtain full paths. Then it will use the full paths to match the import and the header unit specified on the command line.
In other words, the resolved header file path is used as a header unit ID.
As a file path is unique, there is no ambiguity. This also allows some flexibility in header unit "logical names" - as long as they resolve to the same path, the header unit BMI will be used. As symlinks are different file system entities, they obviously will not be matched to non-symlink locations. But I believe this is no different from header resolution today.
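The matching described above can be sketched roughly as follows, in Python. This is an illustration under assumptions, not the compiler's actual algorithm: `resolve` is a hypothetical helper that models only a simplified angle-bracket search over a list of include directories, and the directory layout is made up.

```python
import os
import tempfile


def resolve(name, include_dirs):
    """Return the first existing include_dir/name as an absolute path.

    A simplified stand-in for the #include <> search; no symlink handling.
    """
    for d in include_dirs:
        candidate = os.path.abspath(os.path.join(d, name))
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    inc = os.path.join(root, "include")
    os.makedirs(os.path.join(inc, "a", "b"))
    open(os.path.join(inc, "a", "b", "header.h"), "w").close()

    # Two -I style directories: the root and the header's own directory.
    include_dirs = [inc, os.path.join(inc, "a", "b")]

    # Header unit declared on the command line as <a/b/header.h> ...
    unit_id = resolve("a/b/header.h", include_dirs)
    # ... while the source spells the import as <header.h>.
    import_id = resolve("header.h", include_dirs)

    same_unit = unit_id == import_id
    print(same_unit)  # True: both spellings resolve to the same path
```

The two spellings differ, but the resolved path is identical, so under this model the same BMI would be used for both.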
This does require the .h file (and not only BMI) to be present on the machine (or rather file system) as well as a set of -I on the command line. But this is not different from today's headers usage and should not be a big problem.
MSVC will not create BMIs on its own and always requires them to be specified on the command line.
The build system knows which BMIs it needs to build from the following info:
- user directly specifying the headers to be built as header units
- scan data of the sources (if the build system supports automatic build of imported header units).
In the last case the build system will recursively scan all imported headers and use original source base compilation options for header units' creation if they don't already exist.
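As a rough illustration of the recursive scan, here is a Python sketch. It is not how any real build system or scanner works: a real scan has to run the preprocessor, while this one only pattern-matches `import <...>;` and `import "...";` lines in an in-memory table. The names `scan`, `IMPORT_RE`, and `sources`, and the file contents, are hypothetical.

```python
import re

# Matches lines such as:  import <a.h>;   or   import "b.h";
IMPORT_RE = re.compile(r'^\s*import\s+(<[^>]+>|"[^"]+")\s*;', re.MULTILINE)


def scan(name, sources, found=None):
    """Collect imported headers reachable from `name`, depth first."""
    if found is None:
        found = []
    for header in IMPORT_RE.findall(sources[name]):
        if header not in found:
            found.append(header)
            if header in sources:  # recurse into headers we have text for
                scan(header, sources, found)
    return found


# Hypothetical source texts, keyed by the name used to reach them.
sources = {
    "main.cpp": 'import <a.h>;\nimport "b.h";\n',
    "<a.h>": "import <c.h>;\n",
    '"b.h"': "import <c.h>;\n",
    "<c.h>": "",
}

needed = scan("main.cpp", sources)
print(needed)  # ['<a.h>', '<c.h>', '"b.h"']
```

Each entry in `needed` is a header unit the build system would have to build (or find prebuilt) before compiling main.cpp.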
So to use a prebuilt header unit from a library the following will be needed
- Directory of the header (or its parent dir) should be added to the include path (no different than today)
- The "logical name" of the header unit (in the lib's metadata) would be <header.h> or <a/b/header.h> - whatever allows it to be found in that directory. The full path can also be used if the library (and the header unit) is built on the user's machine.
I believe header units were designed to ease the transition from #includes to modules and from this perspective it is desirable to keep the resolution as similar as possible to what is used in #includes.
Thanks,
Olga
-----Original Message-----
From: SG15 <sg15-bounces_at_[hidden]> On Behalf Of Daniel Ruoso via SG15
Sent: Thursday, June 2, 2022 12:56
To: Ben Boeckel <ben.boeckel_at_[hidden]>
Cc: Daniel Ruoso <daniel_at_[hidden]>; Daniel Ruoso via SG15 <sg15_at_[hidden]>
Subject: Re: [SG15] "logical name" of importable headers
On Thu, Jun 2, 2022 at 14:32, Ben Boeckel <ben.boeckel_at_[hidden]> wrote:
> `BASE_DIRS` may not overlap, so each file has one and only one name
> relative to one of the base directories. This is what I think should
> be used for the name of any importable header.
That is a bit tangential to my question, IIUC.
My main question is on how the build system communicates to the compiler which importable headers exist:
Option 1: Name as it appears in the import statement, without the full path (i.e.: `<a/bad/name.h>`), which would mean that any import statement would consume the given header unit, regardless of what `-I` was given on the compiler command line (i.e.: filea.cpp works and imports bar's header unit), also assume that if the token after `#include` matches an importable header, it means the header unit, regardless of the `-I`.
Option 2: Option 1, but don't assume you can replace an `#include` by an `import`, since we don't actually have the path to the header file.
Option 3: Name it as the file logically formed by concatenating the `-I` with the `import` statement as well as ""-include rules (i.e.: `/opt/bb/include/bar/a/bad/name.h`), which means the compiler would need to resolve the path to a header that needs to be imported before matching it to the list of importable headers. This means that `import <bad/other.h>` and `import <a/bad/other.h>` would be equivalent, and so would be `import "other.h"` from the same directory. But `import <fun/a/bad/other.h>` would not work (for the case where `fun` is a symlink that is also in the `-I`).
Option 4: Option 3, but normalize the files with `realpath` or some other mechanism (e.g.: stat's device id + inode). This solves the problems with symlinks as well as the usage of `..` in the import or include, but it incurs a significant additional cost as canonicalizing all those files will potentially result in a very large number of system calls.
Option 5: Name it as a tuple of the name used in the import and the path where it was found (e.g.: `<a/bad/other.h>,/usr/include/bar`).
This means the compiler would still need to resolve the location of the imported header file, but the header unit would only be usable if it was imported as expected. This would also mean `import "other.h"` would not work unless it's explicitly declared that way, and it would be a separate header unit in that case.
Option 6: Option 5, but normalize the directories with `realpath` or some other mechanism (e.g.: stat's device id + inode). This would solve the problems with symlinks to the directories, as long as the import statement uses the same name.
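To compare how the path-based options treat symlinks, here is a small Python sketch. It is only an illustration: the `key_option*` functions are hypothetical, the layout is made up, and the symlink is placed on the include directory itself rather than inside the import spelling. It needs a platform where `os.symlink` is permitted.

```python
import os
import tempfile


def key_option3(include_dir, name):
    # Option 3: plain concatenation of -I and the name, no canonicalization.
    return os.path.join(include_dir, name)


def key_option4(include_dir, name):
    # Option 4: the same path, canonicalized with realpath.
    return os.path.realpath(os.path.join(include_dir, name))


def key_option5(include_dir, name):
    # Option 5: tuple of the name as spelled and the directory it was found in.
    return (name, include_dir)


with tempfile.TemporaryDirectory() as root:
    bar = os.path.join(root, "bar")
    os.makedirs(os.path.join(bar, "a", "bad"))
    open(os.path.join(bar, "a", "bad", "other.h"), "w").close()
    fun = os.path.join(root, "fun")
    os.symlink(bar, fun)  # fun is a symlink to bar

    via_bar = (bar, "a/bad/other.h")
    via_fun = (fun, "a/bad/other.h")

    opt3_equal = key_option3(*via_bar) == key_option3(*via_fun)
    opt4_equal = key_option4(*via_bar) == key_option4(*via_fun)
    opt5_equal = key_option5(*via_bar) == key_option5(*via_fun)
    print(opt3_equal, opt4_equal, opt5_equal)  # False True False
```

Only the realpath-based key treats the two routes as the same header, and each `realpath` call walks the path components on disk, which is where the extra system calls come from.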
None of those options seem like an obvious choice to me.
Option 1 would be the "cleanest", imho, but that is incredibly backwards-incompatible.
Option 2 would be a compromise on the backwards-incompatibility, but it would remove the "replace-include-by-import" optimization.
Option 4 would be the most backwards-compatible, but it's not clear to me that we want that much backwards compatibility for import statements, and it's likely very expensive.
Option 3 would remove the excessive cost of Option 4, but it would not be resilient to symlinks or the usage of `..`. The limits would show up as either failed imports or less clarity on when an include statement gets replaced by an import.
Options 5 and 6 are interesting compromise solutions, but they compromise a lot. They solve the semantic problem of not importing the header by the intended interface, but at the cost of the same header unit being translated many times, or of the import statement simply failing entirely. It's also not going to be clear to the user when an include would be replaced by an import.
I'm interested in hearing where folks stand on that.
I am, personally, partial to Option 2. I know it's a neat optimization, but it generates a *lot* of complexity.
Going with Option 2 also means we wouldn't need to provide the list of importable headers to the dependency scanning step, since the output would no longer depend on that list.
It would also be more clear to the users, since in that case `#include` never depends on the module mapping, and `import` never depends on `-I`.
daniel
_______________________________________________
SG15 mailing list
SG15_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/sg15
Received on 2022-06-03 02:18:36