Date: Wed, 27 Oct 2021 20:14:59 +0000
I tried to write something quick, as CppCon is going on, but this conversation is making me want to work some of this into my talk on Friday.
René’s answer below resonates with my experience on Spack (https://github.com/spack/spack), and also with what I know of other projects like Spack (Nix, Guix, Julia’s Yggdrasil and BinaryBuilder, conda-forge, and to some extent Conan and vcpkg).
In Spack, we represent package metadata for on-disk installations and for binary packages with a DAG that includes the nearly complete provenance of the build. Each node in the DAG has:
* Name of the package
* Version (or, recently, git commit)
* Microarchitecture target
* This allows us to know on what types of systems we can use the binary. e.g. you can’t run an Icelake binary on haswell (if it uses all the vector instructions).
* uarch names are, e.g., haswell/skylake_avx512/power9/etc. — NOT just x86_64 or ppc64le.
* We use names from archspec: https://github.com/archspec/archspec, which also maps uarch features and compiler flags for various compilers to the targets, and tracks compatibility.
* See the mapping (it’s really a DAG of its own, defined in json) here: https://github.com/archspec/archspec-json/blob/master/cpu/microarchitectures.json
* The library lets us make comparisons like cannonlake > haswell (True)
* Paper here if you are interested: https://tgamblin.github.io/pubs/archspec-canopie-hpc-2020.pdf
* The operating system the thing was built for
* Right now, this is really a proxy for system glibc version.
* If we had more detailed provenance about libraries we could likely omit it.
* Compiler used to build
* Name of compiler
* Version of compiler
* Compiler flags
* We currently store compiler, linker, and cpp flags (except the ones that pick a target — those are assumed to be handled by the uarch target above)
* Build parameters
* These are specific to the package and we don’t have a common schema for most of them
* They can be single- or multi-valued
* SHAs of any patches applied to the source
* The hash of the package recipe used to build this package
* All this information about all transitive dependencies
* The dependencies are labeled by type: build, link, or run
* Link dependencies tell you what needs to be re-RPATH’d, stuck in LD_LIBRARY_PATH, etc. in order to run
* Run dependencies tell you what needs to be present at runtime and put in PATH
* Build dependencies are like run dependencies but for build time.
* A hash of the full DAG (merkle hash of all the data above)
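To make the microarchitecture comparison above concrete, here is a minimal sketch of a compatibility check over a target DAG. The table is hand-written and simplified for illustration (it skips intermediate generations and is not copied from archspec), and the names `PARENTS`, `ancestors`, and `can_run` are assumptions, not archspec's API.

```python
# Hypothetical, simplified subset of a microarchitecture DAG.
# Each target lists its direct parents: older targets whose binaries it can run.
PARENTS = {
    "x86_64": [],
    "haswell": ["x86_64"],
    "broadwell": ["haswell"],
    "skylake": ["broadwell"],
    "cannonlake": ["skylake"],
    "ppc64le": [],
    "power9": ["ppc64le"],
}


def ancestors(target):
    """Return every target reachable by following parent edges."""
    seen = set()
    stack = list(PARENTS[target])
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(PARENTS[name])
    return seen


def can_run(host, binary_target):
    """A host can run a binary built for itself or for any of its ancestors."""
    return binary_target == host or binary_target in ancestors(host)


print(can_run("cannonlake", "haswell"))  # True
print(can_run("haswell", "cannonlake"))  # False
print(can_run("power9", "haswell"))      # False
```

The relation is only a partial order: power9 and haswell are simply not comparable, which is why a plain string like x86_64 or ppc64le is not enough on its own.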
This is what we have right now — it goes in a canonical JSON format and you can use the hash of this metadata to identify a binary. We use hashes to identify things instead of some combinatorial name because every such name I’ve ever seen eventually runs out of room for new options. The format is designed to be extensible — you can add fields and get new hashes for your new artifacts, and they don’t conflict with old ones (if you chose a good hash). This is inspired by systems like Nix — more on the origins of the scheme is in this paper: https://tgamblin.github.io/pubs/spack-sc15.pdf, but we really should update that — a lot has changed since then.
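As a rough illustration of the hashing idea, the sketch below computes a Merkle-style hash over a canonical JSON rendering of a node. It is not Spack's actual format or hash function: the field names, the use of SHA-256, and the helpers `node_hash` and `make_app` are all assumptions made for the example.

```python
import hashlib
import json


def node_hash(node):
    """Hash a node's own fields together with the hashes of its dependencies."""
    record = {k: v for k, v in node.items() if k != "dependencies"}
    # Replace each dependency by (type, hash) so the parent hash covers the
    # whole sub-DAG without embedding it.
    record["dependencies"] = sorted(
        (dep_type, node_hash(dep)) for dep_type, dep in node.get("dependencies", [])
    )
    # Sorted keys and fixed separators give one canonical byte string per node.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_app(zlib_version):
    """Build a tiny two-node DAG: app --link--> zlib (hypothetical values)."""
    zlib = {
        "name": "zlib",
        "version": zlib_version,
        "target": "haswell",
        "compiler": {"name": "gcc", "version": "10.3.0"},
        "dependencies": [],
    }
    return {
        "name": "app",
        "version": "1.0",
        "target": "haswell",
        "compiler": {"name": "gcc", "version": "10.3.0"},
        "dependencies": [("link", zlib)],
    }


same = node_hash(make_app("1.2.11")) == node_hash(make_app("1.2.11"))
changed = node_hash(make_app("1.2.11")) != node_hash(make_app("1.2.12"))
print(same)     # True
print(changed)  # True
```

Because a parent's hash includes its dependencies' hashes, changing anything in a transitive dependency changes every hash above it, and adding a new field to a node yields a new hash without disturbing existing ones.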
Even with all this information, we do not (yet) have all the information we would like for ABI compatibility. The most glaring issues at the moment are:
* Compiler provenance
* Compilers are really a build dependency that *imposes* specific link dependencies (runtime libraries) on the build.
* They don’t always impose the same runtime libs — consider:
* The intel compilers and clang can depend on gcc for their C++ runtime libraries.
* Both of these can also implicitly add things like OpenMP runtime libraries to the build.
* So, the compiler is really itself a package that can have all the provenance above, and that can add link dependencies to yet more packages.
* OS provenance
* We would really like to do a better job of representing things like glibc version in the DAG, and ultimately omit the OS field by replacing it with more detailed information about implicit runtime libs.
* This is an ongoing thing, and modeling libc is hard, so the OS is a decent proxy for now.
* Systems like Guix and Nix build from libc up, so they don’t really have this problem — their binaries are self-contained so in some sense they solve a simpler problem.
* Naming
* Our model currently assumes that the package name is a suitable key for nodes within it.
* That’s not true if you, say, have two packages that build with different versions of CMake
We are, this fall-ish, making a few changes:
* Moving to a model where we will treat compilers as dependencies and add nodes to the DAG for runtime libraries — this handles things like libstdc++ versions, as well as different builds/versions of the same compiler.
* We’re moving to a model where we don’t key things by name — the hash is used as the key in the final metadata.
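A small sketch of why keying by name breaks down, and how keying by hash avoids it. The node contents, version numbers, and the `spec_hash` helper are hypothetical and much simpler than real metadata.

```python
import hashlib
import json


def spec_hash(spec):
    """Hash a node via a canonical JSON rendering (illustrative only)."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two packages in one DAG were built with different CMake versions.
nodes = [
    {"name": "cmake", "version": "3.20.5"},
    {"name": "cmake", "version": "3.21.3"},
    {"name": "zlib", "version": "1.2.11"},
]

by_name = {n["name"]: n for n in nodes}     # the second cmake overwrites the first
by_hash = {spec_hash(n): n for n in nodes}  # both cmake nodes survive

print(len(by_name))  # 2
print(len(by_hash))  # 3
```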
All of this metadata is used as input to Spack’s dependency solver. That used to be an ad-hoc thing, but we are now using Clingo (it looks like prolog but boils down to CDCL SAT + optimization — see https://github.com/spack/spack/blob/develop/lib/spack/spack/solver/concretize.lp for the logic program we currently use to manage these constraints).
There is an example of how we’ve architected a solver that can process metadata for lots of installed packages here:
https://github.com/spack/spack/pull/25310
The cool thing about this is that you could express constraints on matching semantics for flags, etc. e.g., you could say that certain flags are ABI-breaking, and that you require them to match across packages, but not others.
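To illustrate the flag-matching idea, here is a plain-Python stand-in for that kind of constraint. It is not the Clingo encoding Spack uses. The set of flags treated as ABI-breaking is an assumption chosen for the example, as are the package names and the helpers `abi_flags` and `abi_mismatches`.

```python
from itertools import combinations

# Assumption: flags we choose to treat as ABI-breaking for this example.
ABI_BREAKING = {"-D_GLIBCXX_USE_CXX11_ABI=0", "-fshort-enums"}


def abi_flags(pkg):
    """Return only the flags of a package that must match across the DAG."""
    return frozenset(pkg["flags"]) & ABI_BREAKING


def abi_mismatches(packages):
    """List pairs of packages whose ABI-breaking flags differ."""
    return [
        (a["name"], b["name"])
        for a, b in combinations(packages, 2)
        if abi_flags(a) != abi_flags(b)
    ]


packages = [
    {"name": "app", "flags": ["-O2", "-g"]},
    {"name": "libfoo", "flags": ["-O3"]},
    {"name": "libbar", "flags": ["-O2", "-D_GLIBCXX_USE_CXX11_ABI=0"]},
]

print(abi_mismatches(packages))  # [('app', 'libbar'), ('libfoo', 'libbar')]
```

Differences in optimization level between app and libfoo are ignored, while the one flag declared ABI-breaking makes libbar incompatible with both.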
I think this is the kind of use case motivating the metadata we’re talking about here — what exactly needs to be checked to ensure that what you’re consuming is compatible with what you are building.
I could say more — particularly around bootstrapping and cross-compiling, but I think this probably should give folks a feeling for what needs to be in the ultimate specification.
-Todd
On Oct 27, 2021, at 11:54 AM, René Ferdinand Rivera Morell via SG15 <sg15_at_[hidden]<mailto:sg15_at_[hidden]>> wrote:
Those are all good starts. But the short answer, since I don't have time ATM, for what you need to communicate to produce and consume prebuilt libraries are:
* All the inputs that went into producing the library. This includes what's been mentioned and also the environment (yes, as in ENV vars and everything else) they were produced in that might affect the result.
* All the usage requirements for the library.
* The entire n-dimensional matrix of compiler flag ABI, and general compatibility for the compilers in question.
It should be obvious that this is an NP-complete problem, and that we will only get partial, good-enough solutions to it, just as we do when dealing with ABI.
On Wed, Oct 27, 2021 at 1:38 PM Steve Downey via SG15 <sg15_at_[hidden]<mailto:sg15_at_[hidden]>> wrote:
Include directory(ies) [-I flags], library directories [-L flags], library names [-l flags] . Required preprocessor defines to be consistent with the as-built binary.
What C++ features were used in the pre-built library, particularly needed if we share dependencies. Library polyfills and alternatives are a problem here. For example, boost.json can use std:: types if they are available, or use boost components. A consumer needs to match. The feature test macros might be a tool to expose those.
Names of dependencies to get from the package manager, so I can get their flags and usage.
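A minimal sketch of how the items in this list could be turned into command-line flags, including flags pulled in from named dependencies. The metadata layout, the paths, the define name, and the `usage_flags` helper are all hypothetical.

```python
def usage_flags(name, packages, seen=None):
    """Collect compile and link flags for a package and its dependencies."""
    if seen is None:
        seen = set()
    if name in seen:  # visit each package once, even in a diamond
        return [], []
    seen.add(name)

    pkg = packages[name]
    compile_flags = ["-I" + d for d in pkg.get("include_dirs", [])]
    compile_flags += ["-D" + d for d in pkg.get("defines", [])]
    link_flags = ["-L" + d for d in pkg.get("lib_dirs", [])]
    link_flags += ["-l" + lib for lib in pkg.get("libs", [])]

    # A package's own libraries come before those of its dependencies,
    # which is the order static linking usually needs.
    for dep in pkg.get("requires", []):
        dep_compile, dep_link = usage_flags(dep, packages, seen)
        compile_flags += dep_compile
        link_flags += dep_link
    return compile_flags, link_flags


packages = {
    "mylib": {
        "include_dirs": ["/opt/mylib/include"],
        "defines": ["MYLIB_USE_STD_STRING_VIEW"],  # must match the as-built binary
        "lib_dirs": ["/opt/mylib/lib"],
        "libs": ["mylib"],
        "requires": ["zlib"],
    },
    "zlib": {
        "include_dirs": ["/opt/zlib/include"],
        "lib_dirs": ["/opt/zlib/lib"],
        "libs": ["z"],
    },
}

compile_flags, link_flags = usage_flags("mylib", packages)
print(compile_flags)
# ['-I/opt/mylib/include', '-DMYLIB_USE_STD_STRING_VIEW', '-I/opt/zlib/include']
print(link_flags)
# ['-L/opt/mylib/lib', '-lmylib', '-L/opt/zlib/lib', '-lz']
```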
On Wed, Oct 27, 2021 at 2:06 PM Ben Craig via SG15 <sg15_at_[hidden]<mailto:sg15_at_[hidden]>> wrote:
How do I invoke your code generator (like with gRPC, Apache Thrift, Microsoft RPC, Qt)? How do I deal with dependencies with your code generator?
What preprocessor flags and linker flags do I need to set to consume your library? What configuration options are there?
What restrictions are your libraries placing on my binary? For example, am I required to use a static CRT to consume your library? A particular compiler version?
From: SG15 <sg15-bounces_at_[hidden]<mailto:sg15-bounces_at_[hidden]>> On Behalf Of Daniel Ruoso via SG15
Sent: Wednesday, October 27, 2021 11:09 AM
To: sg15_at_[hidden]cpp.org<mailto:sg15_at_[hidden]>
Cc: Daniel Ruoso <daniel_at_[hidden]<mailto:daniel_at_[hidden]>>
Subject: [EXTERNAL] [SG15] RFC: Requirements to consume a prebuilt library from arbitrary build systems
Hello,
Thanks to the opportunity of being on site for cppcon (yay), me, Bret, Gabriel and Cameron had an opportunity to talk about the direction we're going for how to distribute libraries with C++ modules.
In that conversation we got to an agreement that we should also explore the possibility of making this a solution that is not exclusive for modules, and take a step back for a more generic solution.
I'm volunteering to drive the coordination work for a paper framing the requirements for a specification that would allow package managers and build systems to communicate on how to consume a prebuilt library.
I should clarify up front that the goal here is not to specify the format of library packages, the mechanism to resolve versions and how to fetch the package. In fact, the goal is that this should be something that fills existing gaps in the interactions between different package managers and different build systems, hopefully allowing more interoperability amongst the players that are already in this space.
So, I'm starting this effort with a request for comments on:
What are things that your build system needs to know when consuming a prebuilt library? What are the non-obvious cases? What are the architecture-specific details?
daniel
_______________________________________________
SG15 mailing list
SG15_at_[hidden]<mailto:SG15_at_lists.isocpp.org>
https://lists.isocpp.org/mailman/listinfo.cgi/sg15
--
-- René Ferdinand Rivera Morell
-- Don't Assume Anything -- No Supone Nada
-- Robot Dreams - http://robot-dreams.net
Received on 2021-10-27 15:15:21