ISOCPP std-proposals List: Re: [std-proposals] Revising #pragma once

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 1 Sep 2024 01:32:32 -0400

On 8/30/24 5:50 AM, Gašper Ažman wrote:
> If you knew up-front you wouldn't do it :).
>
> This happens though. There are people who generate code - whole
> include trees - and for plugins they end up looking very very similar
> or identical. And then as the build engineer my users complain that
> "their build doesn't work on the cloud build environment" and I have
> to somehow *find their bug*.
>
> Think about it. How the hell do you diagnose this silent noninclusion?

In my experience, very carefully, over a long period of time, with much
hair pulling and lots of WTFs.

I'm going to relate my most memorable experience with this feature
below, but I'll preface this by saying that the root cause in this case
was a bug in the implementation. Readers are free to say, yeah, bugs
happen, nothing to see here. and that is fair. The point is not (just)
to relay yet another story from the trenches, but to enforce Gašper's
point that debugging a case where #pragma once fails is about as hard as
debugging UB or a miscompilation.

I was a Coverity developer for ~10 years working on the Coverity C and
C++ front ends. Coverity is one of a number of tools in the unenviable
position of having to emulate the behaviors (intended and otherwise) of
many other compilers. The emulation requirements extend to compiler
extensions, implementation-defined behavior, and even unspecified
behavior. Coverity includes multiple C and C++ front end
implementations, one of which is chosen to use to emulate the behavior
of a given compiler. Exact matching of front end behavior for all
supported compilers is not economically feasible, so different behavior
becomes observable at times. Sometimes in surprising ways.

It was uncommon for me to get directly involved in debugging a problem
happening in a customer environment. A bug report came in that Coverity
was failing to successfully compile parts of the customer's code; errors
were being reported about undeclared names and failed overload
resolution. The customer was using MSVC which, of course, was having no
issues compiling their code. The first course of action in such cases is
to request preprocessed output produced by both the native compiler
(MSVC) and Coverity's front end (don't get me started about how much
harder C++ modules will make this first step if programmers start
actually using C++ modules, but I digress). The reported errors
reproduced easily using the Coverity produced preprocessed output.
Searching for the undeclared names from the error messages in the MSVC
preprocessed output made it clear early on that large chunks of the
source code were simply missing. Thanks to #line directives in the MSVC
preprocessed output, it wasn't too hard to correlate the missing content
with a particular source file and a request was sent to the customer to
share that file, which they did. The file used a proper header guard (it
did not use #pragma once), so I naturally went down the rabbit hole of
trying to figure out why the guard macro was apparently defined when
Coverity processed the file, but not when MSVC did. It isn't all that
uncommon for Coverity customers to have to implement workarounds to
address cases where Coverity doesn't exactly emulate some aspect of
another compiler, so I assumed that the customer had probably applied a
workaround for some other issue that was somehow now contributing to
this issue and I started asking those "what did you do?" questions in
the most polite corporate phrasing I could come up with.

I didn't get to blame the customer for doing something silly. They
weren't doing anything weird or unusual. I spent lots of time looking at
code in our front end trying to imagine how we could possibly be
processing a header guard incorrectly. The front end cached knowledge
about header guards and used that information to avoid re-tokenizing a
re-included source file. I wondered if, somehow, an invalid cache hit
was happening, but nothing seemed wrong. Header guards are not rocket
science.

I wondered if include paths were somehow inconsistent and if Coverity
was resolving the #include directive to a different file; perhaps an
empty file or one with a different header guard macro. No such luck. I
didn't get to blame the customer for having multiple copies of a file
with inconsistent contents in different locations.

It got worse. It turned out there wasn't a single file that was
problematic. On different runs, content would be missing for different
source files.

I started asking about the customer build environment. That was when we
got the first indication of where to look that was actually on the right
path. The customer was building source code hosted on a network storage
device. I wondered about some kind of weird filesystem corruption that
caused an empty file to sometimes be served instead of the actual file
contents. I hypothesized improbable bugs in which an empty file was only
produced with the file system APIs that Coverity used and not with the
ones that MSVC used. I asked the customer to verify that their network
storage device had all updates applied.

I tried replicating the problem in house using network storage devices
available within the company. No luck of course.

I eventually had to resort to sending the customer builds of Coverity
with extra debugging instrumentation to run in their environment. I'm
sure everyone reading this can appreciate that, when things get to the
point that you are asking customers to run one-off modified builds in
their production environment in order to figure out WTF is going wrong
with your software, that things are just not going well. I was still
focused on the header guard cache, so that is where I added some debug
instrumentation. The header guard cache naturally relied on recognition
of the same file being re-included and used inodes to do so. So I added
logging which, after several iterations of narrowing the logging, much
to my surprise, demonstrated that the same inode was being returned for
distinct files. My head exploded.

Remember when 640K was enough for everybody? These days, 64 bits isn't
enough for an inode. Coverity was using the Win32
GetFileInformationByHandle()
<https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfileinformationbyhandle>
function to retrieve file inodes (file IDs or indexes in the Win32
documentation). The inode/ID/index information is returned in the
BY_HANDLE_FILE_INFORMATION
<https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information>
structure; the combination of the dwVolumeSerialNumber, nFileIndexHigh,
and nFileIndexLow data members provide for unique identification for any
open file.

Pause here and read that last sentence again. For any *open* file. The
documentation isn't particularly explicit about this, but filesystems
are free to reuse file index values once the last handle for a file is
closed. In this case, the network storage device was apparently mapping
its 128 bit inodes to a pool of 64-bit inodes that it recycled as files
were closed. Coverity wasn't keeping previously included files open so
its cached inodes were effectively dangling pointers.

The fix was simple. We disabled use of inodes to recognize re-included
files (on Windows). We could have considered reworking the code to call
the newer Win32 GetFileInformationByHandleEx()
<https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getfileinformationbyhandleex>
function to obtain a 128 bit file ID via the FILE_ID_INFO
<https://learn.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_id_info>
structure. I don't know if that would have been any more reliable; this
function might also only guarantee stability of a file ID while a file
is open; the documentation isn't clear on this. Rewriting the code to
keep all files open for the duration of the compilation would have
required significant change and would have required raising limits on
number of open files.

MSVC didn't hit this problem because it relies on canonical path
comparisons rather than filesystem identity to recognize re-included
files. If I remember correctly, it is possible to defeat the MSVC
#pragma once implementation through use of sym links, hard links,
junction points, etc... However, you just end up with double inclusion
in that case; something that is much easier to diagnose than mysterious
never inclusion.

I was lucky that I was able to reference the MSVC preprocessed output in
order to determine that chunks of source code were missing. Comparison
with other implementations isn't necessarily an option for general
compiler users.

We were also lucky that we were able to just disable the broken inode
dependent code. Thankfully, header guards Just Work^TM without need for
fancy and complicated mechanisms. #pragma once can't.

Tom.

>
> On Fri, Aug 30, 2024 at 9:30 AM Marcin Jaczewski
> <marcinjaczewski86_at_[hidden]> wrote:
>
> pt., 30 sie 2024 o 01:20 Gašper Ažman via Std-Proposals
> <std-proposals_at_[hidden]> napisał(a):
> >
> > Darlings,
> >
> > byte-identical is just plain incorrect. Consider.
> >
> > [library1/public.hpp]
> > #pragma once
> > #include "utils.hpp"
> >
> > [library1/utils.hpp]
> > class lib1 {};
> >
> > library2/public.hpp
> > #pragma once
> > #include "utils.hpp"
> >
> > [library2/utils.hpp]
> > class lib2 {};
> >
> > [main.cpp]
> > #include "library1/public.hpp"
> > #include "library2/public.hpp" # boom, library2/utils.hpp does
> not get included
> >
> > same-contents also means same-relative-include-trees.
> Congratulations.
> >
>
> Question is do you know upfront that files are problematic?
> Then we could do something like this:
> a) we only consider normalized paths
> b) use have option to add path mappings that are used in path
> comparison
>
> Like:
>
> We have folders `libs/fooV1` and `libs/fooV2`. Both folders are
> symlinks
> but by default when compiler load `libs/fooV1/bar.h` and
> `libs/fooV2/bar.h`
> the compiler considers them as separate files even when they are
> the same file.
> Now we add compiler flag `-PI libs/fooV1:libs/fooV2` (similar to
> `-I`) and now
> when compiler load `libs/fooV1/foo.h` he consider it as if he load
> `libs/fooV2/bar.h` and thus next loading of `libs/fooV2/bar.h`
> will be blocked.
>
> And this should be ver dumb process it could be cases when
> `libs/fooV1/bar.h`
> and `libs/fooV2/bar.h` are unrelated but if you make specific
> maping it will
> override even diffrnet files. This could be useful to hacking some
> broken
> includes like:
>
> ```
> -PI hack/foo/bar.h:libs/fooV2/bar.h
> ```
>
> and we can in our files do:
>
> ```
> #include "hack/foo/bar.h"
> #include "libs/fooV2/foo.h" // it will ignore `#include "bar.h"`
> ```
>
> Could mechnism like this work on your build system?
>
> > On Fri, Aug 30, 2024 at 12:15 AM Jeremy Rifkin via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
> >>
> >> > In this very thread there are examples showing why taking
> only the content into account doesn't work but it was brushed off
> as "that can be fixed".
> >>
> >> I'm sorry you feel I have brushed any presented examples off. I
> have
> >> found them all immensely helpful for consideration. It's hard
> for me
> >> to imagine times when you'd want the same include-guarded content
> >> included twice, however I found the example of a "main header"
> >> compelling. The example of a header that only defines macros
> you undef
> >> is also not impractical.
> >>
> >> However, there are also compelling examples for filesystem
> identity.
> >> Mainly the multiple mount point issue.
> >>
> >> I think both can be reasonable, however, I have been trying to
> >> understand the most probable failure modes. While I originally
> >> proposed a content-based definition, I do think a filesystem-based
> >> definition is closer to current semantics and expectations.
> >>
> >> Jeremy
> >>
> >> On Thu, Aug 29, 2024 at 5:24 PM Breno Guimarães via Std-Proposals
> >> <std-proposals_at_[hidden]> wrote:
> >> >
> >> > To add to that, the whole idea is to standardize standard
> practice. If the first thing you do is to change spec to something
> else, then you're not standardizing standard practice, you are
> adding a new feature that inconveniently clashes with an existing one.
> >> >
> >> > In this very thread there are examples showing why taking
> only the content into account doesn't work but it was brushed off
> as "that can be fixed".
> >> >
> >> > None of this make sense to me.
> >> >
> >> >
> >> > Em qui., 29 de ago. de 2024 18:59, Tiago Freire via
> Std-Proposals <std-proposals_at_[hidden]> escreveu:
> >> >>
> >> >> Again, hashing content... totally unnecessary.
> >> >>
> >> >> There's no need to identify "same content" which as far as I
> can see can be defeated by modifications that don't change the
> interpretation, like spaces, which although not technically a
> violation of "same content" it clearly defeats the intent.
> >> >>
> >> >> An include summons a resource, a pragma once bars that
> resources from bey re-summonable. That's it. File paths should be
> more than enough.
> >> >>
> >> >> I'm unconvinced that the "bad cases" are not just a product
> of bad build architecture, if done properly a compiler should
> never be presented with multiple alternatives of the same file.
> And putting such requirements on compilers puts an unnecessary
> burden on developers to support a scenario that it is that is
> arguably bad practice.
> >> >>
> >> >> The argument is "prgma once" is supported everywhere it is
> good, we should make it official in the standard, effectively no
> change to a compiler should occur as a consequence.
> >> >> If a change needs to occur, then in fact "your version" of
> what you mean by "pragma once" is actually "not supported" by all
> the major compilers.
> >> >>
> >> >> Current compiler support of "pragma once" and it's usage on
> cross platform projects have a particular way of dealing with
> dependencies in mind. That workflow works. It's pointless to have
> this discussion if you don't understand that flow, and you
> shouldn't tailor the tool to a workflow that doesn't exist to the
> detriment of all.
> >> >>
> >> >>
> >> >> ________________________________
> >> >> From: Std-Proposals <std-proposals-bounces_at_[hidden]>
> on behalf of Jeremy Rifkin via Std-Proposals
> <std-proposals_at_[hidden]>
> >> >> Sent: Thursday, August 29, 2024 9:56:18 PM
> >> >> To: Tom Honermann <tom_at_[hidden]>
> >> >> Cc: Jeremy Rifkin <rifkin.jer_at_[hidden]>;
> std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> >> >> Subject: Re: [std-proposals] Revising #pragma once
> >> >>
> >> >> Performance should be fine if using a content definition. An
> implementation can do inode/path checks against files it already
> knows of, as a fast path. The first time a file is #included it’s
> just a hash+table lookup to decide whether to continue.
> >> >>
> >> >> Regarding the filesystem definition vs content definition
> question, while I think a content-based definition is robust I can
> see there is FUD about it and also an argument about current
> practice being a filesystem-based definition. It may just be best
> to approach this as filesystem uniqueness to the implementation’s
> ability, with a requirement that symbolic links/hard links are
> handled. This doesn’t cover the case of multiple mount points, but
> we’ve discussed that that’s impossible with #pragma once without
> using contents instead.
> >> >>
> >> >> Jeremy
> >> >>
> >> >> On Thu, Aug 29, 2024 at 13:06 Tom Honermann
> <tom_at_[hidden]> wrote:
> >> >>>
> >> >>> On 8/28/24 12:32 AM, Jeremy Rifkin via Std-Proposals wrote:
> >> >>>
> >> >>> Another question is whether the comparison should be post
> translation
> >> >>> phase 1.
> >> >>>
> >> >>> I gave this some thought while drafting the proposal. I
> think it comes
> >> >>> down to whether the intent is single inclusion of files or
> single
> >> >>> inclusion of contents.
> >> >>>
> >> >>> Indeed. The proposal currently favors the "same contents"
> approach and offers the following wording.
> >> >>>
> >> >>> A preprocessing directive of the form
> >> >>> # pragma once new-line
> >> >>> shall cause no subsequent #include directives to perform
> replacement for a file with text contents identical to this file.
> >> >>>
> >> >>> The wording will have to define what it means for contents
> to be identical. Options include:
> >> >>>
> >> >>> The files must be byte-for-byte identical. This makes
> source file encoding observable (which I would be strongly against).
> >> >>> The files must encode the same character sequence post
> translation phase 1. This makes comparisons potentially expensive.
> >> >>>
> >> >>> Note that the "same contents" approach obligates an
> implementation to consider every previously encountered file for
> every #include directive. An inode based optimization can help to
> determine if a file was previously encountered based on identity,
> but it doesn't help to reduce the costs when a file that was not
> previously seen is encountered.
> >> >>>
> >> >>>
> >> >>> Tom.
> >> >>>
> >> >>> Jeremy
> >> >>>
> >> >>> On Tue, Aug 27, 2024 at 3:39 PM Tom Honermann via Std-Proposals
> >> >>> <std-proposals_at_[hidden]> wrote:
> >> >>>
> >> >>> On 8/27/24 4:10 PM, Thiago Macieira via Std-Proposals wrote:
> >> >>>
> >> >>> On Tuesday 27 August 2024 12:35:17 GMT-7 Andrey Semashev
> via Std-Proposals
> >> >>> wrote:
> >> >>>
> >> >>> The fact that gcc took the approach to compare file
> contents I consider
> >> >>> a poor choice, and not an argument to standardize this
> implementation.
> >> >>>
> >> >>> Another question is whether a byte comparison of two files
> of the same size is
> >> >>> expensive for compilers.
> >> >>>
> >> >>> #once ID doesn't need to compare the entire file.
> >> >>>
> >> >>> Another question is whether the comparison should be post
> translation
> >> >>> phase 1. In other words, whether differently encoded source
> files that
> >> >>> decode to the same sequence of code points are considered
> the same file
> >> >>> (e.g., a Windows-1252 version and a UTF-8 version).
> Standard C++ does
> >> >>> not currently allow source file encoding to be observable
> but a #pragma
> >> >>> once implementation that only compares bytes would make
> such differences
> >> >>> observable.
> >> >>>
> >> >>> Tom.
> >> >>>
> >> >>> --
> >> >>> Std-Proposals mailing list
> >> >>> Std-Proposals_at_[hidden]
> >> >>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >> >>
> >> >>
> >> >> --
> >> >> Std-Proposals mailing list
> >> >> Std-Proposals_at_[hidden]
> >> >> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >> >
> >> > --
> >> > Std-Proposals mailing list
> >> > Std-Proposals_at_[hidden]
> >> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >> --
> >> Std-Proposals mailing list
> >> Std-Proposals_at_[hidden]
> >> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> >
> > --
> > Std-Proposals mailing list
> > Std-Proposals_at_[hidden]
> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>

Received on 2024-09-01 05:32:41