ISOCPP std-proposals List: Re: [std-proposals] Revising #pragma once

From: Gašper Ažman <gasper.azman_at_[hidden]>
Date: Sun, 1 Sep 2024 14:10:55 +0100

That's an incredible case study. Thank you Tom.

On Sun, Sep 1, 2024, 06:32 Tom Honermann <tom_at_[hidden]> wrote:

> On 8/30/24 5:50 AM, Gašper Ažman wrote:
>
> If you knew up-front you wouldn't do it :).
>
> This happens though. There are people who generate code - whole include
> trees - and for plugins they end up looking very very similar or identical.
> And then as the build engineer my users complain that "their build doesn't
> work on the cloud build environment" and I have to somehow *find their bug*.
>
> Think about it. How the hell do you diagnose this silent noninclusion?
>
> In my experience, very carefully, over a long period of time, with much
> hair pulling and lots of WTFs.
>
> I'm going to relate my most memorable experience with this feature below,
> but I'll preface this by saying that the root cause in this case was a bug
> in the implementation. Readers are free to say, yeah, bugs happen, nothing
> to see here. and that is fair. The point is not (just) to relay yet another
> story from the trenches, but to enforce Gašper's point that debugging a
> case where #pragma once fails is about as hard as debugging UB or a
> miscompilation.
>
> I was a Coverity developer for ~10 years working on the Coverity C and C++
> front ends. Coverity is one of a number of tools in the unenviable position
> of having to emulate the behaviors (intended and otherwise) of many other
> compilers. The emulation requirements extend to compiler extensions,
> implementation-defined behavior, and even unspecified behavior. Coverity
> includes multiple C and C++ front end implementations, one of which is
> chosen to use to emulate the behavior of a given compiler. Exact matching
> of front end behavior for all supported compilers is not economically
> feasible, so different behavior becomes observable at times. Sometimes in
> surprising ways.
>
> It was uncommon for me to get directly involved in debugging a problem
> happening in a customer environment. A bug report came in that Coverity was
> failing to successfully compile parts of the customer's code; errors were
> being reported about undeclared names and failed overload resolution. The
> customer was using MSVC which, of course, was having no issues compiling
> their code. The first course of action in such cases is to request
> preprocessed output produced by both the native compiler (MSVC) and
> Coverity's front end (don't get me started about how much harder C++
> modules will make this first step if programmers start actually using C++
> modules, but I digress). The reported errors reproduced easily using the
> Coverity produced preprocessed output. Searching for the undeclared names
> from the error messages in the MSVC preprocessed output made it clear early
> on that large chunks of the source code were simply missing. Thanks to
> #line directives in the MSVC preprocessed output, it wasn't too hard to
> correlate the missing content with a particular source file and a request
> was sent to the customer to share that file, which they did. The file used
> a proper header guard (it did not use #pragma once), so I naturally went
> down the rabbit hole of trying to figure out why the guard macro was
> apparently defined when Coverity processed the file, but not when MSVC did.
> It isn't all that uncommon for Coverity customers to have to implement
> workarounds to address cases where Coverity doesn't exactly emulate some
> aspect of another compiler, so I assumed that the customer had probably
> applied a workaround for some other issue that was somehow now contributing
> to this issue and I started asking those "what did you do?" questions in
> the most polite corporate phrasing I could come up with.
>
> I didn't get to blame the customer for doing something silly. They weren't
> doing anything weird or unusual. I spent lots of time looking at code in
> our front end trying to imagine how we could possibly be processing a
> header guard incorrectly. The front end cached knowledge about header
> guards and used that information to avoid re-tokenizing a re-included
> source file. I wondered if, somehow, an invalid cache hit was happening,
> but nothing seemed wrong. Header guards are not rocket science.
>
> I wondered if include paths were somehow inconsistent and if Coverity was
> resolving the #include directive to a different file; perhaps an empty
> file or one with a different header guard macro. No such luck. I didn't get
> to blame the customer for having multiple copies of a file with
> inconsistent contents in different locations.
>
> It got worse. It turned out there wasn't a single file that was
> problematic. On different runs, content would be missing for different
> source files.
>
> I started asking about the customer build environment. That was when we
> got the first indication of where to look that was actually on the right
> path. The customer was building source code hosted on a network storage
> device. I wondered about some kind of weird filesystem corruption that
> caused an empty file to sometimes be served instead of the actual file
> contents. I hypothesized improbable bugs in which an empty file was only
> produced with the file system APIs that Coverity used and not with the ones
> that MSVC used. I asked the customer to verify that their network storage
> device had all updates applied.
>
> I tried replicating the problem in house using network storage devices
> available within the company. No luck of course.
>
> I eventually had to resort to sending the customer builds of Coverity with
> extra debugging instrumentation to run in their environment. I'm sure
> everyone reading this can appreciate that, when things get to the point
> that you are asking customers to run one-off modified builds in their
> production environment in order to figure out WTF is going wrong with your
> software, that things are just not going well. I was still focused on the
> header guard cache, so that is where I added some debug instrumentation.
> The header guard cache naturally relied on recognition of the same file
> being re-included and used inodes to do so. So I added logging which, after
> several iterations of narrowing the logging, much to my surprise,
> demonstrated that the same inode was being returned for distinct files. My
> head exploded.
>
> Remember when 640K was enough for everybody? These days, 64 bits isn't
> enough for an inode. Coverity was using the Win32
> GetFileInformationByHandle()
> <https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfileinformationbyhandle>
> function to retrieve file inodes (file IDs or indexes in the Win32
> documentation). The inode/ID/index information is returned in the
> BY_HANDLE_FILE_INFORMATION
> <https://learn.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information>
> structure; the combination of the dwVolumeSerialNumber, nFileIndexHigh,
> and nFileIndexLow data members provide for unique identification for any
> open file.
>
> Pause here and read that last sentence again. For any *open* file. The
> documentation isn't particularly explicit about this, but filesystems are
> free to reuse file index values once the last handle for a file is closed.
> In this case, the network storage device was apparently mapping its 128 bit
> inodes to a pool of 64-bit inodes that it recycled as files were closed.
> Coverity wasn't keeping previously included files open so its cached inodes
> were effectively dangling pointers.
>
> The fix was simple. We disabled use of inodes to recognize re-included
> files (on Windows). We could have considered reworking the code to call the
> newer Win32 GetFileInformationByHandleEx()
> <https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getfileinformationbyhandleex>
> function to obtain a 128 bit file ID via the FILE_ID_INFO
> <https://learn.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_id_info>
> structure. I don't know if that would have been any more reliable; this
> function might also only guarantee stability of a file ID while a file is
> open; the documentation isn't clear on this. Rewriting the code to keep all
> files open for the duration of the compilation would have required
> significant change and would have required raising limits on number of open
> files.
>
> MSVC didn't hit this problem because it relies on canonical path
> comparisons rather than filesystem identity to recognize re-included files.
> If I remember correctly, it is possible to defeat the MSVC #pragma once
> implementation through use of sym links, hard links, junction points,
> etc... However, you just end up with double inclusion in that case;
> something that is much easier to diagnose than mysterious never inclusion.
>
> I was lucky that I was able to reference the MSVC preprocessed output in
> order to determine that chunks of source code were missing. Comparison with
> other implementations isn't necessarily an option for general compiler
> users.
>
> We were also lucky that we were able to just disable the broken inode
> dependent code. Thankfully, header guards Just WorkTM without need for
> fancy and complicated mechanisms. #pragma once can't.
>
> Tom.
>
>
> On Fri, Aug 30, 2024 at 9:30 AM Marcin Jaczewski <
> marcinjaczewski86_at_[hidden]> wrote:
>
>> pt., 30 sie 2024 o 01:20 Gašper Ažman via Std-Proposals
>> <std-proposals_at_[hidden]> napisał(a):
>> >
>> > Darlings,
>> >
>> > byte-identical is just plain incorrect. Consider.
>> >
>> > [library1/public.hpp]
>> > #pragma once
>> > #include "utils.hpp"
>> >
>> > [library1/utils.hpp]
>> > class lib1 {};
>> >
>> > library2/public.hpp
>> > #pragma once
>> > #include "utils.hpp"
>> >
>> > [library2/utils.hpp]
>> > class lib2 {};
>> >
>> > [main.cpp]
>> > #include "library1/public.hpp"
>> > #include "library2/public.hpp" # boom, library2/utils.hpp does not get
>> included
>> >
>> > same-contents also means same-relative-include-trees. Congratulations.
>> >
>>
>> Question is do you know upfront that files are problematic?
>> Then we could do something like this:
>> a) we only consider normalized paths
>> b) use have option to add path mappings that are used in path comparison
>>
>> Like:
>>
>> We have folders `libs/fooV1` and `libs/fooV2`. Both folders are symlinks
>> but by default when compiler load `libs/fooV1/bar.h` and
>> `libs/fooV2/bar.h`
>> the compiler considers them as separate files even when they are the same
>> file.
>> Now we add compiler flag `-PI libs/fooV1:libs/fooV2` (similar to `-I`)
>> and now
>> when compiler load `libs/fooV1/foo.h` he consider it as if he load
>> `libs/fooV2/bar.h` and thus next loading of `libs/fooV2/bar.h` will be
>> blocked.
>>
>> And this should be ver dumb process it could be cases when
>> `libs/fooV1/bar.h`
>> and `libs/fooV2/bar.h` are unrelated but if you make specific maping it
>> will
>> override even diffrnet files. This could be useful to hacking some broken
>> includes like:
>>
>> ```
>> -PI hack/foo/bar.h:libs/fooV2/bar.h
>> ```
>>
>> and we can in our files do:
>>
>> ```
>> #include "hack/foo/bar.h"
>> #include "libs/fooV2/foo.h" // it will ignore `#include "bar.h"`
>> ```
>>
>> Could mechnism like this work on your build system?
>>
>> > On Fri, Aug 30, 2024 at 12:15 AM Jeremy Rifkin via Std-Proposals <
>> std-proposals_at_[hidden]> wrote:
>> >>
>> >> > In this very thread there are examples showing why taking only the
>> content into account doesn't work but it was brushed off as "that can be
>> fixed".
>> >>
>> >> I'm sorry you feel I have brushed any presented examples off. I have
>> >> found them all immensely helpful for consideration. It's hard for me
>> >> to imagine times when you'd want the same include-guarded content
>> >> included twice, however I found the example of a "main header"
>> >> compelling. The example of a header that only defines macros you undef
>> >> is also not impractical.
>> >>
>> >> However, there are also compelling examples for filesystem identity.
>> >> Mainly the multiple mount point issue.
>> >>
>> >> I think both can be reasonable, however, I have been trying to
>> >> understand the most probable failure modes. While I originally
>> >> proposed a content-based definition, I do think a filesystem-based
>> >> definition is closer to current semantics and expectations.
>> >>
>> >> Jeremy
>> >>
>> >> On Thu, Aug 29, 2024 at 5:24 PM Breno Guimarães via Std-Proposals
>> >> <std-proposals_at_[hidden]> wrote:
>> >> >
>> >> > To add to that, the whole idea is to standardize standard practice.
>> If the first thing you do is to change spec to something else, then you're
>> not standardizing standard practice, you are adding a new feature that
>> inconveniently clashes with an existing one.
>> >> >
>> >> > In this very thread there are examples showing why taking only the
>> content into account doesn't work but it was brushed off as "that can be
>> fixed".
>> >> >
>> >> > None of this make sense to me.
>> >> >
>> >> >
>> >> > Em qui., 29 de ago. de 2024 18:59, Tiago Freire via Std-Proposals <
>> std-proposals_at_[hidden]> escreveu:
>> >> >>
>> >> >> Again, hashing content... totally unnecessary.
>> >> >>
>> >> >> There's no need to identify "same content" which as far as I can
>> see can be defeated by modifications that don't change the interpretation,
>> like spaces, which although not technically a violation of "same content"
>> it clearly defeats the intent.
>> >> >>
>> >> >> An include summons a resource, a pragma once bars that resources
>> from bey re-summonable. That's it. File paths should be more than enough.
>> >> >>
>> >> >> I'm unconvinced that the "bad cases" are not just a product of bad
>> build architecture, if done properly a compiler should never be presented
>> with multiple alternatives of the same file. And putting such requirements
>> on compilers puts an unnecessary burden on developers to support a scenario
>> that it is that is arguably bad practice.
>> >> >>
>> >> >> The argument is "prgma once" is supported everywhere it is good, we
>> should make it official in the standard, effectively no change to a
>> compiler should occur as a consequence.
>> >> >> If a change needs to occur, then in fact "your version" of what you
>> mean by "pragma once" is actually "not supported" by all the major
>> compilers.
>> >> >>
>> >> >> Current compiler support of "pragma once" and it's usage on cross
>> platform projects have a particular way of dealing with dependencies in
>> mind. That workflow works. It's pointless to have this discussion if you
>> don't understand that flow, and you shouldn't tailor the tool to a workflow
>> that doesn't exist to the detriment of all.
>> >> >>
>> >> >>
>> >> >> ________________________________
>> >> >> From: Std-Proposals <std-proposals-bounces_at_[hidden]> on
>> behalf of Jeremy Rifkin via Std-Proposals <std-proposals_at_[hidden]
>> >
>> >> >> Sent: Thursday, August 29, 2024 9:56:18 PM
>> >> >> To: Tom Honermann <tom_at_[hidden]>
>> >> >> Cc: Jeremy Rifkin <rifkin.jer_at_[hidden]>;
>> std-proposals_at_[hidden] <std-proposals_at_[hidden]>
>> >> >> Subject: Re: [std-proposals] Revising #pragma once
>> >> >>
>> >> >> Performance should be fine if using a content definition. An
>> implementation can do inode/path checks against files it already knows of,
>> as a fast path. The first time a file is #included it’s just a hash+table
>> lookup to decide whether to continue.
>> >> >>
>> >> >> Regarding the filesystem definition vs content definition question,
>> while I think a content-based definition is robust I can see there is FUD
>> about it and also an argument about current practice being a
>> filesystem-based definition. It may just be best to approach this as
>> filesystem uniqueness to the implementation’s ability, with a requirement
>> that symbolic links/hard links are handled. This doesn’t cover the case of
>> multiple mount points, but we’ve discussed that that’s impossible with
>> #pragma once without using contents instead.
>> >> >>
>> >> >> Jeremy
>> >> >>
>> >> >> On Thu, Aug 29, 2024 at 13:06 Tom Honermann <tom_at_[hidden]>
>> wrote:
>> >> >>>
>> >> >>> On 8/28/24 12:32 AM, Jeremy Rifkin via Std-Proposals wrote:
>> >> >>>
>> >> >>> Another question is whether the comparison should be post
>> translation
>> >> >>> phase 1.
>> >> >>>
>> >> >>> I gave this some thought while drafting the proposal. I think it
>> comes
>> >> >>> down to whether the intent is single inclusion of files or single
>> >> >>> inclusion of contents.
>> >> >>>
>> >> >>> Indeed. The proposal currently favors the "same contents" approach
>> and offers the following wording.
>> >> >>>
>> >> >>> A preprocessing directive of the form
>> >> >>> # pragma once new-line
>> >> >>> shall cause no subsequent #include directives to perform
>> replacement for a file with text contents identical to this file.
>> >> >>>
>> >> >>> The wording will have to define what it means for contents to be
>> identical. Options include:
>> >> >>>
>> >> >>> The files must be byte-for-byte identical. This makes source file
>> encoding observable (which I would be strongly against).
>> >> >>> The files must encode the same character sequence post translation
>> phase 1. This makes comparisons potentially expensive.
>> >> >>>
>> >> >>> Note that the "same contents" approach obligates an implementation
>> to consider every previously encountered file for every #include directive.
>> An inode based optimization can help to determine if a file was previously
>> encountered based on identity, but it doesn't help to reduce the costs when
>> a file that was not previously seen is encountered.
>> >> >>>
>> >> >>>
>> >> >>> Tom.
>> >> >>>
>> >> >>> Jeremy
>> >> >>>
>> >> >>> On Tue, Aug 27, 2024 at 3:39 PM Tom Honermann via Std-Proposals
>> >> >>> <std-proposals_at_[hidden]> wrote:
>> >> >>>
>> >> >>> On 8/27/24 4:10 PM, Thiago Macieira via Std-Proposals wrote:
>> >> >>>
>> >> >>> On Tuesday 27 August 2024 12:35:17 GMT-7 Andrey Semashev via
>> Std-Proposals
>> >> >>> wrote:
>> >> >>>
>> >> >>> The fact that gcc took the approach to compare file contents I
>> consider
>> >> >>> a poor choice, and not an argument to standardize this
>> implementation.
>> >> >>>
>> >> >>> Another question is whether a byte comparison of two files of the
>> same size is
>> >> >>> expensive for compilers.
>> >> >>>
>> >> >>> #once ID doesn't need to compare the entire file.
>> >> >>>
>> >> >>> Another question is whether the comparison should be post
>> translation
>> >> >>> phase 1. In other words, whether differently encoded source files
>> that
>> >> >>> decode to the same sequence of code points are considered the same
>> file
>> >> >>> (e.g., a Windows-1252 version and a UTF-8 version). Standard C++
>> does
>> >> >>> not currently allow source file encoding to be observable but a
>> #pragma
>> >> >>> once implementation that only compares bytes would make such
>> differences
>> >> >>> observable.
>> >> >>>
>> >> >>> Tom.
>> >> >>>
>> >> >>> --
>> >> >>> Std-Proposals mailing list
>> >> >>> Std-Proposals_at_[hidden]
>> >> >>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Std-Proposals mailing list
>> >> >> Std-Proposals_at_[hidden]
>> >> >> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>> >> >
>> >> > --
>> >> > Std-Proposals mailing list
>> >> > Std-Proposals_at_[hidden]
>> >> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>> >> --
>> >> Std-Proposals mailing list
>> >> Std-Proposals_at_[hidden]
>> >> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>> >
>> > --
>> > Std-Proposals mailing list
>> > Std-Proposals_at_[hidden]
>> > https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>
>

Received on 2024-09-01 13:11:12