sg15: [Tooling] A possible sg15 topic: uniform description of data required for source-level tools

From: Dmitry.Kozhevnikov_at_[hidden] <dmitry.kozhevnikov_at_[hidden]>
Date: Wed, 8 Aug 2018 23:27:02 +0300

Hi everyone!

I'm working on the CLion IDE at JetBrains, and, if everything goes well, I hope to attend the San Diego meeting, which would be my first meeting ever. I'm mostly interested in tooling, and I have several ideas about how the C++ tooling landscape could be improved, so I'd like to get some feedback on whether these problems hit home for anyone, and whether it's worth it for me to put together a paper.

One of the more painful tasks when building a source-level tool (e.g. an IDE or a source-to-source utility) is collecting the compiler/toolchain information required to properly analyze the source:
- header search paths
- built-in and user-defined preprocessor definitions
- available language features (e.g. for this specific gcc version with `-std=gnu2a`, is `requires` a keyword or still an identifier?)
- compiler intrinsics (e.g. what is `__builtin_types_compatible_p` and how should we parse it?)
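
To make the "built-in preprocessor definitions" item concrete: GCC- and Clang-compatible drivers print their predefined macros as `#define` lines when run with `-dM -E` on an empty input. Below is a minimal C++ sketch of turning such a dump into a name/value map. The function name `parse_macro_dump` is hypothetical, and the sketch assumes one directive per line; other compilers need entirely different handling.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parses text in the "#define NAME VALUE" form printed
// by GCC/Clang with `-dM -E`. Assumes one directive per line. Function-like
// macros keep their parameter list as part of the key, e.g. "F(x, y)".
std::map<std::string, std::string> parse_macro_dump(const std::string& dump) {
    std::map<std::string, std::string> macros;
    const std::string prefix = "#define ";
    std::istringstream in(dump);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) != 0) continue;  // not a directive
        const std::size_t name_begin = prefix.size();
        std::size_t name_end = line.find_first_of(" (", name_begin);
        if (name_end != std::string::npos && line[name_end] == '(') {
            // Function-like macro: extend the name to the closing parenthesis.
            name_end = line.find(')', name_end);
            if (name_end != std::string::npos) ++name_end;
        }
        if (name_end == std::string::npos || name_end >= line.size()) {
            macros[line.substr(name_begin)] = "";  // defined with an empty body
            continue;
        }
        const std::size_t value_begin = line.find_first_not_of(' ', name_end);
        macros[line.substr(name_begin, name_end - name_begin)] =
            value_begin == std::string::npos ? "" : line.substr(value_begin);
    }
    return macros;
}
```

A tool would feed this the captured output of the driver; every other compiler family needs its own equivalent, which is exactly the maintenance burden described here.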

Here is a tangentially related thread on cfe-dev, which describes some of the problems and partial solutions: http://lists.llvm.org/pipermail/cfe-dev/2018-April/057683.html

Figuring everything out requires either:
1. intimate knowledge of the various compiler drivers (how to query them for features and extensions)
2. a pre-populated database of such information, so you can pick an entry and hope it's correct (e.g. you can try to guess the proper clang target triple for a given toolchain, but you can't know beforehand whether it exists and whether it actually matches the toolchain you're given).

You might say: "everyone is using real compilers to parse C++, they know it all anyway". However, they only know it about themselves, or about toolchains they're able to cross-compile to. For example:
- a clang-based tool might have trouble with more exotic compilers like Intel or Green Hills, or with a clang version more recent than the one used in the tool
- IntelliSense in MSVC (which is, AFAIK, EDG-based) might have trouble with remote projects using a fairly old or fairly new gcc

Another related problem is that it is currently very complicated to reason about conditionally-uncompiled code if you don't have access to the required toolchain:

#ifdef _WIN32
    int x = foo(); // it's complicated to find this usage when cross-referencing the `foo` symbol if you're in an IDE on Linux
#endif

So that's what I'm thinking of: it would be great to have a standardized and universally agreed-upon way to describe everything that is needed to parse a C++ file. This description could either be generated on demand using the actual compiler used for a specific file, or even be distributed with a project (for toolchains that the IDE/tool might not have access to).

As a very rough draft, it could be a JSON object like this:

{
       "file_path": "file.cpp",
       "user_macros": [
           { "X" : "" },
           { "Y" : "1" },
           ...
       ],
       "builtin_macros": [
           { "__GNUC__" : "4" },
           ...
       ],
       "builtin_macro_predicates": [
           {
               "__has_feature" : ["cxx_lambdas", "cxx_modules"],
               "__has_extension" : ["cxx_lambdas", "cxx_modules"],
               "__has_builtin" : ["__type_pack_element"]
           },
           ...
       ],
       "function_like_builtins": ["__builtin_offsetof", ...],
       "template_alias_like_builtins": ["__type_pack_element", ...],
       "features": { "exceptions" : true, "concepts" : false, ... },

       "type_sizes" : {
           "int" : 4,
           "long": 8,
           "char": 1,
           ...
       },

       "header_search_paths": [
           { "path": "target/p1", "builtin": 1, "quote": 0 },
           { "path": "target/p2", "builtin": 0, "quote": 1 }
       ],

       "compiler_version": "...",
       "compiler_executable": "...",
       "working_directory": "..."
}

Of course, it would be much bigger (it’ll contain roughly everything which is required for a syntax-only pass of a compiler frontend).
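
To sketch how a tool might consume such a description, here is a minimal C++ model of a small subset of the draft. All names (`ToolchainDescription`, `is_defined`, `evaluate_predicate`, `feature_enabled`) are hypothetical, and the sketch assumes the lists of single-pair objects from the draft have been collapsed into maps after loading. It only illustrates answering preprocessor-level questions without running the real compiler.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory mirror of a subset of the draft description.
struct ToolchainDescription {
    std::map<std::string, std::string> user_macros;     // "user_macros"
    std::map<std::string, std::string> builtin_macros;  // "builtin_macros"
    // "builtin_macro_predicates": predicate name -> arguments that yield true.
    std::map<std::string, std::set<std::string>> builtin_macro_predicates;
    std::map<std::string, bool> features;                // "features"
};

// Answers `#ifdef name` for the described configuration.
bool is_defined(const ToolchainDescription& d, const std::string& name) {
    return d.user_macros.count(name) != 0 || d.builtin_macros.count(name) != 0;
}

// Answers e.g. `__has_builtin(arg)`; predicates or arguments that are not
// listed are treated as false (an assumption of this sketch).
bool evaluate_predicate(const ToolchainDescription& d,
                        const std::string& predicate,
                        const std::string& arg) {
    const auto it = d.builtin_macro_predicates.find(predicate);
    return it != d.builtin_macro_predicates.end() && it->second.count(arg) != 0;
}

// Looks up an entry of "features"; missing entries are treated as disabled.
bool feature_enabled(const ToolchainDescription& d, const std::string& name) {
    const auto it = d.features.find(name);
    return it != d.features.end() && it->second;
}
```

With one such object per configuration, an IDE running on Linux could load a description produced for a Windows toolchain and evaluate `is_defined(d, "_WIN32")` to decide which conditional branches are active.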

An interesting question is what to do with various intrinsics and builtins. For example, they could be listed and also annotated with some properties (e.g. this one is function-like, and that one is a "function" that takes types and returns a value). So if a tool/IDE knows exactly how to handle a builtin, it will; if not, it could at least recover during the parse far more gracefully than by just treating it as an unknown identifier.
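
As an illustration of that kind of recovery, here is a minimal C++ sketch that uses a per-builtin annotation to skip over a use of a builtin the tool does not otherwise understand. The `BuiltinKind` enumerators and both function names are hypothetical, and the input is assumed to be already split into tokens (with `<`, `>`, `(` and `)` as separate tokens); a real parser would be considerably more careful.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical annotation attached to each builtin in the description.
enum class BuiltinKind { FunctionLike, TypeFunctionLike, TemplateAliasLike };

// Returns the index one past the bracket that closes tokens[open], or
// tokens.size() if the brackets are unbalanced. tokens[open] must be open_tok.
std::size_t skip_balanced(const std::vector<std::string>& tokens, std::size_t open,
                          const std::string& open_tok, const std::string& close_tok) {
    int depth = 0;
    for (std::size_t i = open; i < tokens.size(); ++i) {
        if (tokens[i] == open_tok) {
            ++depth;
        } else if (tokens[i] == close_tok && --depth == 0) {
            return i + 1;
        }
    }
    return tokens.size();
}

// tokens[pos] is an identifier. If it is a known builtin, returns the index of
// the first token after its argument list; otherwise returns pos + 1.
std::size_t skip_builtin_use(const std::vector<std::string>& tokens, std::size_t pos,
                             const std::map<std::string, BuiltinKind>& builtins) {
    const auto it = builtins.find(tokens[pos]);
    if (it == builtins.end() || pos + 1 >= tokens.size()) return pos + 1;
    if (it->second == BuiltinKind::TemplateAliasLike) {
        // Used like a template: skip the <...> argument list.
        return tokens[pos + 1] == "<" ? skip_balanced(tokens, pos + 1, "<", ">") : pos + 1;
    }
    // Function-like (taking values or types): skip the (...) argument list.
    return tokens[pos + 1] == "(" ? skip_balanced(tokens, pos + 1, "(", ")") : pos + 1;
}
```

Even without understanding what the builtin computes, the tool can resume parsing at the returned position instead of producing a cascade of errors.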

Q: How can we get such data?

1. New and collaborating compilers can produce it themselves (e.g. this is a step in this direction: https://reviews.llvm.org/rL333653)
2. For older or non-collaborating compilers, there could be a community-maintained tool which would aggregate all the knowledge about their possible arguments, driver quirks, output formats, available builtins, etc. (I have a private prototype of this tool which I'm trying to use for IDE regression tests; however, it's very far from being useful yet. I hope to open-source it sooner or later.)
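
As one example of the output formats such a tool would need to know: GCC and Clang, when run with `-v -E`, print their header search paths between the lines `#include "..." search starts here:`, `#include <...> search starts here:` and `End of search list.`. Below is a minimal C++ sketch of extracting them. The names `SearchPath` and `parse_search_paths` are hypothetical, and annotations that some drivers append to a path (such as a framework-directory marker) are not handled.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one entry of the search list.
struct SearchPath {
    std::string path;
    bool quote;  // true if listed in the #include "..." section
};

// Parses the search list printed by GCC/Clang with `-v -E`.
// Assumes each path is on its own line, indented by a single space.
std::vector<SearchPath> parse_search_paths(const std::string& verbose_output) {
    enum class Section { None, Quote, Angle };
    Section section = Section::None;
    std::vector<SearchPath> paths;
    std::istringstream in(verbose_output);
    std::string line;
    while (std::getline(in, line)) {
        if (line == "#include \"...\" search starts here:") {
            section = Section::Quote;
        } else if (line == "#include <...> search starts here:") {
            section = Section::Angle;
        } else if (line == "End of search list.") {
            section = Section::None;
        } else if (section != Section::None && !line.empty() && line[0] == ' ') {
            paths.push_back({line.substr(1), section == Section::Quote});
        }
    }
    return paths;
}
```

Other compilers report this information in different ways or not at all, which is the knowledge a community-maintained tool would have to accumulate.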

TLDR: What are the benefits?

1. An arbitrary IDE would be able to work with an arbitrary compiler (given that it provides all the required info, or that someone (e.g. the compiler authors themselves) has contributed everything required to a community-maintained tool). This would, hopefully, lead to better tool adoption, and would share some of the tool authors' maintenance burden with the rest of the community :)
2. It opens up the possibility of having proper code insight for configurations you're not able to build locally.

What do you think? Does it all make sense? Should I put more effort into it and try to write a paper?

Best regards,
Dmitry Kozhevnikov

Received on 2018-08-08 22:27:07