On Sun, 16 Apr 2023 at 13:14, Thiago Macieira via Std-Proposals <std-proposals@lists.isocpp.org> wrote:
On Sunday, 16 April 2023 12:13:04 -03 Frederick Virchanza Gotham via Std-Proposals wrote:
> Writing a 44-byte thunk with only two 64-Bit values substituted into it
> is nowhere near as extreme as what they're doing in that paper. If I
> wanted to do what they're doing in that paper, I would just invoke 'g++'
> from my program to create a shared library and then I'd use 'dlopen' on
> the shared library.

Except that their proposal is well-researched to the point of finding obscure
features, and known to work. It comes from people who have a lot of experience
with assembly, ABIs, and compilers. By your own admission, you started
learning assembly only a month ago.

So please accept free advice when given by people who have a lot more
experience than you in this area.

> When you make a function call on x86_64, the first six int/pointer

I know *exactly* how function calls work on x86-64. My day job is not only
knowing x86-64 at this level, but one and two levels below this level (uop/
microarchitectural and the physical implementation). In fact, the reason I am
aware of the SELinux problems I was describing is *because* I was working with
the glibc/binutils team to add an optimisation to Linux/ELF that required a
bit of JIT inside the dynamic loader, and ran afoul of protections.
Fortunately, since their work was an optimisation, it can gracefully fail and
fall back to the existing status quo.

They haven't published the proposal yet, so I can't give you a link.

> Thiago you mentioned that the stack is not executable. This is an
> unnecessary restriction imposed by some operating systems on some
> processes -- it is not a limit of computer science nor of the x86_64
> CPU instruction set.

Correct. It's not a requirement imposed by the CPU on x86_64, but it is a fact
of life. There's no option inside the language to turn that off; it's a
compiler option and each compiler has a different one. I don't know how other
architectures work and whether they even allow this. See Edward's reply about
M32R.

Moreover, non-executable stacks are a security protection feature, preventing
code injection attacks. You will not get anyone to accept turning that off for
new code: the compiler options are retained to deal with old code that
(ab)used the functionality. Here's my advice: don't try to argue this, it will
lead people to dismiss your proposal and it becomes Dead On Arrival. Accept
that the stack is non-executable and move on.

In fact, accept that you cannot have write-and-execute pages at all, anywhere.
The best you can do is mmap() a new, writable page, to which you write the new
code, then you mprotect() it to make it read-only & executable. That means you
have a minimum allocation of one page (4096 bytes on x86). If your thunk is 44
bytes, then you have roughly a 9200% overhead.
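
To make the write-then-protect sequence concrete, here is a minimal sketch, assuming Linux on x86-64. The helper name run_generated_code and the six-byte payload (mov eax, 42; ret) are made up for illustration and are not from anyone's proposal. The page is never writable and executable at the same time, and the function reports failure instead of crashing if the system refuses the mprotect() call (as a hardened SELinux policy may).

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Hypothetical helper: generates a tiny function at run time and calls it.
// Returns true and stores the function's result in `out` on success;
// returns false if the platform is unsupported or the OS refuses a step.
bool run_generated_code(int& out)
{
#if defined(__x86_64__) && defined(__linux__)
    // x86-64 machine code for: mov eax, 42 ; ret
    static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    const long page = sysconf(_SC_PAGESIZE);   // typically 4096
    if (page <= 0) return false;
    const std::size_t len = static_cast<std::size_t>(page);

    // Step 1: a fresh page that is writable but NOT executable.
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;

    // Step 2: write the code while the page is still writable.
    std::memcpy(mem, code, sizeof code);

    // Step 3: flip to read + execute; the page is no longer writable.
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return false;   // e.g. denied by a security policy
    }

    // Step 4: call it. A whole page was spent on six bytes of code.
    int (*fn)() = reinterpret_cast<int (*)()>(mem);
    out = fn();

    munmap(mem, len);
    return true;
#else
    (void)out;
    return false;       // sketch only covers Linux on x86-64
#endif
}
```

A caller has to be prepared for the false return, which is the same "gracefully fail and fall back" requirement mentioned earlier for the dynamic-loader optimisation.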

libffi has a nice optimization where it maps the same page at two different addresses, one mapping writeable and the other executable. This is probably even more of an issue for security, though.
 
> If this 'thunk' feature were added to the Standard, it could be
> implemented in two ways:
>   (A) The efficient way, exactly how I've implemented it
>   (B) The inefficient way, by having a thread_local function pointer

I'm pretty sure that neither (A) nor (B) works. Your (A) requires JIT, which
isn't available everywhere. Your (B) requires that only one such state be held
per thread -- so what happens if you try to chain them?
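
To illustrate the chaining problem with (B), here is a minimal sketch of what a single thread_local slot might look like. All names (callback_t, current_state, trampoline, make_thunk) are hypothetical, and the int(int) signature is an arbitrary choice; this is not the implementation from the original post.

```cpp
#include <functional>
#include <utility>

// Assumed C-style callback signature, chosen only for illustration.
using callback_t = int (*)(int);

// The one and only per-thread slot holding the callable's state.
thread_local std::function<int(int)> current_state;

// Every thunk is this same function; it forwards to whatever is in the slot.
int trampoline(int x) { return current_state(x); }

// Hypothetical factory: stores the callable and hands out the trampoline.
callback_t make_thunk(std::function<int(int)> f)
{
    current_state = std::move(f);   // overwrites any previous thunk's state
    return &trampoline;
}
```

With this sketch, creating a second thunk before the first has been used silently redirects the first one: after make_thunk(add_one) followed by make_thunk(times_ten), both returned pointers are equal and both call times_ten.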

> If a given architecture is unwilling or unable to do A, then the
> following is the inefficient alternative:

Assume that a share of implementations indistinguishable from 100% will be
unable to do (A), then rephrase your proposal with the benefit versus the
actual cost it's going to have.

> I think the only place where this inefficient implementation could fall
> down is if you have a lambda defined inside a recursive function... but
> you'd just need to make sure that the function pointer is invoked before
> the function is re-entered (unless of course, upon re-entry, you accommodate
> the function pointer not having been invoked yet).

Do so in your proposal.

There's another technique that works as long as you're OK with having a limit to the number of simultaneous thunks outstanding. Basically, instead of a single thread-local data pointer you have a fixed-sized array of them, and a same-sized array of function pointers to hand out to the callback-based API. Demo: https://godbolt.org/z/cGra4Kjxn
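
For readers who don't want to follow the link, here is a minimal sketch of the general idea. It is written independently for this message and is not the code from the linked demo. The names (kMaxThunks, slots, slot_trampoline, acquire_thunk, release_thunk) and the int(int) signature are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <utility>

// Assumed C-style callback signature, chosen only for illustration.
using callback_t = int (*)(int);

// Hard limit on simultaneously outstanding thunks (arbitrary for the sketch).
constexpr std::size_t kMaxThunks = 4;

// One state slot per thunk, per thread. An empty std::function means "free".
thread_local std::array<std::function<int(int)>, kMaxThunks> slots;

// One distinct function per slot; the index is baked in at compile time.
template <std::size_t I>
int slot_trampoline(int x) { return slots[I](x); }

template <std::size_t... Is>
constexpr std::array<callback_t, sizeof...(Is)>
make_table(std::index_sequence<Is...>)
{
    return {{ &slot_trampoline<Is>... }};
}

// table[i] is the plain function pointer that reads slots[i].
constexpr auto table = make_table(std::make_index_sequence<kMaxThunks>{});

// Claims a free slot and returns its function pointer, or nullptr if all
// kMaxThunks slots are in use on this thread.
callback_t acquire_thunk(std::function<int(int)> f)
{
    for (std::size_t i = 0; i < kMaxThunks; ++i) {
        if (!slots[i]) {
            slots[i] = std::move(f);
            return table[i];
        }
    }
    return nullptr;
}

// Marks the slot behind `cb` as free again.
void release_thunk(callback_t cb)
{
    for (std::size_t i = 0; i < kMaxThunks; ++i)
        if (table[i] == cb) slots[i] = nullptr;
}
```

No code is generated at run time, so this works with non-executable stacks and W^X policies. The costs are the fixed limit and the need for the caller to release slots explicitly.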