Date: Sun, 16 Apr 2023 16:13:04 +0100
On Fri, Apr 14, 2023, Breno Guimarães wrote:
>
> It looks like there is work around supporting JIT in C++:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1609r1.html
> I'm not sure what is the status of that.
Writing a 44-byte thunk with only two 64-Bit values substituted into it
is nowhere near as extreme as what they're doing in that paper. If I
wanted to do what they're doing in that paper, I would just invoke 'g++'
from my program to create a shared library and then I'd use 'dlopen' on
the shared library.
- - - moving on to the next contributor - - -
Thiago wrote:
> I know you didn't even test your code because it can't work, at least
> not without some more code that you didn't include.
I know you skim my posts on your best days, but anyway I have it tested
and working. I'll go into more detail in the following paragraphs if
you're not too busy cutting the grass or cleaning out the fish tank.
When you make a function call on x86_64, the first six int/pointer
parameters go in the registers rdi, rsi, rdx, rcx, r8, r9, and the rest
are pushed onto the stack (thankfully from right to left, which makes
this all possible). When you invoke a member function on an object,
the hidden 'this' pointer goes in rdi. And so if you want to write a
thunk to remove the hidden 'this' pointer, then you must shift all the
registers down. So you put r9 on the stack, then you put r8 in r9, then
you put rcx in r8, then you put rdx in rcx, and so on. The assembler for
moving the registers down goes like this:
    push r9
    mov r9,r8
    mov r8,rcx
    mov rcx,rdx
    mov rdx,rsi
    mov rsi,rdi
I wrote a thunk that does this, and I got it working:
https://godbolt.org/z/xTfEYo5jE
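To make the register shuffle above a bit more concrete, here is a rough C++-level picture of what the thunk achieves. It is only a sketch: 'Counter', 'add_explicit', 'g_bound' and 'add_thunk' are hypothetical names made up for this illustration, and the global pointer merely stands in for the object address that the real thunk has substituted into its machine code.

```cpp
#include <cassert>

struct Counter {                      // hypothetical class, for illustration only
    int base;
    int add(int a, int b) const { return base + a + b; }
};

// The same call with the object pointer written out as the first parameter.
// A non-static member function receives 'this' as if it were the first
// argument, so (self, a, b) here mirrors the layout (this, a, b).
int add_explicit(Counter const *self, int a, int b)
{
    return self->add(a, b);
}

// Stands in for the object address stored inside the thunk (an assumption
// of this sketch; the real thunk carries the address in its own bytes).
Counter const *g_bound = nullptr;

// The signature a C-style callback expects: no object pointer at all.
// Every visible argument moves one slot to the right to make room for it.
int add_thunk(int a, int b)
{
    assert(nullptr != g_bound);
    return add_explicit(g_bound, a, b);
}
```

With 'g_bound' pointing at a Counter whose base is 100, calling add_thunk(2, 3) through a plain 'int (*)(int, int)' gives the same result as calling add(2, 3) on the object directly.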
This 44-byte thunk will work for any lambda whose return type does not
exceed 16 bytes. When the return type is <= 16 bytes and is made up of
integers or pointers, the first 8 bytes go in RAX and the second 8 bytes
go in RDX.
If the return type is bigger than 16 bytes, then RAX and RDX are used
totally differently. Normally RDI is used for the hidden 'this' pointer,
but instead it is used to specify the address at which the return value
should be stored. I can demonstrate with the following sample code:
    void *volatile global;

    struct ReturnTypeA {
        long long unsigned a,b,c;
    };

    struct MyClass {
        ReturnTypeA FuncA(void)
        {
            global = this;
            return { 0x1111111111111111, 0x2222222222222222,
                     0x3333333333333333 };
        }
    };
which gets compiled to:
    // Next two lines set up the stack frame
    push rbp
    mov rbp,rsp
    // The next line sets return value = first argument
    mov rax,rdi
    // The next 3 lines appear to get the 'this'
    // pointer from RSI and then store it in 'global'
    mov QWORD PTR [rbp-0x8],rsi
    mov rcx,QWORD PTR [rbp-0x8]
    mov QWORD PTR [rip+0x0],rcx # R_X86_64_PC32 global-0x4
    // The next 2 lines store a 64-Bit number in rdi+0
    mov rcx,0x1111111111111111
    mov QWORD PTR [rdi],rcx
    // The next 2 lines store a 64-Bit number in rdi+8
    mov rcx,0x2222222222222222
    mov QWORD PTR [rdi+0x8],rcx
    // The next 2 lines store a 64-Bit number in rdi+16
    mov rcx,0x3333333333333333
    mov QWORD PTR [rdi+0x10],rcx
    // The next 2 lines restore the frame pointer and return
    pop rbp
    ret
So now I know that RSI is used for the hidden 'this' pointer if the
return type exceeds 16 bytes in size. So now I have two different kinds
of thunk code:
Type A = for lambdas whose return type <= 16 bytes
Type B = for lambdas whose return type > 16 bytes
I have written the thunk code for Type B lambdas. I had to alter the
instructions to store the hidden 'this' pointer in RSI instead of RDI.
I got it working:
https://godbolt.org/z/65YEsaT8o
Finally I wanted to write just one 'LambdaThunk' class that would work
for any kind of lambda, irrespective of parameters or return type. In
order to do this I would have to get the size of the return type of the
lambda, and use different code if it's > 16 bytes. I also got this
working, see here:
https://godbolt.org/z/MzYGxfz9Y
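For anyone who doesn't want to click through, the compile-time part of that choice can be sketched roughly as follows. This is not the code from the link, just a minimal illustration of picking Type A or Type B from the size of the lambda's return type. The names 'return_type_of', 'lambda_return_t', 'size_or_zero' and 'needs_type_b_thunk' are invented for this sketch. It uses the size-only rule described above, so it makes no attempt to handle return types with a non-trivial copy constructor or destructor, which are returned via a hidden pointer regardless of size.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t

struct ReturnTypeA {                 // the 24-byte example type from earlier
    long long unsigned a, b, c;
};

// Declarations only (never called): deduce R from a pointer to operator().
// Works for non-generic lambdas, with or without 'mutable'.
template<typename L, typename R, typename... Params>
R return_type_of(R (L::*)(Params...) const);
template<typename L, typename R, typename... Params>
R return_type_of(R (L::*)(Params...));

template<typename LambdaType>
using lambda_return_t = decltype(return_type_of(&LambdaType::operator()));

// sizeof(void) is ill-formed, so a void return counts as size zero.
template<typename T>
inline constexpr std::size_t size_or_zero = sizeof(T);
template<>
inline constexpr std::size_t size_or_zero<void> = 0u;

// true  -> Type B thunk (return value written through a hidden pointer)
// false -> Type A thunk (return value comes back in registers)
template<typename LambdaType>
inline constexpr bool needs_type_b_thunk =
    (size_or_zero<lambda_return_t<LambdaType>> > 16u);
```

A class like 'LambdaThunk' could then use 'if constexpr (needs_type_b_thunk<LambdaType>)' to decide which byte sequence to emit.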
Thiago, you mentioned that the stack is not executable. This is an
unnecessary restriction imposed by some operating systems on some
processes -- it is not a limit of computer science nor of the x86_64
CPU instruction set.
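Where the stack really is non-executable, one common workaround is to put the generated bytes in a separately mapped page instead. The following is only a sketch of that idea, assuming a POSIX system with 'mmap' and 'mprotect'. It is not necessarily what the linked examples do. The function name 'run_generated_code' is made up, and the six bytes are simply 'mov eax,42' followed by 'ret'. On a non-x86_64 machine, or if the operating system refuses the request, the function returns -1.

```cpp
#include <cassert>
#include <cstddef>      // std::size_t
#include <cstring>      // std::memcpy
#include <sys/mman.h>   // mmap, mprotect, munmap (POSIX)

// Copies a tiny piece of x86_64 machine code into a fresh page, flips the
// page from writable to executable, calls it, and returns what it returned.
int run_generated_code()
{
#if defined(__x86_64__)
    static unsigned char const code[] = {
        0xB8, 0x2A, 0x00, 0x00, 0x00,   // mov eax,42
        0xC3                            // ret
    };
    std::size_t const len = 4096u;      // mmap rounds up to whole pages anyway

    void *const mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if ( MAP_FAILED == mem ) return -1;

    std::memcpy(mem, code, sizeof code);

    // Write first, then make executable, so the page is never both at once.
    if ( 0 != mprotect(mem, len, PROT_READ | PROT_EXEC) )
    {
        munmap(mem, len);
        return -1;
    }

    // void* to function pointer is conditionally-supported; GCC and Clang
    // accept it on this platform.
    auto const pf = reinterpret_cast<int (*)(void)>(mem);
    int const result = pf();

    munmap(mem, len);
    return result;
#else
    return -1;   // the byte sequence above is x86_64-only
#endif
}
```

A real thunk would of course copy in its own bytes with the object address and function address patched in, and would keep the page alive for as long as the function pointer is in use.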
It took me a weekend to get this working for every conceivable lambda
function on x86_64 computers that use the System V AMD64 ABI calling
convention, and I only really started programming in assembler properly
about a month ago. I reckon a skilled assembler programmer would get
this working for other architectures in just a few hours.
I've written this solution specifically for lambdas but really it would
work on any object to turn a member function pointer into a normal
function pointer (i.e. to remove the hidden 'this' pointer). In fact it
doesn't necessarily have to be all about the 'this' pointer, it could be
used to remove _any_ first parameter to a function.
If this 'thunk' feature were added to the Standard, it could be
implemented in two ways:
(A) The efficient way, exactly how I've implemented it
(B) The inefficient way, by having a thread_local pointer to the lambda
object
If a given architecture is unwilling or unable to do A, then the
following is the inefficient alternative:
https://godbolt.org/z/sG1rTbWE6
And here it is copy-pasted:
    #include <cassert>  // assert
    #include <cstddef>  // size_t
    #include <utility>  // forward

    template<typename LambdaType>
    class LambdaThunk {
    protected:
        static thread_local LambdaType *p_lambda_object;

        template <typename ReturnType, typename... Params>
        static ReturnType Actual_Thunk(Params... args)
        {
            assert( nullptr != p_lambda_object );
            return (*p_lambda_object)(std::forward<Params>(args)...);
        }

        template <typename ReturnType, typename... Params>
        static ReturnType (*Get_Thunk_Address(ReturnType
            (LambdaType::*)(Params...) const))(Params...)
        {
            return Actual_Thunk<ReturnType,Params...>;
        }

    public:
        LambdaThunk(LambdaType &obj)
        {
            p_lambda_object = &obj;
        }

        auto thunk(void) const volatile // yes this could be a static function
        {
            return Get_Thunk_Address(&LambdaType::operator());
        }
    };

    template<typename LambdaType>
    thread_local LambdaType *LambdaThunk<LambdaType>::p_lambda_object = nullptr;

    int Some_Library_Func(int (*const pf)(char const*))
    {
        return pf("monkey");
    }

    #include <iostream>
    using std::cout;
    using std::endl;

    int main(int argc, char **argv)
    {
        auto mylambda = [argc](char const *const p) -> int
        {
            cout << "Hello " << argc << " " << p << "!" << endl;
            return 77;
        };

        int const z = Some_Library_Func( LambdaThunk(mylambda).thunk() );
        cout << "z = " << z << endl;
    }
I think the only place where this inefficient implementation could fall
down is if you have a lambda defined inside a recursive function... but
you'd just need to make sure that the function pointer is invoked before
the function is re-entered (unless of course, upon re-entry, you accommodate
the function pointer not having been invoked yet).
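To show what that failure looks like without actually recursing, here is a cut-down sketch. 'SlotThunk', 'make_lambda' and 'call_first_after_second_bound' are hypothetical names, and the signature is fixed at int(int) purely to keep it short. Evaluating the same lambda expression twice gives two objects of one closure type, which is exactly the situation a recursive function creates, and both objects share the single thread_local slot.

```cpp
#include <cassert>
#include <type_traits>   // std::is_same_v

// Simplified thread_local thunk: one slot per lambda TYPE, not per object.
template<typename LambdaType>
class SlotThunk {
    static inline thread_local LambdaType *slot = nullptr;

    static int invoke(int x)
    {
        assert( nullptr != slot );
        return (*slot)(x);
    }

public:
    explicit SlotThunk(LambdaType &obj) { slot = &obj; }

    auto thunk() const -> int (*)(int) { return &invoke; }
};

// Every call evaluates the same lambda expression, so every result has the
// same closure type but its own captured 'tag'.
inline auto make_lambda(int tag)
{
    return [tag](int x) { return tag * 100 + x; };
}

int call_first_after_second_bound()
{
    auto first  = make_lambda(1);
    auto second = make_lambda(2);
    static_assert( std::is_same_v<decltype(first), decltype(second)> );

    int (*const pf_first )(int) = SlotThunk<decltype(first )>(first ).thunk();
    int (*const pf_second)(int) = SlotThunk<decltype(second)>(second).thunk();

    assert( pf_first == pf_second );   // one function, one shared slot

    // The slot was rebound by the second constructor, so this runs 'second'.
    return pf_first(5);
}
```

Here call_first_after_second_bound() returns 205 rather than the 105 one might expect from 'first'. In the genuinely recursive case it is worse, because the slot can be left pointing at a lambda object that has already gone out of scope.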
Received on 2023-04-16 15:13:12