If the compiler processes it, there is no need for you to provide assembler, or to define how implementations do it.
It is simply compiled, and even optimized, like normal C++ code.
You could (with a C++ standard proposal) just declare two template functions (templates, to stay flexible about the parameters and overloads), one called before a call to the marked function and one called after it; see the sketch below. (The open question is whether the marking may be an attribute, since it changes the behaviour of the program.)
Then you
- get notified about the call,
- can set a debug breakpoint (at a location that is not optimized away),
- can log the parameters,
- get notified about the return and the return value (and hence its type).
Something like _penter, but inserted at the call site:
https://learn.microsoft.com/en-us/cpp/build/reference/gh-enable-penter-hook-function
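
A minimal sketch of the idea in today's C++ (all names here, call_prologue/call_epilogue and the hand-written expansion in caller(), are hypothetical placeholders, not proposed wording):

#include <cstdio>

// Hook templates the user provides; being templates keeps them flexible
// about the parameter types and the return type.
template<class... Args>
void call_prologue(const char* name, const Args&... /*args*/)
{
    std::printf("about to call %s\n", name);   // notification / breakpoint / logging point
}

template<class Ret>
void call_epilogue(const char* name, const Ret& /*result*/)
{
    std::printf("returned from %s\n", name);   // notification about the return (the value carries the type)
}

int RotateMatrix(int degrees) { return degrees % 360; }   // stand-in for a marked function

int caller(int d)
{
    // What the compiler would insert around a call to the marked function:
    call_prologue("RotateMatrix", d);
    int r = RotateMatrix(d);
    call_epilogue("RotateMatrix", r);
    return r;
}

int main() { return caller(450) == 90 ? 0 : 1; }

The compiler would generate the prologue/epilogue calls itself at every call site of a marked function; the sketch just writes them out by hand.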
-----Original Message-----
From: Frederick Virchanza Gotham via Std-Proposals <std-proposals@lists.isocpp.org>
Sent: Tue 14.04.2026 11:42
Subject: Re: [std-proposals] Interceptor Function (preserve stack and all registers)
To: std-proposals@lists.isocpp.org;
CC: Frederick Virchanza Gotham <cauldwell.thomas@gmail.com>;
Forget about all the [[musttail]] stuff. I think I've settled on a better way of doing this.

Firstly, here's how you'd write an interceptor:

[[interceptor]] void Logger(void)
{
    puts("Function called");
    goto -> GetProcAddress( LoadLibraryA("graphics.dll"), "RotateMatrix" );
}

When the compiler encounters the above, it will pretend a few things:

(1) The function name will have "__core_" prepended to it (i.e. __core_Logger). The name will be mangled, so it can be a template function inside a namespace.
(2) The signature will be changed to "auto (*)(void) -> void(*)(void)"
(3) The "goto ->" will be changed to "return"

So here's what you'll have:

auto __core__Z6Loggerv(void) -> void(*)(void)
{
    puts("Function called");
    return dlsym( dlopen("graphics", RTLD_NOW), "RotateMatrix" );
}

Next the compiler will emit a second function in assembler named '_Z6Loggerv' as follows:

_Z6Loggerv:
    save_call_state
    call __core__Z6Loggerv
    stash_address_in_scratch_register
    restore_call_state
    jmp *scratch_register

Of course, 'save_call_state' will differ by architecture (e.g. x86_64 vs aarch64), and also by calling convention (e.g. SystemV vs msabi). While I can't write 'save_call_state' once for every architecture, I *can* write it so that it works for all known calling conventions on the selected architecture (so for example on x86_32, it will work for cdecl, stdcall and thiscall).

I am implementing this first of all for x86_64. The registers needed for SystemV are a superset of the registers needed for msabi, so I just save all the SystemV registers. Msabi requires 32 bytes of shadow space on the stack before a function call, so I've added that in (it's harmless on SystemV).

Some x86_64 processors only have SSE, some have AVX-256, and some have AVX-512, so at runtime I need to execute the CPUID instruction to find out which.
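
Just to show that dispatch in C++ terms first, here's a sketch (assuming the GCC/Clang __builtin_cpu_supports builtin; the stub itself obviously can't use it and does the equivalent checks in assembler, including the OSXSAVE/XGETBV test for OS support of the wide registers):

#include <cstdio>

// Sketch only: picks the mode tag used by the stub below
// (1 = SSE, 2 = AVX, 3 = AVX-512) via the compiler's CPUID helper.
static int select_save_mode()
{
    if (__builtin_cpu_supports("avx512f")) return 3;
    if (__builtin_cpu_supports("avx"))     return 2;
    return 1;   // baseline SSE is always available on x86_64
}

int main()
{
    std::printf("mode = %d\n", select_save_mode());
}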
Here's the assembler I've got for all this at the moment:

# ============================================================
# PUSH_ARGS_AND_MODE
#
# Saves:
#   GP: rax, rdi, rsi, rdx, rcx, r8, r9
#   mode tag:
#     1 = SSE    (xmm0..xmm7)
#     2 = AVX    (ymm0..ymm7)
#     3 = AVX512 (zmm0..zmm7 + k0..k7)
#
# Notes:
#   - CPUID is executed every entry.
#   - RBX is preserved around CPUID because CPUID clobbers it.
#   - RSP starts 8 mod 16 on both SysV and Microsoft x64 entry.
#   - The chosen frame sizes keep RSP 0 mod 16 before calling __core_Logger
# ============================================================
.macro PUSH_ARGS_AND_MODE
    # Save inbound GP argument state first.
    # Offsets:
    #   0  rax
    #   8  rdi
    #   16 rsi
    #   24 rdx
    #   32 rcx
    #   40 r8
    #   48 r9
    sub $56, %rsp
    mov %rax,  0(%rsp)
    mov %rdi,  8(%rsp)
    mov %rsi, 16(%rsp)
    mov %rdx, 24(%rsp)
    mov %rcx, 32(%rsp)
    mov %r8,  40(%rsp)
    mov %r9,  48(%rsp)

    # Preserve RBX around CPUID.
    push %rbx

    # Detect AVX / AVX-512 each time.
    # Default fallback is SSE.
    mov $1, %eax
    cpuid
    bt $27, %ecx          # OSXSAVE
    jnc .Lpush_sse\@
    bt $28, %ecx          # AVX
    jnc .Lpush_sse\@
    xor %ecx, %ecx
    xgetbv                # XCR0 -> edx:eax
    mov %eax, %r10d
    and $0x6, %r10d       # bits 1,2 => XMM+YMM
    cmp $0x6, %r10d
    jne .Lpush_sse\@
    mov $7, %eax
    xor %ecx, %ecx
    cpuid
    bt $16, %ebx          # AVX512F
    jnc .Lpush_avx\@
    xor %ecx, %ecx
    xgetbv
    and $0xE6, %eax       # bits 1,2,5,6,7
    cmp $0xE6, %eax
    je .Lpush_avx512\@

.Lpush_avx\@:
    pop %rbx
    # Layout after second subtraction:
    #   0   mode = 2
    #   8   ymm0
    #   40  ymm1
    #   72  ymm2
    #   104 ymm3
    #   136 ymm4
    #   168 ymm5
    #   200 ymm6
    #   232 ymm7
    #   264 pad
    #   272 saved GP block (56 bytes)
    #
    # Additional size = 272, total frame = 328 bytes.
    sub $272, %rsp
    movq $2, 0(%rsp)
    vmovdqu %ymm0,   8(%rsp)
    vmovdqu %ymm1,  40(%rsp)
    vmovdqu %ymm2,  72(%rsp)
    vmovdqu %ymm3, 104(%rsp)
    vmovdqu %ymm4, 136(%rsp)
    vmovdqu %ymm5, 168(%rsp)
    vmovdqu %ymm6, 200(%rsp)
    vmovdqu %ymm7, 232(%rsp)
    jmp .Lpush_done\@

.Lpush_avx512\@:
    pop %rbx
    # Layout after second subtraction:
    #   0   mode = 3
    #   8   zmm0
    #   72  zmm1
    #   136 zmm2
    #   200 zmm3
    #   264 zmm4
    #   328 zmm5
    #   392 zmm6
    #   456 zmm7
    #   520 k0
    #   528 k1
    #   536 k2
    #   544 k3
    #   552 k4
    #   560 k5
    #   568 k6
    #   576 k7
    #   584 pad
    #   592 saved GP block (56 bytes)
    #
    # Additional size = 592, total frame = 648 bytes.
    sub $592, %rsp
    movq $3, 0(%rsp)
    vmovdqu64 %zmm0,   8(%rsp)
    vmovdqu64 %zmm1,  72(%rsp)
    vmovdqu64 %zmm2, 136(%rsp)
    vmovdqu64 %zmm3, 200(%rsp)
    vmovdqu64 %zmm4, 264(%rsp)
    vmovdqu64 %zmm5, 328(%rsp)
    vmovdqu64 %zmm6, 392(%rsp)
    vmovdqu64 %zmm7, 456(%rsp)
    kmovq %k0, 520(%rsp)
    kmovq %k1, 528(%rsp)
    kmovq %k2, 536(%rsp)
    kmovq %k3, 544(%rsp)
    kmovq %k4, 552(%rsp)
    kmovq %k5, 560(%rsp)
    kmovq %k6, 568(%rsp)
    kmovq %k7, 576(%rsp)
    jmp .Lpush_done\@

.Lpush_sse\@:
    pop %rbx
    # Layout after second subtraction:
    #   0   mode = 1
    #   8   xmm0
    #   24  xmm1
    #   40  xmm2
    #   56  xmm3
    #   72  xmm4
    #   88  xmm5
    #   104 xmm6
    #   120 xmm7
    #   136 pad
    #   144 saved GP block (56 bytes)
    #
    # Additional size = 144, total frame = 200 bytes.
    sub $144, %rsp
    movq $1, 0(%rsp)
    movdqu %xmm0,   8(%rsp)
    movdqu %xmm1,  24(%rsp)
    movdqu %xmm2,  40(%rsp)
    movdqu %xmm3,  56(%rsp)
    movdqu %xmm4,  72(%rsp)
    movdqu %xmm5,  88(%rsp)
    movdqu %xmm6, 104(%rsp)
    movdqu %xmm7, 120(%rsp)

.Lpush_done\@:
.endm

# ============================================================
# POP_ARGS_AND_JUMP
#
# Expects:
#   r11 = final jump target returned by __core_Logger
#
# Restores according to mode tag, then jumps through r11.
# ============================================================
.macro POP_ARGS_AND_JUMP
    cmpq $1, 0(%rsp)
    je .Lpop_sse\@
    cmpq $2, 0(%rsp)
    je .Lpop_avx\@
    cmpq $3, 0(%rsp)
    je .Lpop_avx512\@
    ud2

.Lpop_sse\@:
    movdqu   8(%rsp), %xmm0
    movdqu  24(%rsp), %xmm1
    movdqu  40(%rsp), %xmm2
    movdqu  56(%rsp), %xmm3
    movdqu  72(%rsp), %xmm4
    movdqu  88(%rsp), %xmm5
    movdqu 104(%rsp), %xmm6
    movdqu 120(%rsp), %xmm7
    mov 144(%rsp), %rax
    mov 152(%rsp), %rdi
    mov 160(%rsp), %rsi
    mov 168(%rsp), %rdx
    mov 176(%rsp), %rcx
    mov 184(%rsp), %r8
    mov 192(%rsp), %r9
    add $200, %rsp
    jmp *%r11

.Lpop_avx\@:
    vmovdqu   8(%rsp), %ymm0
    vmovdqu  40(%rsp), %ymm1
    vmovdqu  72(%rsp), %ymm2
    vmovdqu 104(%rsp), %ymm3
    vmovdqu 136(%rsp), %ymm4
    vmovdqu 168(%rsp), %ymm5
    vmovdqu 200(%rsp), %ymm6
    vmovdqu 232(%rsp), %ymm7
    mov 272(%rsp), %rax
    mov 280(%rsp), %rdi
    mov 288(%rsp), %rsi
    mov 296(%rsp), %rdx
    mov 304(%rsp), %rcx
    mov 312(%rsp), %r8
    mov 320(%rsp), %r9
    add $328, %rsp
    jmp *%r11

.Lpop_avx512\@:
    vmovdqu64   8(%rsp), %zmm0
    vmovdqu64  72(%rsp), %zmm1
    vmovdqu64 136(%rsp), %zmm2
    vmovdqu64 200(%rsp), %zmm3
    vmovdqu64 264(%rsp), %zmm4
    vmovdqu64 328(%rsp), %zmm5
    vmovdqu64 392(%rsp), %zmm6
    vmovdqu64 456(%rsp), %zmm7
    kmovq 520(%rsp), %k0
    kmovq 528(%rsp), %k1
    kmovq 536(%rsp), %k2
    kmovq 544(%rsp), %k3
    kmovq 552(%rsp), %k4
    kmovq 560(%rsp), %k5
    kmovq 568(%rsp), %k6
    kmovq 576(%rsp), %k7
    mov 592(%rsp), %rax
    mov 600(%rsp), %rdi
    mov 608(%rsp), %rsi
    mov 616(%rsp), %rdx
    mov 624(%rsp), %rcx
    mov 632(%rsp), %r8
    mov 640(%rsp), %r9
    add $648, %rsp
    jmp *%r11
.endm

# ============================================================
# Public stub
#
# Microsoft x64 requires 32 bytes of shadow space for the call.
# Reserving it here is harmless on SysV if alignment stays correct.
# ============================================================
Logger:
    PUSH_ARGS_AND_MODE
    sub $32, %rsp
    call __core_Logger
    add $32, %rsp
    mov %rax, %r11
    POP_ARGS_AND_JUMP

So I think I've got a decent plan here for the implementation. This will successfully intercept any function call on x86_64 (even if binaries are mixing calling conventions).

--
Std-Proposals mailing list
Std-Proposals@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals