Thank you for this information; I hadn't come across it before.
Unfortunately, things keep changing and processor design is not my main focus.
I have to concentrate on other areas and, in my spare time, catch up on how processors have moved on
since the two weeks of VHDL processor design we did at university in 2007. I also did assembly, which is not all that difficult; it just
requires a lot of careful manual reading of every specific flag and detail to ensure a 100% correct solution. Let's just say I did pretty well in that area.

As the article says, execution was limited to around 5 micro-ops per cycle back then (whatever it may be now), plus two kinds of optimization: compiler and processor.
I also see that micro-op handling comes with predefined determinations: only combinations of micro-operations that have been predetermined are supported.
My thought is that `masync` or `mparallel` execution-block constraints in a high-level language, backed by a list of the micro-operation combinations supported at that level,
could allow an IDE to limit, suggest, and validate possible new candidate patterns against the existing predefined patterns used for compiler and processor code optimization.
Someone might then spot, for a particular piece of an algorithm, a trade-off that neither the compiler nor the processor has an optimization for, but that a human does.
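To make the idea concrete, here is a sketch of what such a block might look like. This is purely hypothetical syntax — `masync` is not real C/C++, and the semantics described in the comments are only my assumption of how the constraint would work:

```c
/* Hypothetical "masync" block -- NOT real C/C++, just a sketch of the idea.
 * The programmer asserts the statements inside have no data dependencies
 * on each other, so an IDE could validate the assertion and map the block
 * onto whichever micro-op pairings the target CPU actually supports. */
masync {
    a = x * y;   /* independent of the lines below */
    b = p + q;   /* independent */
    c = m ^ n;   /* independent */
}   /* implicit sync point: a, b, and c are all available here */
```

The IDE's role would be to flag any statement in the block that actually does depend on another, and to suggest reorderings that match known-good patterns.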

Maybe you could hear my argument out here. C/C++ is supposed to be high level, yet as close to the hardware as possible, as I currently understand it.
However, the language doesn't allow a modern IDE to expose lower-level processor design at a high level, where a human might find optimizations that haven't been thought of yet.
If an IDE supported this concept, then at compilation time there could be a crowd-sourced set of solutions to different patterns and algorithms,
which could be analysed and later folded into compilers as automatic optimization patterns that the individual researchers studying these patterns may not have seen yet.
Out of the 8 billion people in the world, there may be a few who can take advantage of such exposed functionality.
Statistics from distributed compiled programs could also be sent back to the crowd-sourcing initiative to determine the hit rate of different patterns in real programs, for general-purpose computing optimizations — similar in spirit to today's profile-guided optimization, but shared across a community rather than kept per project.

I know there is pipelining and all the rest. My one thought on exposing `masync{}` around some code is that a modern processor with a pre-processing pipeline for its instruction stream
could use the constraint to process instructions faster and parallelize them as much as possible across registers: instruction, input, output -> result branch.
It is almost like converting the block directly to micro-operations, but it sits in a dubious middle ground.
Think of it this way: if I had 16 x 64-bit data-store registers and 8 selector buses, I could run any combination of operations that uses those 8 selector buses, provided instruction control logic
is available to parallelize the operations. Say the control logic can operate on all 8 buses at once, or on a subset with a stride; then I could run 4 comparison units using 2 buses per unit,
driving all 8 buses and 4 instructions in parallel. All 8 buses would be wired to all 16 data-store registers, for concurrency.
Maybe the processor does this parallelization all the time and just has fixed sync points injected as an instruction, or the opposite.

Anyway, it was just an idea. I need to find the right deep dive into the ARM instruction set for myself now, which I can see at a glance is now quite different in some ways.

I hope this maybe opens you up to alternative ways of thinking about how one could go about things, and to crowd-sourcing more optimizations.

Kind Regards,

Wesley Oliver

On Tue, Jan 19, 2021 at 6:51 PM Thiago Macieira via Std-Proposals <std-proposals@lists.isocpp.org> wrote:
On Monday, 18 January 2021 08:59:09 PST Wesley Oliver via Std-Proposals wrote:
> Only way I see that happen in the future is if a processor can execute two
> instructions simultaneous, involving different registers.

They do that. Intel processors for the past 10 years have been able to execute
up to 5 instructions (micro-ops) per cycle. That's the theoretical maximum and
won't happen all the time, but it's fairly common to see code execute 2 or 3
instructions per cycle. I believe this is common on AArch64 systems too but I
don't have direct experience.

This is especially true of loop-overhead instructions, since more often than
not they don't depend on the data itself. Range checking certainly qualifies
for this. In fact, because the processor runs pipelined, it will have
concluded the loop has terminated or not terminated 10 to 100 cycles before
the instructions doing the actual work have reached there.

And this is even more true of short loops like yours because the entire work
fits in the CPU structures designed to detect loops and execute them REALLY
fast. See https://www.anandtech.com/show/2594/4 (and note how this article is
from 2008!).

So please study the state of the art first. Your initiative to improve things
is commendable, but if you propose things that were done 15 years ago, you'll
be wasting your time.

[For the nitpickers: yes, I know LSD was turned off in some microarchitectures
because it had problems. That doesn't invalidate the point.]

Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Std-Proposals mailing list

Skype: wezley_oliver
MSN messenger: wesley.olis@gmail.com